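This paper studies the extraction of sentence-level entity affiliation relations in Thai using a method based on the maximum entropy model. To address the scarcity of corpora for entity relation extraction in Thai sentences, Chinese-Thai bilingual parallel sentence pairs are first used as an intermediate bridge: relatively mature results from Chinese language processing, such as word segmentation, part-of-speech tagging, and entity recognition, are mapped through a Chinese-Thai bilingual dictionary onto the Thai sentences aligned with the Chinese sentences. The Thai sentences then undergo the necessary data processing, together with a certain amount of manual correction and manual entity relation annotation, to build a basic Thai entity relation training corpus. On the basis of this corpus, Thai entity relation extraction is cast as a classification problem. Taking the characteristics of the Thai language into account, suitable context feature templates are selected, and the maximum entropy algorithm is trained on the training corpus to build a classifier that identifies candidate entity relation triples in Thai sentences, thereby achieving automatic extraction of affiliation relations between entities. Experimental results show that the method improves the F-measure by about 8% over existing Thai entity relation extraction methods.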
This paper aims to extract affiliation relations between entities in Thai sentences. An approach to sentence-level affiliation relation extraction for Thai, based on the maximum entropy model, is proposed. To address the shortage of corpora for entity relation extraction in Thai, Chinese-Thai parallel sentence pairs are used as an intermediate bridge: comparatively mature results from Chinese language processing, including word segmentation, POS tagging, and entity recognition, are mapped onto the Thai sentences aligned with the Chinese sentences with the help of a Chinese-Thai bilingual dictionary. We then apply several data processing steps to the Thai sentences, make manual corrections where needed, and manually label the entity relation samples. In this way, a basic training corpus for Thai entity relation extraction is built. On the basis of this corpus, we treat entity relation extraction as a classification task. Taking the particular characteristics of the Thai language into account, suitable context feature templates are selected and used to train a maximum entropy classifier. The classifier then labels the candidate entity relation triples in Thai sentences, which accomplishes the automatic extraction of affiliation relations between entities. The experiments show that the proposed approach improves the F-measure by approximately 8% compared with existing methods for Thai entity relation extraction.
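To make the classification step concrete, the following is a minimal sketch of a maximum entropy (multinomial logistic) classifier over context features of a candidate entity pair. It is not the paper's implementation: the feature template in `extract_features`, the label names `AFFILIATION` and `NONE`, the hyperparameters, and the romanized placeholder tokens in the toy data are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import defaultdict

def extract_features(tokens, e1, e2):
    """Hypothetical context feature template for a candidate entity pair.

    tokens: list of (word, tag) pairs for one segmented sentence.
    e1, e2: token indices of the two candidate entities, with e1 < e2.
    """
    feats = [
        "e1_word=" + tokens[e1][0],
        "e2_word=" + tokens[e2][0],
        "e1_pos=" + tokens[e1][1],
        "e2_pos=" + tokens[e2][1],
        "distance=" + str(min(e2 - e1, 5)),  # capped token distance
    ]
    # Words between the two entities often signal the relation.
    for word, _ in tokens[e1 + 1:e2]:
        feats.append("between=" + word)
    return feats

class MaxEnt:
    """Conditional maximum entropy model trained by stochastic gradient ascent."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.w = defaultdict(float)  # one weight per (feature, label) pair

    def probs(self, feats):
        # p(y | x) is proportional to exp(sum of weights of active features for y).
        scores = [sum(self.w[(f, y)] for f in feats) for y in self.labels]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    def train(self, data, epochs=200, lr=0.5, l2=0.01):
        for _ in range(epochs):
            for feats, gold in data:
                p = self.probs(feats)
                for y, py in zip(self.labels, p):
                    # Gradient of the log-likelihood: observed minus expected count.
                    grad = (1.0 if y == gold else 0.0) - py
                    for f in feats:
                        self.w[(f, y)] += lr * (grad - l2 * self.w[(f, y)])

    def predict(self, feats):
        p = self.probs(feats)
        return self.labels[max(range(len(p)), key=p.__getitem__)]

# Toy training samples (placeholders, not data from the paper).
sentences = [
    ([("Somchai", "PER"), ("works_at", "V"), ("BankA", "ORG")], "AFFILIATION"),
    ([("Malee", "PER"), ("joined", "V"), ("UnivB", "ORG")], "AFFILIATION"),
    ([("Somchai", "PER"), ("visited", "V"), ("Bangkok", "LOC")], "NONE"),
    ([("Malee", "PER"), ("met", "V"), ("Somchai", "PER")], "NONE"),
]
train = [(extract_features(toks, 0, 2), label) for toks, label in sentences]

model = MaxEnt(["AFFILIATION", "NONE"])
model.train(train)

candidate = [("Anan", "PER"), ("works_at", "V"), ("UnivB", "ORG")]
print(model.predict(extract_features(candidate, 0, 2)))
```

In this sketch each candidate triple (entity 1, entity 2, relation label) is reduced to a bag of binary features, so adapting it to real Thai data would mainly mean replacing `extract_features` with templates suited to the language and feeding it the projected and manually corrected annotations.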