Two main problems hinder the recognition of maximal-length noun phrases (MNPs) in Chinese patent literature: the lack of annotated corpora and the difficulty of identifying verbs and nouns. To address these problems, this paper presents an approach to Chinese MNP identification based on Markov logic networks. Rather than recognizing the open class of noun phrases directly, the approach concentrates on identifying the relatively closed class of MNP boundary markers. It uses three categories of features to locate MNP boundaries: word features from the sentence itself, domain-transfer features from treebanks, and bilingual alignment features from patent abstracts. The experimental results show that the bilingual features notably improve the identification of MNP boundary markers such as verbs, prepositions, and conjunctions. The F-score for MNP identification reaches 83.27%.