Mixed Chinese-English terms can serve as auxiliary information for out-of-vocabulary word handling, term weighting, and word disambiguation, and can help improve the quality of Chinese information processing. Based on descending string length and string-frequency statistics, this paper proposes a method for extracting mixed Chinese-English terms. The method requires no dictionary, no prior training on a corpus, and no character index; it relies instead on statistical information to extract the mixed Chinese-English terms whose support is greater than or equal to a threshold. The algorithm effectively extracts newly emerging general-purpose words, specialized terms, and proper nouns from text. Experiments show that the method is not restricted to a particular corpus and extracts mixed Chinese-English terms quickly and accurately.
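The abstract names the main ingredients (descending string length, string-frequency counts, a support threshold) but not the procedure itself. The following Python sketch is one plausible reading of that idea, not the paper's algorithm. The function names, the tokenization rule (each Chinese character and each run of ASCII letters or digits is one token, and any other character ends a segment), the defaults `min_support=2` and `max_len=6`, and the step that masks accepted occurrences before shorter strings are counted are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: one token per CJK character or per run of ASCII letters/digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize(text):
    """Split text into token segments; any non-token character ends a segment."""
    segments, current, pos = [], [], 0
    for m in TOKEN_RE.finditer(text):
        if m.start() != pos and current:
            segments.append(current)
            current = []
        current.append(m.group())
        pos = m.end()
    if current:
        segments.append(current)
    return segments


def is_mixed(gram):
    """True if the token sequence has at least one Latin and one Chinese token."""
    has_latin = any(tok[0].isascii() for tok in gram)
    has_cjk = any(not tok[0].isascii() for tok in gram)
    return has_latin and has_cjk


def extract_mixed_terms(text, min_support=2, max_len=6):
    """Return {term: frequency} for mixed terms with frequency >= min_support.

    Candidate lengths are tried from longest to shortest. Occurrences of an
    accepted term are cut out of the text before shorter candidates are
    counted, so substrings of an accepted term are not reported again
    (a hypothetical reading of "length descending").
    """
    segments = tokenize(text)
    terms = {}
    for n in range(max_len, 1, -1):
        counts = Counter(
            tuple(seg[i:i + n])
            for seg in segments
            for i in range(len(seg) - n + 1)
        )
        accepted = {g for g, c in counts.items()
                    if c >= min_support and is_mixed(g)}
        if not accepted:
            continue
        for gram in accepted:
            terms["".join(gram)] = counts[gram]
        # Mask accepted occurrences by splitting the segments around them.
        remaining = []
        for seg in segments:
            i = start = 0
            while i <= len(seg) - n:
                if tuple(seg[i:i + n]) in accepted:
                    if i > start:
                        remaining.append(seg[start:i])
                    i += n
                    start = i
                else:
                    i += 1
            if start < len(seg):
                remaining.append(seg[start:])
        segments = remaining
    return terms


sample = "本文介绍Web服务的设计。Web服务是一种分布式技术,许多系统依赖Web服务。"
print(extract_mixed_terms(sample))  # {'Web服务': 3}
```

In this sketch no dictionary, training step, or character index is used; only n-gram counts drive the result. Because longer candidates are examined first, a repeated longer string is accepted before its substrings, which are then counted only where they occur outside it.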