新术语的提取是中文信息处理领域的一个重要研究课题。针对现有提取方法的不足和很多专业术语表现为字母词语的特点,该文提出了一种综合统计技术和规则筛选的方法:基于长串优先和串频统计的思路进行文本切分,得到共现字符串,利用词语搭配规则进行过滤,经过领域词典及评价函数的筛选,提取出领域新术语。该方法可发现包含字母词语、专业术语等未登录词在内的频率大于等于2的任意长度的专指语义串、短语和词。实验表明了该方法的有效性及新术语的准确率分布特征。
Extraction of new domain-specific terms is one of the important topics in Chinese natural language processing. Aiming at the limitation of the current methods and the specialties of many domain-specific terms are lettered-words, a novel approach combined with statistic technique and rule is proposed to extract new special semantic strings. Co-occurrence of character strings is formed by text segmentation based on matching longer strings first combined with frequency statistics. No-meaningful character strings are trimmed by collocation rules. Filtered by domain lexicon and membership degree, new domain-specific terms are extracted finally. This method can extract new special semantic strings, phrases and words, including unknown words like lettered-words and domain-specific terms, their frequency is larger than 2. Experiments show that this extraction technique is effective and indicate new domain-specific terms' distribution characteristic of precision ratio.