As a lexical phenomenon, word collocation has important applications in many areas of natural language processing. This paper compares and analyzes four measures of word association and three measures of word structure distribution, and proposes a method for acquiring collocations that combines mutual information and entropy. The experimental results show that, when co-occurrence frequency is high, the four association measures (mutual information, the Cosine coefficient, the χ² test, and the likelihood ratio test) are roughly equally effective at identifying collocations, and that entropy is superior to variance and spread for measuring word structure distribution. The proposed method relies on fewer measures, its thresholds are easy to select, and it performs comparably to existing methods.
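To make the two measures named in the abstract concrete, here is a minimal Python sketch, not the paper's implementation. It assumes mutual information is estimated pointwise from raw counts and that entropy is taken over the distribution of relative positions (offsets) at which the collocate appears. The function names, the rule for combining the two scores, the direction of the entropy test, and both threshold values are hypothetical choices made for illustration.

```python
import math


def pointwise_mutual_information(pair_count, count_x, count_y, total):
    """Estimate log2(P(x, y) / (P(x) * P(y))) from raw corpus counts."""
    return math.log2((pair_count * total) / (count_x * count_y))


def offset_entropy(offset_counts):
    """Shannon entropy (in bits) of the relative-position distribution.

    offset_counts maps an offset (e.g. -1, +1, +2) to how often the
    collocate was seen at that offset from the headword.
    """
    total = sum(offset_counts.values())
    return -sum(
        (c / total) * math.log2(c / total)
        for c in offset_counts.values()
        if c > 0
    )


def is_collocation(pair_count, count_x, count_y, total, offset_counts,
                   mi_threshold=3.0, entropy_threshold=1.5):
    """Hypothetical decision rule: strong association AND a peaked
    positional distribution. Thresholds are illustrative assumptions."""
    mi = pointwise_mutual_information(pair_count, count_x, count_y, total)
    h = offset_entropy(offset_counts)
    return mi >= mi_threshold and h <= entropy_threshold
```

Under these assumptions, a pair that co-occurs far more often than chance and almost always at the same offset is accepted, while a pair with the same association score but positions scattered evenly across the window is rejected.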