Previous work on collocation extraction treats a collocation as a simple juxtaposition of lexical items and ignores the preference that one item has for the other. To address this problem, this paper proposes a statistical model of collocation preference based on relative conditional entropy, which measures how strongly a headword depends on the words that co-occur with it in context. Linguistic heuristic rules are then added: a part-of-speech filter and a sliding window are used to identify collocation boundaries. Together these form a collocation extraction method for open corpora. The method is highly interpretable and effectively reveals the internal mechanism by which collocations are formed. It is also proved that collocation preference strength can be interpreted as mutual information corrected for direction.
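As a rough illustration of the kind of pipeline summarized above, the following Python sketch counts headword/collocate pairs inside a sliding window, keeps only pairs whose part-of-speech tags pass a filter, and scores each pair with a direction-sensitive measure. The abstract does not state the model's formula, so the score used here is an assumption chosen only for illustration: pointwise mutual information divided by the self-information of the headword, which makes the score of (headword, collocate) differ from that of (collocate, headword). The tag set, window size, and all function names are likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical POS patterns (headword tag, collocate tag) that pass the filter.
ALLOWED_PATTERNS = {("V", "N"), ("N", "V"), ("A", "N"), ("N", "A")}


def count_pairs(sentences, window=2, allowed=ALLOWED_PATTERNS):
    """Count (headword, collocate) pairs within +/- `window` tokens.

    `sentences` is a list of sentences, each a list of (word, pos) tuples.
    Only pairs whose (headword POS, collocate POS) is in `allowed` are kept.
    """
    pair_counts = Counter()
    for sent in sentences:
        for i, (head, head_pos) in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                coll, coll_pos = sent[j]
                if (head_pos, coll_pos) in allowed:
                    pair_counts[(head, coll)] += 1
    return pair_counts


def preference(head, coll, pair_counts):
    """Illustrative directional score (an assumption, not the paper's formula).

    Returns PMI(head, coll) / -log P(head), i.e. the relative drop in the
    surprisal of the headword once the collocate has been observed.
    """
    total = sum(pair_counts.values())
    joint = pair_counts[(head, coll)]
    if total == 0 or joint == 0:
        return 0.0
    # Marginals are taken from the same pair table so the estimates are consistent.
    head_total = sum(n for (h, _), n in pair_counts.items() if h == head)
    coll_total = sum(n for (_, c), n in pair_counts.items() if c == coll)
    p_joint = joint / total
    p_head = head_total / total
    p_coll = coll_total / total
    if p_head == 1.0:
        return 0.0  # headword carries no information; avoid division by zero
    pmi = math.log(p_joint / (p_head * p_coll))
    return pmi / -math.log(p_head)
```

On a toy corpus in which "decision" only ever appears next to "make" while "make" also appears with other nouns, `preference("make", "decision", pairs)` comes out higher than `preference("decision", "make", pairs)`, which is the asymmetry a symmetric mutual information score would not capture.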