Knowledge acquisition is the bottleneck that limits the performance of corpus-based word sense disambiguation (WSD). Automatic corpus annotation with equivalent pseudo-words (EPs) has in recent years proved an effective way to address this problem. An EP is a word that is substituted for the ambiguous word in order to retrieve disambiguation instances from a corpus. However, some of the pseudo-samples collected through EPs are of low quality, and EPs cannot be determined for ambiguous words that have few or no monosemous synonyms. This paper therefore proposes a WSD method that combines pseudo-samples obtained through EPs with manually annotated samples. The method computes the sentence similarity between each pseudo-sample and the context of the ambiguous word and removes the low-quality pseudo-samples. It also uses a manually annotated corpus to provide samples for ambiguous words that have no EPs and to estimate the probability distribution over senses. In experiments on the Senseval-3 Chinese lexical sample task, the method achieves an average F-measure of 0.79.
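The filtering and sense-distribution steps can be illustrated with a minimal sketch. The abstract does not say which sentence-similarity measure or cut-off is used, so the code below assumes cosine similarity over bag-of-words counts and a hypothetical `threshold`; the function names and the toy English tokens are illustrative only and do not come from the paper.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pseudo_samples(context, pseudo_samples, threshold=0.2):
    """Keep pseudo-samples whose similarity to the context reaches the threshold.

    `pseudo_samples` is a list of (sense_label, tokens) pairs. `threshold`
    is a hypothetical cut-off, not a value taken from the paper.
    """
    return [(sense, tokens) for sense, tokens in pseudo_samples
            if cosine_similarity(context, tokens) >= threshold]


def sense_distribution(tagged_senses):
    """Relative frequency of each sense label in manually tagged samples."""
    counts = Counter(tagged_senses)
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}


# Toy data: the context of an ambiguous word and two pseudo-samples.
context = ["bank", "loan", "interest", "rate"]
pseudo = [
    ("finance", ["loan", "interest", "payment"]),
    ("river", ["water", "shore", "fishing"]),
]
kept = filter_pseudo_samples(context, pseudo)
priors = sense_distribution(["finance", "finance", "river", "finance"])
```

In this toy run only the "finance" pseudo-sample shares tokens with the context, so it is the only one kept, and the sense priors are simple relative frequencies over the manually tagged labels. A real system for Chinese would need word segmentation first and would likely use a richer similarity measure.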