词义知识获取是词义知识库建设、词义消歧等任务的基础和起点,目前该工作基本依赖人类专家的智慧和洞察力,在大规模文本处理上缺乏意义计算的客观性和一致性。该文以汉语的中高频形容词为样本,深入挖掘词义特征并采用有参数初始化过程的EM迭代算法,实现了从真实文本中自动发现并区分词语词义的过程。该词义区分算法选取易获取的词形特征、基于大规模语料的搭配特征、基于网络语料的属性—宿主关系特征,替代以往难以获取的句法结构特征,并进一步利用HowNet优化了词形特征的选择。该工作可以应用于信息检索等领域,能够对现有词典起到修改和补充的作用,该思路亦可扩展到其他汉语词类上去。
Lexieal knowledge acquisition is the bottleneck for many tasks like word sense disambiguation, lexieal knowledge base construction et al. This paper introduces an automatic word sense discrimination method for Chinese mid-high-frequency adjectives. We employ the EM algorithm and exploit the features of Chinese character, contextual bag-of-words and host-attribute pair instead of the more unreliable syntactic information. We further optimize the morphology selection by utilizing HowNet in our work. The experimental results show that word sense discrimination results are different from Chinese lexicons and could be used for lexicon modification and expansion even for other type of Chinese words.