To extract product attribute words from Chinese online reviews, this paper proposes a semi-supervised learning method based on a mutual self-expanding pattern. With little manual effort, the method mines frequent itemsets with the FP-Growth algorithm to obtain seed attribute words, then discovers new attribute words through incremental iteration. In each iteration, confidence scores computed for the extracted words and the extraction patterns maintain precision and prevent topic drift. Finally, the method obtains compound extracted words through similar extraction patterns, which greatly reduces attribute-mining errors caused by word segmentation and part-of-speech tagging mistakes, and trades a small loss in precision for a higher recall. Experimental results show that the F-score of the proposed method for product attribute extraction can reach 78.97%, better than that of the other similar extraction algorithms compared.
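The iteration between attribute words and extraction patterns can be illustrated with a minimal sketch. It is not the paper's algorithm: the confidence formulas, the threshold `min_conf`, the function name `bootstrap`, and the toy (pattern, word) pairs are all illustrative assumptions, and the seed words are passed in directly instead of being mined with FP-Growth.

```python
from collections import defaultdict


def bootstrap(occurrences, seeds, min_conf=0.5, max_rounds=10):
    """Grow a set of attribute words from seeds (illustrative sketch only).

    occurrences: iterable of (pattern, word) pairs observed in the reviews.
    seeds: initial attribute words (the paper obtains these via FP-Growth).
    min_conf: hypothetical confidence threshold for patterns and words.
    """
    words_by_pattern = defaultdict(set)
    patterns_by_word = defaultdict(set)
    for pattern, word in occurrences:
        words_by_pattern[pattern].add(word)
        patterns_by_word[word].add(pattern)

    accepted_words = set(seeds)
    accepted_patterns = set()
    for _ in range(max_rounds):
        # Assumed pattern confidence: share of the words a pattern
        # extracts that are already accepted attribute words.
        new_patterns = {
            p for p, ws in words_by_pattern.items()
            if len(ws & accepted_words) / len(ws) >= min_conf
        }
        # Assumed word confidence: share of the patterns extracting
        # a word that passed the pattern threshold in this round.
        new_words = {
            w for w, ps in patterns_by_word.items()
            if len(ps & new_patterns) / len(ps) >= min_conf
        }
        # Stop once a round adds neither a pattern nor a word.
        if new_patterns <= accepted_patterns and new_words <= accepted_words:
            break
        accepted_patterns |= new_patterns
        accepted_words |= new_words
    return accepted_words, accepted_patterns


# Hypothetical toy data: "X" marks the slot filled by the candidate word.
occurrences = [
    ("the X is clear", "screen"),
    ("the X is clear", "display"),
    ("X lasts long", "battery"),
    ("X lasts long", "charge"),
    ("I bought the X", "screen"),
    ("I bought the X", "phone"),
    ("I bought the X", "gift"),
]
words, patterns = bootstrap(occurrences, seeds={"screen", "battery"})
```

On this toy data, the low-confidence pattern "I bought the X" is rejected, so "phone" and "gift" never enter the attribute set, while "display" and "charge" are picked up through the patterns they share with the seeds. Filtering both sides by confidence in every round is what keeps the expansion from drifting off topic in this sketch.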