目前大多数自动标引方法不能有效利用文本中包含的多个特征。而支持向量机、条件随机场模型等统计机器学习模型能够有效利用文本包含的多种特征进行关键词提取。同时,由于各种自动标引模型性能各异,综合利用各种模型进行集成学习方式的自动标引,能够提高自动标引的质量。为了进一步提高自动标引的质量,本文试图整合统计机器学习模型与集成学习方法的优势,对文档进行基于多分类模型综合投票方式的自动标引。实验结果表明基于集成学习方法的自动标引能提高标引结果的查准率和召回率。另外,集成学习标引模型中,基分类器加权的标引结果,优于基分类器未加权的标引结果。
Most current automatic indexing methods cannot make effective use of the multiple features contained in a text. Statistical machine learning models such as support vector machines and conditional random fields, by contrast, can exploit multiple text features effectively for keyword extraction. In addition, because individual automatic indexing models differ in performance, combining them through ensemble learning can improve indexing quality. To improve indexing quality further, this paper integrates statistical machine learning models with ensemble learning and indexes documents by combined voting over multiple classification models. Experimental results show that indexing based on ensemble learning improves the precision and recall of the indexing results. Moreover, within the ensemble indexing model, weighted voting over the base classifiers outperforms unweighted voting.
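To make the difference between weighted and unweighted voting concrete, the following is a minimal sketch, not the implementation used in this paper. It assumes each base model outputs a set of candidate index terms and that a term is kept when the weight of the models voting for it exceeds a fixed share of the total weight. The function name `weighted_vote`, the `threshold` parameter, the example terms, and the weights are all hypothetical; the abstract does not state how the weights are derived.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None, threshold=0.5):
    """Combine the term sets proposed by several base models.

    predictions: list of sets, one set of candidate index terms per model.
    weights:     one non-negative weight per model (hypothetical; for
                 example, it could reflect each model's validation score).
                 None means every model counts equally (unweighted voting).
    threshold:   a term is kept if its share of the total weight is
                 strictly greater than this value (an assumed rule).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per base model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Accumulate, for each term, the weight of the models that proposed it.
    scores = defaultdict(float)
    for terms, weight in zip(predictions, weights):
        for term in terms:
            scores[term] += weight

    return {term for term, score in scores.items() if score / total > threshold}


# Hypothetical outputs of three base models for one document.
model_a = {"indexing", "ensemble", "voting"}
model_b = {"indexing", "ensemble"}
model_c = {"voting", "paper"}

unweighted = weighted_vote([model_a, model_b, model_c])
weighted = weighted_vote([model_a, model_b, model_c], weights=[3.0, 5.0, 1.0])
```

With equal weights, "voting" is kept because two of the three models propose it. Once the second model is given the largest weight, the same term falls below the threshold, which shows how weighting lets a more reliable base classifier override a simple majority.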