In practical text categorization, classifiers are often trained on coarsely classified data. Such data frequently contain documents assigned to the wrong category, referred to here as noise texts, and using them directly degrades classification accuracy. This paper proposes a network-based algorithm for revising noise texts. First, a document-similarity network (DSN) is constructed. The category labels on the documents are treated as a community structure that partitions the network, and modularity is used to measure the quality of that structure. Optimizing modularity moves noise texts into more appropriate categories, which improves the quality of the data. Experimental results indicate that the algorithm effectively corrects noise in coarsely classified data and is robust. It can be used to preprocess training data for text categorization, or as a supporting technique in tasks such as building document repositories and taxonomies.
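The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: the bag-of-words cosine similarity, the edge threshold of 0.2, the greedy single-document moves, and all function names (`build_network`, `modularity`, `revise_labels`) are assumptions made for illustration. Modularity is computed with the standard weighted Newman definition, Q = Σ_c [ W_c / 2m − (D_c / 2m)² ], where W_c is the (double-counted) edge weight inside category c, D_c is the total weighted degree of c, and 2m is the total weighted degree of the network.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_network(texts, threshold=0.2):
    """Build a weighted document-similarity network as adjacency dicts.

    Assumption: an edge links two documents whose cosine similarity
    reaches `threshold`; the paper may define edges differently.
    """
    vecs = [Counter(t.lower().split()) for t in texts]
    adj = [dict() for _ in vecs]
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            w = cosine(vecs[i], vecs[j])
            if w >= threshold:
                adj[i][j] = w
                adj[j][i] = w
    return adj


def modularity(adj, labels):
    """Weighted Newman modularity of the partition given by `labels`."""
    two_m = sum(sum(nb.values()) for nb in adj)
    if two_m == 0:
        return 0.0
    internal = Counter()  # double-counted edge weight inside each category
    degree = Counter()    # total weighted degree of each category
    for i, nb in enumerate(adj):
        degree[labels[i]] += sum(nb.values())
        for j, w in nb.items():
            if labels[i] == labels[j]:
                internal[labels[i]] += w
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


def revise_labels(adj, labels, max_passes=10):
    """Greedily move single documents to the category that raises modularity.

    Assumption: a simple local search stands in for whatever modularity
    optimization the paper uses. Modularity is recomputed from scratch for
    every candidate move, which is only suitable for small corpora.
    """
    labels = list(labels)
    categories = sorted(set(labels))
    best_q = modularity(adj, labels)
    for _ in range(max_passes):
        moved = False
        for i in range(len(labels)):
            original = labels[i]
            best_label = original
            for c in categories:
                if c == original:
                    continue
                labels[i] = c
                q = modularity(adj, labels)
                if q > best_q + 1e-12:
                    best_q, best_label = q, c
            labels[i] = best_label
            moved = moved or best_label != original
        if not moved:
            break
    return labels, best_q


if __name__ == "__main__":
    # Toy corpus: the last document is a finance text mislabeled as "animal".
    texts = [
        "cat dog pet", "dog pet animal", "cat pet animal",
        "stock market trade", "market trade bank", "stock bank trade",
    ]
    noisy = ["animal", "animal", "animal", "finance", "finance", "animal"]
    network = build_network(texts)
    revised, q = revise_labels(network, noisy)
    print(revised, round(q, 3))
```

On this toy corpus the network splits into two disconnected triangles, so relabeling the last document as "finance" is the only move that increases modularity. A real corpus would need a better text representation (for example TF-IDF weights) and an incremental modularity update instead of full recomputation.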