Effective feature selection is a prerequisite for text mining. Conventional feature selection methods have limited effect on dimensionality reduction and text representation, and because they rely on document class labels they are not suitable for unsupervised text clustering. To address this, this paper introduces the concept of term attributes and proposes a feature selection algorithm suited to text clustering. First, a term feature model is built from four attributes: term frequency, document frequency, term position, and inter-term correlation. The paper focuses on how the weights of the term position and inter-term correlation attributes are calculated, and uses an improved Apriori algorithm to compute the correlation weight. Then, an improved K-means algorithm clusters the term feature model over multiple rounds to complete feature selection. Experimental results show that, compared with traditional feature selection algorithms, the proposed algorithm achieves a good dimensionality reduction rate while improving the text representation capability of the selected feature terms, and that it is well suited to text clustering tasks.
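The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives neither the weight formulas nor the modifications to Apriori and K-means, so every formula here is an assumption chosen for illustration. The position weight is taken to be the mean of 1/(1 + index of first occurrence), the correlation weight counts frequent co-occurring term pairs (a stand-in for Apriori 2-itemsets), and clustering uses plain Lloyd's K-means run once rather than over multiple rounds. The names `term_attributes`, `normalize`, `kmeans`, `select_features`, and `min_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def term_attributes(docs, min_support=2):
    """Return {term: [tf, df, position, correlation]} for tokenized documents.

    All four weights are illustrative assumptions, not the paper's formulas.
    """
    tf, df, pos, pair_support = Counter(), Counter(), Counter(), Counter()
    for doc in docs:
        tf.update(doc)
        first = {}
        for i, term in enumerate(doc):
            first.setdefault(term, i)  # index of first occurrence in this doc
        for term, i in first.items():
            df[term] += 1
            pos[term] += 1.0 / (1 + i)  # earlier terms score higher
        # Count each unordered pair of distinct terms once per document.
        pair_support.update(combinations(sorted(first), 2))
    corr = Counter()
    for (a, b), support in pair_support.items():
        if support >= min_support:  # keep only frequent pairs
            corr[a] += 1
            corr[b] += 1
    return {t: [tf[t], df[t], pos[t] / df[t], corr[t]] for t in tf}


def normalize(vectors):
    """Scale each attribute by its maximum so no attribute dominates distances."""
    dims = len(next(iter(vectors.values())))
    hi = [max(v[d] for v in vectors.values()) or 1 for d in range(dims)]
    return {t: [v[d] / hi[d] for d in range(dims)] for t, v in vectors.items()}


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means with a deterministic start (not the improved variant)."""
    names = sorted(points, key=lambda t: (sum(points[t]), t))
    # Spread the initial centroids across terms ranked by total weight.
    centroids = [points[names[i * (len(names) - 1) // max(k - 1, 1)]][:]
                 for i in range(k)]
    assign = {}
    for _ in range(iters):
        assign = {t: min(range(k),
                         key=lambda c: math.dist(points[t], centroids[c]))
                  for t in names}
        for c in range(k):
            members = [points[t] for t in names if assign[t] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def select_features(docs, k=2):
    """Keep the terms in the cluster whose centroid has the largest total weight."""
    points = normalize(term_attributes(docs))
    assign, centroids = kmeans(points, k)
    best = max(range(k), key=lambda c: sum(centroids[c]))
    return sorted(t for t, c in assign.items() if c == best)


if __name__ == "__main__":
    corpus = [
        ["cluster", "text", "feature", "the"],
        ["cluster", "text", "noise"],
        ["cluster", "feature", "misc"],
    ]
    print(select_features(corpus))
```

No class labels are used at any point, which is the property that makes this style of selection compatible with unsupervised clustering. Keeping only the highest-weight cluster is one possible selection rule; the number of clusters `k` and the support threshold would need tuning on a real corpus.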