Feature dimensionality reduction is one of the main problems in text classification. Features are first selected using the χ² (chi-square) statistic, and the selected features are then clustered with an improved density-based clustering method. Using class distribution information, the feature dimensionality is thus reduced twice while keeping information loss as small as possible. Each text is then represented as a matrix based on the class probability distribution, and texts are classified using matrix theory, specifically matrix norms. Experimental results show that the method achieves relatively high classification precision.
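The first reduction step, chi-square feature selection, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `chi_square` and `select_features`, the toy corpus, and the choice to score each term by its maximum chi-square value over all classes are hypothetical, since the abstract does not say how per-class scores are combined.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A term present in every document (or in none) carries no class signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    docs: list of token lists; labels: class label of each document.
    Assumption: a term's score is its maximum chi-square over all classes.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_sets = [set(doc) for doc in docs]
    # Document frequency of each term, overall and per class.
    term_df = Counter(t for s in doc_sets for t in s)
    term_class_df = Counter((t, y) for s, y in zip(doc_sets, labels) for t in s)

    scores = {}
    for term, df in term_df.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = term_class_df[(term, cls)]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: two "sport" and two "finance" documents.
docs = [
    ["ball", "goal", "the"],
    ["ball", "team", "the"],
    ["stock", "market", "the"],
    ["stock", "bank", "the"],
]
labels = ["sport", "sport", "finance", "finance"]
print(select_features(docs, labels, 2))
```

On this toy corpus the terms that occur in every document of exactly one class rank highest, while a term that appears in all documents scores zero and would be discarded. The subsequent density-based clustering of the retained terms and the matrix-norm classifier are not sketched here, because the abstract gives no detail on them.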