To address the "curse of dimensionality" caused by the high-dimensional feature space in text classification, a text classification model based on manifold learning was proposed. The model uses manifold learning algorithms to reduce the dimensionality of the high-dimensional text feature data and then classifies the resulting low-dimensional features. In addition, to address the similarity deviation problem of cosine similarity, a new text similarity measure, the item word intersection distance, was presented. In essence, it computes the intersection of the feature words (item words) contained in two documents, and it was used as the criterion for selecting neighborhoods in the manifold learning algorithms. Experimental results showed that the item word intersection distance described the similarity between documents well. Reducing the dimensionality of the text data with manifold learning algorithms based on the item word intersection before classification greatly improved the execution efficiency of the classification algorithms while preserving classification accuracy. It also overcame the loss of classification accuracy that arises when neighborhoods are selected with the Euclidean distance or cosine similarity, which distorts the low-dimensional manifold.
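As a rough illustration of how an intersection-based measure could drive neighborhood selection, the following minimal Python sketch may help. The abstract does not give the formula for the item word intersection distance, so the normalization used here (one minus the shared-word count divided by the size of the smaller document) is purely an assumption, and the names `intersection_distance` and `k_nearest_neighbors` are hypothetical. It is not the authors' implementation.

```python
from typing import Dict, List, Set


def intersection_distance(doc_a: Set[str], doc_b: Set[str]) -> float:
    """Hypothetical distance: 1 - |A ∩ B| / min(|A|, |B|).

    The normalization is an assumption; the abstract only states that the
    measure is based on the intersection of the two documents' feature words.
    """
    if not doc_a or not doc_b:
        return 1.0  # no feature words: treat the pair as maximally distant
    shared = len(doc_a & doc_b)
    return 1.0 - shared / min(len(doc_a), len(doc_b))


def k_nearest_neighbors(docs: List[Set[str]], k: int) -> Dict[int, List[int]]:
    """For each document index, return the indices of its k closest documents."""
    neighbors: Dict[int, List[int]] = {}
    for i, doc in enumerate(docs):
        others = [j for j in range(len(docs)) if j != i]
        # Sort by distance; ties are broken by document index for determinism.
        others.sort(key=lambda j: (intersection_distance(doc, docs[j]), j))
        neighbors[i] = others[:k]
    return neighbors


# Toy documents represented as sets of feature words (made-up data).
docs = [
    {"manifold", "learning", "text"},
    {"manifold", "learning", "kernel"},
    {"football", "league", "goal"},
]
print(k_nearest_neighbors(docs, k=1))
```

In a manifold learning algorithm that builds a neighborhood graph, a neighbor table like this would stand in for the one normally computed from Euclidean distance or cosine similarity; the subsequent embedding and classification steps are outside the scope of this sketch.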