Text classification is an important research topic in machine learning, and some progress has been made. Most existing text classification methods are based on the Vector Space Model (VSM), but VSM does not effectively exploit the structural information hidden in text. In addition, the sample space under VSM is often high-dimensional, and relying on a single dimensionality-reduction strategy may discard useful information. To overcome these shortcomings, we propose the Multi-modality-based Random Feature Subspace classifier Ensemble (MMRFSEn), which makes effective use of the structural information in the text (the mean and standard deviation of each word's positions in the document) and builds each base classifier on a randomly selected feature subspace. Experimental results show that the method is effective and feasible.
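The two ideas named in the abstract, positional modalities per word and an ensemble built on random feature subspaces, can be illustrated with a minimal sketch. It is not the authors' implementation: every function and class name below is hypothetical, the nearest-centroid base learner and the majority vote are stand-ins chosen for brevity, and the abstract does not say which base classifier or combination rule MMRFSEn uses.

```python
import random
import statistics
from collections import Counter, defaultdict


def word_modalities(tokens):
    """Map each word to (count, mean position, std of position).

    Positions are scaled to [0, 1] so documents of different length compare.
    """
    positions = defaultdict(list)
    n = max(len(tokens) - 1, 1)
    for i, tok in enumerate(tokens):
        positions[tok].append(i / n)
    return {
        w: (len(p), statistics.fmean(p), statistics.pstdev(p))
        for w, p in positions.items()
    }


def to_vector(tokens, vocab):
    """Concatenate the three modalities for each vocabulary word, in order."""
    feats = word_modalities(tokens)
    vec = []
    for w in vocab:
        vec.extend(feats.get(w, (0, 0.0, 0.0)))
    return vec


class CentroidClassifier:
    """Nearest-centroid learner restricted to a subset of dimensions (assumed base learner)."""

    def __init__(self, dims):
        self.dims = dims
        self.centroids = {}

    def fit(self, X, y):
        groups = defaultdict(list)
        for x, label in zip(X, y):
            groups[label].append([x[d] for d in self.dims])
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, x):
        sub = [x[d] for d in self.dims]
        return min(
            self.centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, self.centroids[c])),
        )


def fit_ensemble(X, y, n_estimators=15, subspace_size=None, seed=0):
    """Train each base classifier on its own randomly drawn feature subspace."""
    rng = random.Random(seed)
    n_dims = len(X[0])
    k = subspace_size or max(1, n_dims // 2)
    return [
        CentroidClassifier(rng.sample(range(n_dims), k)).fit(X, y)
        for _ in range(n_estimators)
    ]


def predict_ensemble(models, x):
    """Combine base predictions by majority vote (an assumed combination rule)."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]
```

In this sketch a document would first be tokenized, turned into a vector with `to_vector` over a fixed vocabulary, and then passed to `fit_ensemble` and `predict_ensemble`. Drawing a fresh subspace per base classifier avoids committing to one dimensionality-reduction step, which is the motivation the abstract gives.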