Mainstream KNN-based text classification strategies suit automatic classification with large sample sets, but they suffer from high time complexity and from information loss caused by feature dimensionality reduction and sample pruning. To address these problems, this paper proposes a classification algorithm based on feature library projection (FLP). The algorithm first builds a feature library for each class from the features of all training samples under a given weighting strategy, so that the library retains the feature information of every sample. Then, through a projection function, each class's feature library is mapped to a projection sample according to the feature set of the sample to be classified, and classification is completed by computing the similarity between the new sample and the projection sample of each class. The algorithm was validated on the corpus compiled by the Natural Language Processing Group of the International Database Center at Fudan University, under two scenarios (a small training set and a large training set), and compared with a clustering-based KNN algorithm. The results show that the FLP algorithm does not lose classification features and achieves relatively high classification accuracy; its classification efficiency is not directly tied to growth in sample size, and its time complexity is low.
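The three steps named in the abstract (building per-class feature libraries, projecting each library onto the features of the sample to be classified, and comparing by similarity) can be illustrated with a minimal sketch. The abstract does not specify the weighting strategy, the projection function, or the similarity measure, so everything concrete here is an assumption made for illustration: weights are relative term frequencies within a class, projection simply looks up the library weight of each feature in the new sample, and similarity is a dot product. The function names `build_feature_libraries`, `project`, and `classify` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def build_feature_libraries(training_docs):
    """Build one feature library per class from (label, tokens) pairs.

    Assumed weighting strategy: relative frequency of a feature within
    its class. The paper's actual weighting may differ.
    """
    counts = defaultdict(Counter)
    for label, tokens in training_docs:
        counts[label].update(tokens)
    libraries = {}
    for label, feature_counts in counts.items():
        total = sum(feature_counts.values())
        libraries[label] = {f: c / total for f, c in feature_counts.items()}
    return libraries


def project(library, sample_features):
    """Assumed projection: keep the library weight of each sample feature.

    Features absent from the library get weight 0, so the projection
    sample lives in the same feature space as the new sample.
    """
    return {f: library.get(f, 0.0) for f in sample_features}


def classify(libraries, tokens):
    """Score each class by an assumed dot-product similarity."""
    sample = Counter(tokens)
    scores = {}
    for label, library in libraries.items():
        projection = project(library, sample)
        scores[label] = sum(sample[f] * projection[f] for f in sample)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy usage with made-up documents (not the Fudan corpus).
train = [
    ("sports", ["ball", "team", "goal", "team"]),
    ("finance", ["stock", "market", "bank", "stock"]),
]
libs = build_feature_libraries(train)
label, scores = classify(libs, ["team", "goal", "market"])
print(label)  # sports
```

In this sketch the cost of classifying one sample grows with the number of classes and the number of features in that sample, not with the number of training documents, which matches the abstract's claim that classification efficiency is not directly tied to sample size.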