Latent Semantic Indexing (LSI) is an unsupervised feature extraction technique whose effectiveness has been demonstrated in several research fields, including information retrieval. Because the quality of the features it extracts depends entirely on the feature distribution of the data, optimizing the data can improve its effectiveness. This paper proposes an optimization of LSI, the Augmented Space Model, together with a data partitioning strategy based on document length and the document frequency (DF) distribution of features, which allows the subspaces to inherit as much of the favorable structure of the original space as possible. Experiments show that a well-chosen subspace partitioning strategy preserves classification accuracy while greatly reducing the running time of the algorithm. Finally, the Augmented Space Model is used to fuse the different subspaces, yielding good performance: accuracy in the classification experiments reaches 85.92%.