A new retrieval algorithm for English text is proposed. Each text is first mapped to a frequency matrix of order 26, and singular value decomposition is applied to reduce the dimensionality of the text representation space. The features of the first and second singular value components are then fused to obtain a complex feature vector that reflects both the statistical frequency of letters and the sequential structure of the characters in the text. Finally, the cosine similarity between feature vectors is used as the similarity measure between the query and the documents. Comparative data show that the algorithm achieves good experimental results and outperforms the classic LSA retrieval algorithm in both retrieval precision and computational efficiency.
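The pipeline summarized above can be sketched in a few lines of NumPy. This is only an illustrative reading of the abstract, not the authors' implementation, and it rests on several assumptions: the "frequency matrix of order 26" is taken to be a 26 x 26 matrix of adjacent-letter (bigram) frequencies; the fusion step is assumed to place the first left singular vector (scaled by its singular value) in the real part and the second in the imaginary part; and the sign convention in `_fix_sign` is a hypothetical choice made only to keep the vectors comparable across texts. All function names are invented for this sketch.

```python
import numpy as np


def bigram_matrix(text):
    """Return a 26x26 matrix of adjacent-letter frequencies.

    Assumption: this is one plausible reading of "frequency matrix of
    order 26". Non-letters are dropped, so letters on either side of a
    space or punctuation mark are treated as adjacent.
    """
    letters = [ord(c) - ord("a") for c in text.lower() if "a" <= c <= "z"]
    m = np.zeros((26, 26))
    for a, b in zip(letters, letters[1:]):
        m[a, b] += 1
    total = m.sum()
    return m / total if total > 0 else m


def _fix_sign(v):
    """Flip v so its largest-magnitude entry is non-negative.

    Singular vectors are only defined up to sign; this convention is a
    hypothetical choice for the sketch, not taken from the paper.
    """
    k = np.argmax(np.abs(v))
    return v if v[k] >= 0 else -v


def complex_feature(text):
    """Fuse the first two singular components into one complex vector.

    Assumed fusion rule: real part = s1 * u1, imaginary part = s2 * u2.
    """
    u, s, _ = np.linalg.svd(bigram_matrix(text))
    first = s[0] * _fix_sign(u[:, 0])
    second = s[1] * _fix_sign(u[:, 1])
    return first + 1j * second


def cosine_similarity(x, y):
    """Cosine similarity for complex vectors: |<x, y>| / (||x|| ||y||)."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        return 0.0
    return float(abs(np.vdot(x, y)) / denom)


def rank(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    q = complex_feature(query)
    scored = [(cosine_similarity(q, complex_feature(d)), d) for d in documents]
    return sorted(scored, reverse=True)
```

Under these assumptions each document is reduced to a single 26-dimensional complex vector, so a query only requires one small SVD and one inner product per document. This may be part of why the abstract reports an efficiency advantage over LSA, which factorizes a full term-document matrix, although the abstract itself does not spell out the reason.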