Text vectorization is the process of building an algebraic model that converts text into vectors. It has important applications in text processing and natural language processing, and it is a key step in text data mining algorithms. Building on the well-known PageRank algorithm, this paper proposes a text vectorization algorithm based on the relationships between words within a sentence. By introducing word associations at the semantic level, the method overcomes the poor semantic sensitivity of traditional vectorization methods that rely on word-frequency statistics. Experiments on several different test corpora show that the proposed algorithm achieves higher accuracy.
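
The abstract does not spell out how words are linked or how scores become vector components, so the following Python sketch is only one possible reading and should not be taken as the authors' method. It assumes that two words are related when they occur in the same sentence, runs a standard PageRank power iteration over the resulting undirected graph, and uses each word's score as its vector component. The function names (`build_graph`, `pagerank`, `vectorize`), the damping factor of 0.85, and the iteration count are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(sentences):
    """Link every pair of distinct words that share a sentence.

    Assumption: sentence-level co-occurrence stands in for the
    'relationship between words' mentioned in the abstract.
    """
    graph = defaultdict(set)
    for sentence in sentences:
        words = set(sentence)
        for word in words:
            graph[word]  # register the word even if it has no neighbours
        for a, b in combinations(sorted(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Standard power-iteration PageRank on an undirected, unweighted graph."""
    n = len(graph)
    if n == 0:
        return {}
    rank = {word: 1.0 / n for word in graph}
    for _ in range(iterations):
        # Rank held by isolated words is spread evenly over all words.
        dangling = sum(rank[w] for w in graph if not graph[w])
        new_rank = {}
        for word in graph:
            incoming = sum(rank[v] / len(graph[v]) for v in graph[word])
            new_rank[word] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


def vectorize(sentences, vocabulary):
    """Return one PageRank score per vocabulary term (0.0 if the term is absent)."""
    rank = pagerank(build_graph(sentences))
    return [rank.get(term, 0.0) for term in vocabulary]


if __name__ == "__main__":
    # A toy document: each sentence is already tokenized into words.
    doc = [["text", "vector", "model"], ["vector", "space", "model"]]
    vocab = ["model", "space", "text", "vector", "graph"]
    print(dict(zip(vocab, vectorize(doc, vocab))))
```

In this toy document, "vector" and "model" appear in both sentences and therefore have more neighbours than "text" or "space", so they receive larger components. A purely frequency-based weighting would also rank them higher here; the graph-based scores differ in that a word's weight additionally depends on the weights of the words it is linked to.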