Measuring the similarity between Web documents is central to Web text classification, and an effective similarity measure can improve classification accuracy. The classical Vector Space Model (VSM) considers only how often each word occurs in a page and makes no effective use of how the word is distributed within the page, which limits classification accuracy. This paper computes the mean and variance of each word's positions in a page and introduces them into the similarity computation, yielding a new Web page similarity measure that embeds distribution information directly. Because it uses both word frequency and word distribution, the method improves and extends the classical Web page similarity measures. Experimental results show that the proposed measure is effective and feasible.
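To make the idea concrete, the following is a minimal Python sketch of a similarity measure that combines word frequency with the mean and variance of word positions. The abstract does not give the paper's actual formula, so everything specific here is an assumption made for illustration: the function names `word_stats` and `distribution_similarity`, the scaling of positions to [0, 1], whitespace tokenization, and the exponential "agreement" factor that discounts a shared word when its position statistics differ between the two pages.

```python
import math
from collections import defaultdict


def word_stats(text):
    """Return {word: (frequency, mean position, position variance)}.

    Positions are scaled to [0, 1] so pages of different length are comparable
    (an assumption of this sketch, not a detail taken from the paper).
    """
    tokens = text.lower().split()
    scale = max(len(tokens) - 1, 1)
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index / scale)

    stats = {}
    for word, pos in positions.items():
        mean = sum(pos) / len(pos)
        variance = sum((p - mean) ** 2 for p in pos) / len(pos)
        stats[word] = (len(pos), mean, variance)
    return stats


def distribution_similarity(text_a, text_b):
    """Cosine similarity on term frequencies, discounted per word by how much
    the word's position mean and variance differ between the two texts."""
    a, b = word_stats(text_a), word_stats(text_b)
    norm_a = math.sqrt(sum(freq * freq for freq, _, _ in a.values()))
    norm_b = math.sqrt(sum(freq * freq for freq, _, _ in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0

    score = 0.0
    for word in a.keys() & b.keys():
        freq_a, mean_a, var_a = a[word]
        freq_b, mean_b, var_b = b[word]
        # Hypothetical agreement factor in (0, 1]; equals 1 when the word has
        # the same mean position and variance in both texts.
        agreement = math.exp(-((mean_a - mean_b) ** 2 + (var_a - var_b) ** 2))
        score += freq_a * freq_b * agreement
    return score / (norm_a * norm_b)


if __name__ == "__main__":
    page_a = "web page classification uses word frequency"
    page_b = "frequency word uses classification page web"
    print(distribution_similarity(page_a, page_a))
    print(distribution_similarity(page_a, page_b))
```

Because the agreement factor never exceeds 1, this score is bounded above by the ordinary VSM cosine similarity. Two pages with the same words in a different order therefore score lower than under plain VSM, which is the kind of distinction the abstract attributes to distribution information.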