Representing the semantics of a Web page by the information that repeats within it can improve the accuracy of Web page classification. Building on this idea, this paper analyzes the traditional repeating-pattern representation and proposes a semantic representation of Web information based on repeating patterns. The method first gives a formal description of repeating patterns, then extracts the repeating patterns from Web information to build a correlation matrix that expresses its semantic features, and finally uses a γ approximate matching algorithm to compute the weights of the repeating patterns and classify the Web information. Experimental results show that the proposed representation captures the topical features of Web pages well and improves the accuracy of Web information classification.
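
The extraction and matrix-building steps can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: a "repeating pattern" is taken to be a contiguous run of tokens that occurs at least twice in one page, and the matrix simply records per-page pattern frequencies. The function names `repeating_patterns` and `pattern_matrix` and the parameters `min_len`, `max_len`, and `min_freq` are hypothetical. The γ approximate matching and weighting step is left out because the abstract does not define it.

```python
from collections import Counter


def repeating_patterns(tokens, min_len=2, max_len=4, min_freq=2):
    """Return {pattern: frequency} for token runs seen at least min_freq times.

    Assumption: a pattern is a contiguous run of min_len..max_len tokens.
    Overlapping occurrences are counted separately.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_freq}


def pattern_matrix(pages):
    """Build a pages-by-patterns frequency matrix (hypothetical layout)."""
    per_page = [repeating_patterns(tokens) for tokens in pages]
    # Columns: every pattern found in any page, in sorted order.
    vocab = sorted({p for found in per_page for p in found})
    matrix = [[found.get(p, 0) for p in vocab] for found in per_page]
    return vocab, matrix


# Two toy "pages", already tokenized into words.
pages = [
    "web page classification uses web page features".split(),
    "stock price news and stock price charts".split(),
]
vocab, matrix = pattern_matrix(pages)
```

Each row of `matrix` is one page and each column one pattern in `vocab`, so a classifier could consume the rows as feature vectors once the patterns have been weighted.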