To give a mathematically formalized theoretical analysis of Web page information extraction, Web page information is first represented by an information function over a one-dimensional spatial domain, and a theorem of Web page information filtration is derived by analyzing the Web page filtering process. A feature extraction theory for Web pages based on correlative filtration is then derived by analyzing the similarity between Web pages. Building on this theory, an adaptive Web page information extraction method based on feature correlation learning is proposed, which combines the tag-rule-based and the content-rule-based extraction approaches. Both the derived feature extraction theory and the experimental results indicate that the method achieves good accuracy.