如何从纷繁复杂的网页中抽取有价值的信息是信息检索和Web数据挖掘中的重要问题。利用网页集信息所呈现的分布特点,提出一种基于互信息度量的Web信息抽取方法,它能够自动识别噪声信息并保留关键信息。该方法将网页解析成DOM树,计算叶子节点的互信息值;然后按DOM树结构对叶子节点进行分块聚集,向上递归求得标签〈body〉的互信息值,并以此作为阈值区分噪声与非噪声。最后,在多个国内知名网站上的实验及对比结果证明了该方法的有效性。
Extracting valuable information from complex web pages is an important problem in information retrieval and Web data mining. Exploiting the distribution characteristics of information across a set of web pages, we propose a Web information extraction method based on a mutual information metric that automatically identifies noisy information and retains key information. The method parses each web page into a DOM tree and calculates the mutual information value of every leaf node. The leaf nodes are then aggregated into blocks according to the DOM tree structure, and the mutual information value of the 〈body〉 tag is computed by upward recursion and used as the threshold for distinguishing noise from non-noise. Experiments and comparative results on several well-known domestic websites demonstrate the effectiveness of the proposed method.
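
The pipeline described above (score the leaves, aggregate upward to 〈body〉, use that value as the threshold) can be illustrated with a minimal sketch. The abstract does not give the mutual information formula or the aggregation rule, so both are assumptions here: each leaf is scored with a simple stand-in, the negative log of the fraction of pages in the set that contain its text (an inverse-document-frequency score, not the paper's mutual information), and each internal node takes the mean of its children's scores. The names `Node`, `leaf_scores`, `aggregate`, `extract`, and `make_page` are hypothetical, and pages are built by hand rather than parsed from HTML.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified DOM node; `text` is only meaningful for leaves."""
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def leaves(node):
    """Yield the leaf nodes under `node` in document order."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def leaf_scores(pages):
    """ASSUMED stand-in for the paper's mutual information value:
    -log2(fraction of pages whose leaves contain this text).
    Text repeated on every page (menus, footers) scores 0."""
    n = len(pages)
    doc_freq = {}
    for page in pages:
        for text in {leaf.text for leaf in leaves(page)}:
            doc_freq[text] = doc_freq.get(text, 0) + 1
    return {text: -math.log2(count / n) for text, count in doc_freq.items()}


def aggregate(node, scores):
    """ASSUMED aggregation rule: an internal node's value is the mean
    of its children's values, computed recursively up the tree."""
    if not node.children:
        return scores[node.text]
    return sum(aggregate(c, scores) for c in node.children) / len(node.children)


def extract(body, scores):
    """Use the value aggregated at <body> as the threshold and keep
    the leaves that score above it."""
    threshold = aggregate(body, scores)
    return [leaf.text for leaf in leaves(body) if scores[leaf.text] > threshold]


def make_page(story):
    """Hypothetical page: shared navigation and footer, unique story."""
    return Node("body", children=[
        Node("div", children=[Node("a", "Home"), Node("a", "About")]),
        Node("div", children=[Node("p", story)]),
        Node("p", "Copyright"),
    ])


pages = [make_page(f"Story {i}") for i in range(4)]
scores = leaf_scores(pages)
kept = extract(pages[0], scores)
print(kept)
```

On this toy page set the navigation links and the footer occur on every page and score 0, while each story occurs on a single page and scores 2, so only the story text lies above the value aggregated at the root. A real implementation would substitute the paper's mutual information measure and its block-aggregation rule for the two assumed functions.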