This paper proposes a Web information extraction method that combines tag paths with a line-block distribution function. The Web page is first parsed into a DOM tree, which is then pruned using visual features and tag-filtering rules. Next, tag path features are introduced to roughly divide the page into main text content and noise content. Finally, the line-block distribution function is applied to extract the main text. Experimental results show that the method effectively prevents both the mistaken deletion of main text and the missed deletion of noise content, which makes the extracted text more accurate: precision reaches 91%, recall reaches 95%, and the F-score reaches 93%. The algorithm's accuracy on Web pages that contain many short text fragments still needs improvement.
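
To make the final extraction step more concrete, the following is a minimal Python sketch of a line-block (row block) distribution function. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the input is assumed to be the page text with tags already stripped and line breaks preserved, and the function names, the block width `k`, and the `threshold` value are all hypothetical choices made for this example. DOM pruning and tag path filtering are not shown.

```python
import re


def line_sizes(lines):
    """Number of non-whitespace characters on each line."""
    return [len(re.sub(r"\s+", "", line)) for line in lines]


def block_distribution(lines, k=3):
    """Line-block distribution: for each line index i, the total number of
    non-whitespace characters in the window of lines i .. i+k-1."""
    sizes = line_sizes(lines)
    return [sum(sizes[i:i + k]) for i in range(len(sizes))]


def extract_main_text(lines, k=3, threshold=40):
    """Return the densest contiguous region of the page as main text.

    A region is a maximal run of line indices whose block length exceeds
    `threshold` (an assumed value, not taken from the paper). Among all
    regions, the one with the most characters is kept. Because a block looks
    k-1 lines ahead, a region can start up to k-1 lines before the dense text.
    """
    sizes = line_sizes(lines)
    blocks = block_distribution(lines, k)
    best_span, best_weight = None, 0
    i, n = 0, len(lines)
    while i < n:
        if blocks[i] <= threshold:
            i += 1
            continue
        j = i
        while j < n and blocks[j] > threshold:
            j += 1
        weight = sum(sizes[i:j])
        if weight > best_weight:
            best_span, best_weight = (i, j), weight
        i = j
    if best_span is None:
        return ""
    start, end = best_span
    # Drop blank lines inside the selected region.
    return "\n".join(line.strip() for line in lines[start:end] if line.strip())
```

Short navigation or footer lines produce small block values and fall below the threshold, while consecutive long lines form a high plateau. This also suggests why pages dominated by short text fragments are difficult: their real content never rises far above the noise level.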