Correctly extracting and segmenting the main text of Web pages is important for applications such as text mining. This paper proposes and implements a method that extracts the main text from a Web page and divides it into the correct paragraphs. The method first builds a DOM structure tree from the page layout tags <table> and <div>. It then uses the nesting relationships among the layout tags, as reflected in the tree, to keep or discard content blocks and so obtain the correct main text. Finally, it handles certain special tags to split the main text into paragraphs. Experiments show that the method is easy to implement and efficient, and that it extracts and segments the main text automatically and accurately.
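The three steps described above (layout tree, block selection, paragraph splitting) can be illustrated with a minimal Python sketch built on the standard-library `html.parser`. It is not the paper's implementation: the abstract does not state the selection criterion or name the special tags, so the scoring rule (non-link text length of each block), the choice of <p> and <br> as paragraph markers, and all names (`Block`, `LayoutTreeBuilder`, `extract_paragraphs`) are assumptions made for illustration. The sketch also assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"table", "div"}    # layout tags named in the abstract
BREAK_TAGS = {"p", "br"}          # assumed paragraph-marking "special tags"
SKIP_TAGS = {"script", "style"}   # assumed non-content tags


class Block:
    """One node of the layout tree: a <table>/<div> and the text it owns."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.paragraphs = [""]    # text directly inside this block
        self.link_chars = 0       # characters that sit inside <a> elements

    def score(self):
        # Assumed heuristic: body text is long and mostly not link text.
        text_chars = sum(len(p.strip()) for p in self.paragraphs)
        return text_chars - self.link_chars


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree of layout blocks; text goes to the innermost block."""

    def __init__(self):
        super().__init__()
        self.root = Block()
        self.current = self.root
        self.skip_depth = 0
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in LAYOUT_TAGS:
            child = Block(self.current)
            self.current.children.append(child)
            self.current = child
        elif tag in BREAK_TAGS:
            self.current.paragraphs.append("")   # start a new paragraph
        elif tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in LAYOUT_TAGS and self.current.parent is not None:
            self.current = self.current.parent
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag == "p":
            self.current.paragraphs.append("")   # close the paragraph

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.current.paragraphs[-1] += data
        if self.link_depth:
            self.current.link_chars += len(data.strip())


def iter_blocks(block):
    """Yield a block and all of its descendants, depth first."""
    yield block
    for child in block.children:
        yield from iter_blocks(child)


def extract_paragraphs(html):
    """Return the paragraphs of the highest-scoring layout block."""
    builder = LayoutTreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_blocks(builder.root), key=Block.score)
    return [" ".join(p.split()) for p in best.paragraphs if p.strip()]


page = """
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<table><tr><td>
  <div id="main"><p>First paragraph of the article body.</p>
  <p>Second paragraph, with more text.</p></div>
</td></tr></table>
<div id="footer"><a href="/about">About us</a></div>
"""
for paragraph in extract_paragraphs(page):
    print(paragraph)
```

On this sample page the navigation and footer blocks score low because their text is almost entirely link text, so the nested "main" block is selected and its two paragraphs are printed separately. A real page would likely need a richer criterion, for example one that also compares a block with its ancestors and siblings in the tree.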