Based on a study of the HTML structure and features of agricultural web pages, this paper describes an experimental study of content-based information extraction and classification for such pages. In the experiments, the document object model (DOM) structure is used to extract and preprocess information from agricultural web pages; the class attributes of each text are computed automatically from its content to obtain feature words; and new documents are classified automatically by summarizing the features of the sample documents. The experimental results show that the proposed extraction method has low time complexity and high precision, and that it improves classification accuracy.
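The pipeline summarized above (structure-based extraction, feature words derived from sample documents, classification of new documents) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the parser, the feature-weighting scheme, or the classifier, so the sketch assumes Python's standard `html.parser` as a stand-in for DOM traversal, raw term frequency for choosing feature words, and feature-word overlap as the classification rule. The names `TextExtractor`, `extract_text`, `feature_words`, and `classify` are hypothetical, and whitespace tokenization is used only for brevity; Chinese text would need a word segmenter.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # visible text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def feature_words(samples, top_n=3):
    """Map each class label to its top_n most frequent words.

    `samples` maps a label to a list of sample texts (assumed format).
    """
    features = {}
    for label, texts in samples.items():
        counts = Counter(w for t in texts for w in t.lower().split())
        features[label] = {w for w, _ in counts.most_common(top_n)}
    return features


def classify(text, features):
    """Assign the label whose feature words overlap most with the text."""
    words = set(text.lower().split())
    return max(features, key=lambda label: len(words & features[label]))


# Illustrative data only; not taken from the paper's experiments.
page = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Wheat rust control</p><script>var x=1;</script></body></html>"
)
samples = {
    "crops": ["wheat rice wheat maize"],
    "livestock": ["cattle sheep cattle pigs"],
}
features = feature_words(samples)
label = classify(extract_text(page), features)
```

In this sketch the extraction step makes a single pass over the markup, which is in keeping with the low time cost the abstract reports, although the paper's actual extraction rules may differ.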