

Research on Information Extraction and Classification of Agricultural Web Pages Based on Text Content

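ISSN: 1007-7634
Journal: 《情报科学》 (Information Science)
Date: 0
Classification: G350 [Cultural Sciences: Information Science]
Author affiliations: [1] Department of Information Management, Nanjing University, Nanjing, Jiangsu 210093; [2] Institute of Multimedia Information, Nanjing University, Nanjing, Jiangsu 210093
Funding: 2008 National Social Science Fund of China, Key Project (08ATQ003)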

Keywords: text, agricultural web pages, information extraction, classification

Abstract (translated from the Chinese):

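Based on a study of the HTML structure and features of agricultural web pages, this paper describes an experimental study of information extraction and classification for agricultural web pages based on text content. In the experiments, the DOM structure is used to extract and preprocess information from agricultural web pages; text category attributes are computed automatically from the text content to obtain feature words; and new documents are classified automatically by summarizing the features of sample documents. The experimental results show that the proposed information extraction has low time complexity and high precision, and that it improves classification accuracy.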

English abstract:

Through investigation and analysis of the HTML structures and features of agricultural websites, the paper describes methods of information extraction and classification for agricultural web pages. The main contents include: information extraction and classification for agricultural web pages based on the document object model (DOM) structure; automatic calculation of text classification attributes according to text content; obtaining feature words; and automatic classification of new documents by summarizing the features of sample documents. The experimental results show that web information extraction takes little time while remaining precise, with satisfactory classification rates.
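
The pipeline summarized in the abstract (extract text from the page structure, derive feature words from sample documents, classify new documents) might be sketched as follows. This is a minimal illustration, not the authors' method: the abstract gives no algorithmic detail, so the sketch assumes Python's event-based html.parser as a stand-in for DOM traversal, raw term frequency as the feature-word criterion, and a simple overlap count as the classifier. The names TextExtractor, extract_text, tokenize, feature_words and classify are all hypothetical, and the tokenizer handles only ASCII letters.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect visible text from a page, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase the text and split it into ASCII-letter words (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def feature_words(samples, top_n=5):
    """Map each label to its top_n most frequent words.

    samples: dict of label -> list of HTML strings (the sample documents).
    Raw frequency is an assumed criterion; no stop-word filtering is done.
    """
    features = {}
    for label, pages in samples.items():
        counts = Counter()
        for page in pages:
            counts.update(tokenize(extract_text(page)))
        features[label] = {word for word, _ in counts.most_common(top_n)}
    return features


def classify(html, features):
    """Pick the label whose feature words occur most often in the new page."""
    tokens = tokenize(extract_text(html))
    scores = {
        label: sum(token in words for token in tokens)
        for label, words in features.items()
    }
    return max(scores, key=scores.get)
```

To use the sketch, build the feature sets once with feature_words from a dictionary of labelled sample pages, then pass each new page to classify together with those feature sets. A real system would likely weight terms (for example by inverse document frequency) and segment Chinese text, neither of which is attempted here.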
