为了改善从Web上获取的新闻信息的使用价值,针对Web网站存在大量非科技相关新闻的现状,以互联网上政府新闻网站、凤凰网等新闻为研究背景,选取TF-IDF文本加权方法,设计了科技新闻多层次二分类模型,实现了基于TFIDF的科技新闻文本分类抽取系统,在20万新闻文档和4000多种分类上,实验取得了科技新闻85.3%的识别准确率和非科技新闻82.9%的识别率,为Web科技新闻分类抽取提供有实用价值的参考模型。
There are a lot of non-scientific-related news on Websites. In order to improve the useful value for the news information,a novel multilevel dichotomous model of text automatic categorization extraction system for technology news based on TF-IDF was designed and implemented. The news offered by government news website and Phoenix as the research background in scientific news categorization extraction. Experiments showed a85. 3 percent accuracy for scientific-related news and 82. 9 percent recognition rate for nonscientific-related news respectively in the test containing two hundred thousand documents and more than four thousand news classifications. The results showed that the proposed method offered a useful reference model on website scientific intelligence.