Web pages convey far richer information than plain-text files, so Web page classification differs from plain-text classification. Based on the characteristics of online Chinese news pages, we propose a practical algorithm that extracts topics from Web pages without requiring a dictionary. The extracted category-topic concepts are incorporated into the knowledge base used for classification, and the pages are then classified with the hybrid classification algorithm proposed by our research group. The experimental corpus is drawn from financial news on Xinhuanet. Experimental results show that, compared with using the full text alone without Web page features, classification performance improves.