Web-page classification is an important research topic in web mining, and it is considerably more difficult than plain-text classification. Removing the noisy information embedded in web pages can improve classification accuracy, and summarization-based web-page classification builds on this idea. In this paper, three traditional web-page summarization methods are analyzed and improved, and two new methods are proposed: the Content Body summarization method and an ensemble summarization method that combines the four summarization methods. On this basis, extensive summarization-based web-page classification experiments were carried out. The results show that every summarization method improves classification performance. The ensemble summarization method performs best, improving the F1 score by 12.9% over pure-text-based classification.