To address the limitation that most existing web page classification methods can handle only small-scale data, this paper proposes a large-scale Chinese web page classification method that trains on a selected core subset. The method converts the optimization problem of the support vector machine (SVM) into an equivalent approximate minimum closure ball (AMCB) problem, so that only a core subset of the data set needs to take part in classifier training. In addition, the feature selection stage uses an improved mutual information feature selection model based on part of speech, which further improves the method's ability to process large-scale web page data. Experiments were carried out on the large-scale web page data set provided by Sogou Labs. The results show that the method reaches accuracy comparable to that of an SVM while greatly reducing training time, and tests on class-imbalanced data show that it also performs well on imbalanced web page classification.
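The core-subset idea can be illustrated with a minimal sketch. It assumes plain Euclidean points and uses the well-known Bădoiu-Clarkson farthest-point iteration to approximate the enclosing ball. The abstract does not say which solver the authors use, and the kernel feature-space mapping that the SVM reduction would require is omitted here. The function name `approx_meb_core_set` and the parameter `eps` are hypothetical, chosen only for illustration.

```python
import math


def approx_meb_core_set(points, eps=0.1):
    """Approximate the smallest ball enclosing `points` (illustrative sketch).

    Uses the Badoiu-Clarkson farthest-point iteration, not necessarily the
    solver used in the paper. Returns (center, radius, core_indices), where
    core_indices are the only points that ever influenced the center.
    """
    center = list(points[0])          # start from an arbitrary point
    core = {0}                        # indices of the points in the core subset
    iterations = math.ceil(1.0 / eps ** 2)
    for i in range(1, iterations + 1):
        # Find the point farthest from the current center.
        far = max(range(len(points)),
                  key=lambda j: math.dist(points[j], center))
        core.add(far)
        # Move the center a shrinking step toward that farthest point.
        step = 1.0 / (i + 1)
        center = [c + step * (p - c) for c, p in zip(center, points[far])]
    # The radius must cover every point, not only the core subset.
    radius = max(math.dist(p, center) for p in points)
    return center, radius, sorted(core)


if __name__ == "__main__":
    data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0),
            (1.0, 1.0), (0.5, 1.5), (1.2, 0.7)]
    c, r, core = approx_meb_core_set(data, eps=0.05)
    print("center:", c)
    print("radius:", r)
    print("core subset indices:", core)
```

In this sketch only the points recorded in `core` ever move the center, which is the property that would let a classifier train on a small subset instead of the full data set. The number of iterations, and therefore the size of the core subset, depends on `eps` and not on the number of input points.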