Traditional Web text classification methods cannot cope with large-scale classification problems. Building on an in-depth analysis of Hadoop, the current mainstream parallel computing platform, this paper proposes a Hadoop-based Web text classification system. The system consists mainly of four modules: text preprocessing, vector representation, text classification, and result evaluation. Comparative experiments on a real-world dataset demonstrate the effectiveness of the system.