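This paper studies cross-language automatic web page classification based on frequently co-occurring entropy. Translation software is used to translate all Chinese web pages into English; the frequently co-occurring entropy of the features shared by the Chinese and English pages is computed to identify the knowledge common to both; and this common knowledge is combined with the English pages to train a Chinese classification model. Experimental results show that the method performs well compared with the Bayesian classification model, the vector space classification model, and the information bottleneck model.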
An approach to cross-language automatic web page classification based on frequently co-occurring entropy (FCE) is proposed. The algorithm first translates all Chinese web pages into English using simple translation software. Second, it computes the frequently co-occurring entropy over all Chinese and English web pages. Third, it selects the part common to the Chinese and English pages based on the FCE ranks. Last, it trains a Chinese classification model from the English pages together with the common part. Experimental results on the ODP corpus show that the method performs better than the NB, SVM, and IB models.
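The scoring and selection steps can be illustrated with a minimal Python sketch. The abstract does not give the FCE formula, so the score used here is an assumed stand-in, not the authors' definition: for each term that occurs in both collections, the binary entropy of its relative-frequency split between the two collections is weighted by its mean relative frequency. The function names (`relative_freqs`, `cooccurrence_scores`, `select_common_terms`, `project`) and the cut-off `k` are hypothetical, and the pages are assumed to be already translated and tokenized.

```python
import math
from collections import Counter


def relative_freqs(docs):
    """Map each term to its relative frequency over a list of token lists."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def cooccurrence_scores(translated_docs, english_docs):
    """Score terms that occur in both collections.

    ASSUMPTION: this score is an illustrative stand-in for FCE, whose exact
    definition is not stated in the abstract. A term scores high when it is
    frequent in both collections and split evenly between them.
    """
    p_t = relative_freqs(translated_docs)
    p_e = relative_freqs(english_docs)
    scores = {}
    for term in p_t.keys() & p_e.keys():
        a, b = p_t[term], p_e[term]
        q = a / (a + b)  # share of the term's mass in the translated pages
        entropy = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))
        scores[term] = entropy * (a + b) / 2
    return scores


def select_common_terms(scores, k):
    """Return the k top-ranked terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def project(docs, vocabulary):
    """Keep only the tokens of each document that are in the vocabulary."""
    vocab = set(vocabulary)
    return [[tok for tok in doc if tok in vocab] for doc in docs]
```

In this sketch, the English pages passed through `project` with the selected terms would then serve as training input for whichever classifier is chosen; the training step itself is left out because the abstract does not specify the model.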