Traditional web page clustering methods suffer from low accuracy and high computational complexity. This article therefore proposes a new web page clustering method based on URL similarity and a simple DOM tree. A tree matching algorithm is used for denoising, after which statistical methods are used to identify the type of each page. Experimental results show that the method achieves relatively high accuracy.
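The URL-similarity step can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the grouping procedure, so everything below is an assumption made for illustration: the function names `url_tokens`, `url_similarity`, and `cluster_urls`, the position-wise segment comparison, the rule that two all-digit segments count as a match, and the threshold of 0.6 are all hypothetical and are not taken from the article.

```python
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL into its host plus non-empty path segments."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    return [parts.netloc] + segments


def url_similarity(a, b):
    """Fraction of positions at which two URLs agree (assumed measure).

    Two all-digit segments are treated as equal, on the assumption that
    they are record identifiers generated by the same page template.
    """
    ta, tb = url_tokens(a), url_tokens(b)
    if not ta or not tb:
        return 0.0
    matches = sum(
        1
        for x, y in zip(ta, tb)
        if x == y or (x.isdigit() and y.isdigit())
    )
    return matches / max(len(ta), len(tb))


def cluster_urls(urls, threshold=0.6):
    """Greedy grouping: a URL joins the first cluster whose first member
    is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for url in urls:
        for cluster in clusters:
            if url_similarity(url, cluster[0]) >= threshold:
                cluster.append(url)
                break
        else:
            clusters.append([url])
    return clusters


if __name__ == "__main__":
    sample = [
        "http://example.com/news/123",
        "http://example.com/news/456",
        "http://example.com/about",
    ]
    for group in cluster_urls(sample):
        print(group)
```

In this sketch the two `/news/` pages fall into one group and the `/about` page into another. In the method described above, such URL-based groups would then be refined using the DOM tree, which this sketch does not attempt.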