To detect spoofed websites accurately, a page similarity computation method based on webpage structure is proposed. The method first segments each webpage into blocks and filters them, then identifies groups of similar nodes through a preliminary comparison, and finally quantifies the page data to determine whether the pages are similar. Experiments show that the method detects webpage similarity effectively, particularly for mirrored copies of spoofed websites, with false positive and false negative rates both no more than 10%.
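
The abstract names three stages (segmenting and filtering, preliminary comparison, quantified scoring) without giving their details. The following Python sketch is one possible reading of that pipeline, not the authors' implementation. Everything specific in it is an assumption made for illustration: the set of block-level tags, the `min_nodes` filter, matching blocks by root tag as the "preliminary comparison", and the use of `difflib.SequenceMatcher` as the similarity measure. It also assumes reasonably well-nested HTML.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical choice of tags that open a structural block.
BLOCK_TAGS = {"div", "table", "ul", "ol", "form", "nav",
              "header", "footer", "section"}


class BlockCollector(HTMLParser):
    """Collects, for every block-level element, its tag followed by
    the tags of all its descendants in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # blocks that are currently open
        self.blocks = []  # blocks that have been closed

    def handle_starttag(self, tag, attrs):
        # Every open block gains this tag as a descendant.
        for block in self.stack:
            block.append(tag)
        if tag in BLOCK_TAGS:
            self.stack.append([tag])

    def handle_endtag(self, tag):
        # Assumes well-nested markup: the innermost open block closes.
        if tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())


def segment(html, min_nodes=3):
    """Stage 1: split a page into blocks and drop very small ones."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return [b for b in collector.blocks if len(b) >= min_nodes]


def _directed_score(blocks_a, blocks_b):
    """Stages 2 and 3, in one direction: for each block of page A,
    find its best match among page B blocks with the same root tag,
    then average the match ratios weighted by block size."""
    total = sum(len(b) for b in blocks_a)
    if total == 0:
        return 0.0
    weighted = 0.0
    for block in blocks_a:
        candidates = (c for c in blocks_b if c[0] == block[0])
        best = max(
            (SequenceMatcher(None, block, c).ratio() for c in candidates),
            default=0.0,
        )
        weighted += len(block) * best
    return weighted / total


def page_similarity(html_a, html_b):
    """Symmetric structural similarity in the range [0, 1]."""
    blocks_a, blocks_b = segment(html_a), segment(html_b)
    forward = _directed_score(blocks_a, blocks_b)
    backward = _directed_score(blocks_b, blocks_a)
    return (forward + backward) / 2


if __name__ == "__main__":
    genuine = "<div><ul><li>Home</li><li>Login</li></ul><p>Bank</p></div>"
    mirror = "<div><ul><li>Home</li><li>Sign in</li></ul><p>Bnak</p></div>"
    print(page_similarity(genuine, mirror))
```

Because only tag structure is compared, a mirror that changes the visible text but keeps the layout still scores as highly similar. A decision threshold on the returned score would have to be chosen empirically; none is given in the abstract.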