This paper proposes a similarity measure for the tag-structure trees of web pages. The tag structure of each page is first represented as a tree. A dynamic programming algorithm then pairs, at each level, the most similar child nodes of the two trees and continues the comparison within each pair, while child nodes left without a match contribute to the distance. The sum of these contributions is taken as the distance between the two trees and is used to measure how similar the two pages are. Experiments show that the method correctly distinguishes pages of the same class from pages of different classes.
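The abstract does not give the cost model or the exact matching rule, so the following Python sketch is only one possible reading of the idea, not the authors' implementation. It assumes that children are ordered, that two nodes with different tags cost 1, that an unmatched child costs the number of nodes in its subtree, and that the distance is normalised by the combined size of the two trees. The names `Node`, `size`, `tree_distance`, and `similarity` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A tag node in a page's structure tree (hypothetical type)."""
    tag: str
    children: tuple = ()


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def tree_distance(a, b):
    """Distance between two tag trees under the assumed cost model.

    The roots are compared by tag, then the two ordered child lists are
    aligned with an edit-distance style dynamic program: a matched pair
    costs the recursive distance between the two subtrees, and an
    unmatched child costs the size of its subtree.
    """
    root_cost = 0 if a.tag == b.tag else 1  # assumed: tag mismatch costs 1
    m, n = len(a.children), len(b.children)

    # dp[i][j]: minimum cost of aligning the first i children of `a`
    # with the first j children of `b`.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            child_a, child_b = a.children[i - 1], b.children[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + size(child_a),      # child_a left unmatched
                dp[i][j - 1] + size(child_b),      # child_b left unmatched
                dp[i - 1][j - 1] + tree_distance(child_a, child_b),  # matched
            )
    return root_cost + dp[m][n]


def similarity(a, b):
    """Map the distance to (0, 1]; identical trees give 1.0 (assumed normalisation)."""
    return 1 - tree_distance(a, b) / (size(a) + size(b))


# Two pages that differ by a single <p> under the same <div>.
page_a = Node("html", (
    Node("head", (Node("title"),)),
    Node("body", (Node("div", (Node("p"), Node("p"))),)),
))
page_b = Node("html", (
    Node("head", (Node("title"),)),
    Node("body", (Node("div", (Node("p"),)),)),
))
print(tree_distance(page_a, page_b))  # the extra <p> is the only unmatched node
print(similarity(page_a, page_b))
```

Under these assumptions, pages built from the same template differ only by a few unmatched subtrees and score close to 1, while structurally unrelated pages accumulate distance at every level. A threshold on `similarity` for separating page classes would have to be chosen empirically.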