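Most current Web community identification algorithms rely on pure link analysis and ignore the textual properties of the Web. To address the shortcomings of the maximum-flow-based community identification framework proposed by Flake et al. (such as assigning unfair weights to the links between pages and relying on a single ranking strategy), an improved algorithm that combines page content analysis with link analysis is proposed. First, a new edge capacity assignment method based on text similarity is proposed. Based on the observation that the more similar the contents of two pages are, the more authority they pass to each other, the content similarity of pages is used to set the edge capacities of the Web graph; the concrete strategies are Max-flow+TF-IDF edge capacity assignment and Max-flow+TF-IDF+Seeds edge capacity assignment. Second, the proposed ranking strategy for community nodes takes full account of the similarity between a node and the community topic, which strengthens the discrimination between nodes. Theoretical analysis and experiments show that the algorithm improves the precision and size of the discovered communities and yields ranking scores that are more objective and reasonable.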
Most studies on Web community extraction rely on pure link analysis and therefore neglect the textual properties of Web pages that are interconnected via complex hyperlinks. This paper proposes an improved algorithm based on Flake's method, which uses the maximum flow algorithm. Based on the observation that the more similar the contents of two pages are, the more authority they exchange, the lexical similarity of Web pages is used to assign edge capacities. Two assignment methods are introduced: MT (Max-flow + TF-IDF) and MTS (Max-flow + TF-IDF + Seeds). Furthermore, we propose an efficient ranking scheme that strengthens the differences between community members according to their content similarity to the community topics. When the highest-ranked nodes are selected in the new method, the quality of the newly labeled seeds is ensured by taking the lexical similarity between each node and the seeds into account. The experimental results indicate that the content-combined approach handles a variety of data sets effectively, increases the size and quality of the extracted community, and ranks community pages more reasonably.
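The idea of content-weighted edge capacities can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact MT or MTS formulas, so the capacity rule `k * cosine similarity`, the scale factor `k`, the smoothed IDF, the unit capacity on sink edges, and all function and variable names below are assumptions made for illustration. The Seeds variant (MTS) and the ranking scheme are not covered. The flow network follows the general shape of a max-flow community framework: a virtual source feeds the seed pages, every non-seed page drains to a virtual sink, and the community is the source side of the minimum cut.

```python
import math
from collections import Counter, deque

SOURCE, SINK = "_source_", "_sink_"  # virtual nodes (hypothetical names)

def tfidf_vectors(docs):
    """Return {page: {term: weight}} using TF times a smoothed IDF (assumed variant)."""
    n = len(docs)
    df = Counter(t for words in docs.values() for t in set(words))
    return {
        page: {t: (c / len(words)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in Counter(words).items()}
        for page, words in docs.items()
    }

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def _bfs_parents(cap, eps=1e-9):
    """Breadth-first search from SOURCE over edges with residual capacity > eps."""
    parent = {SOURCE: None}
    queue = deque([SOURCE])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > eps and v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def min_cut_source_side(cap):
    """Edmonds-Karp max flow; returns the pages still reachable from SOURCE afterwards."""
    while True:
        parent = _bfs_parents(cap)
        if SINK not in parent:
            return set(parent) - {SOURCE}
        path, v = [], SINK
        while parent[v] is not None:  # walk back from SINK to SOURCE
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= push
            cap[b][a] += push

def extract_community(links, docs, seeds, k=5.0, use_content=True):
    """Build the flow network and return the community around `seeds`."""
    vecs = tfidf_vectors(docs)
    cap = {}

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)  # residual (reverse) edge

    for u, v in links:
        # Assumed rule: capacity grows with content similarity; uniform k otherwise.
        c = k * (cosine(vecs[u], vecs[v]) if use_content else 1.0)
        add_edge(u, v, c)
        add_edge(v, u, c)
    for s in seeds:
        add_edge(SOURCE, s, math.inf)
    for p in docs:
        if p not in seeds:
            add_edge(p, SINK, 1.0)
    return min_cut_source_side(cap)

# Toy data: three jazz pages and two cooking pages joined by one off-topic link (c, d).
docs = {
    "a": "jazz saxophone improvisation jazz".split(),
    "b": "jazz saxophone quartet".split(),
    "c": "jazz improvisation quartet".split(),
    "d": "pasta recipe tomato".split(),
    "e": "pasta recipe basil".split(),
}
links = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

print(sorted(extract_community(links, docs, {"a"})))                     # ['a', 'b', 'c']
print(sorted(extract_community(links, docs, {"a"}, use_content=False)))  # ['a', 'b', 'c', 'd', 'e']
```

On this toy graph the uniform-capacity run absorbs the two cooking pages, while the content-weighted run cuts at the off-topic link because pages `c` and `d` share no terms and the link therefore carries zero capacity. This matches the motivation stated in the abstract, though a real Web graph would need a tuned `k` and a far larger crawl.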