Fine-grained clustering of objects in Web documents has recently become a hot topic in the Web information retrieval (IR) research community. High-quality, large-scale Web IR requires fine-grained clustering of the objects in documents. However, most existing clustering models are confined to the level of textual content or article topic. Because they do not use token information to identify objects at a more detailed level, their results are coarse-grained. To address this problem, the authors propose a novel fine-grained clustering algorithm named InfoSigs, which exploits the tree-like probabilistic hierarchy between tokens in Web documents and uses the token information distribution as a feature signature. The work makes two contributions. First, techniques are presented to construct a directed acyclic graph of information transmission from token frequency sequences, which imply a probabilistic hierarchy between tokens. Each token is given a weight based on how concentrated its information distribution is in the graph, which yields more representative feature vectors. Second, a self-tuning record-merging model is proposed that effectively raises the similarity among records within a record cluster and reduces the impact of noise on the merging process. Experiments on real datasets show that InfoSigs outperforms the conventional algorithms I-Match and Shingling, with an average improvement of about 21.3% in F-Measure. The results indicate that InfoSigs generates more fine-grained clustering results than the conventional methods and can be applied effectively to clustering Web objects across multiple domains.
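For readers unfamiliar with the Shingling baseline and the F-Measure named in the abstract, the following is a minimal sketch of both, not the InfoSigs algorithm itself. It assumes whitespace tokenization, a shingle width `w` of 3, Jaccard resemblance over shingle sets, and a pairwise definition of the F-Measure; the abstract does not state which F-Measure variant or shingle settings the experiments used, and the function names `shingles`, `resemblance`, and `f_measure` are hypothetical.

```python
from itertools import combinations


def shingles(text, w=3):
    """Return the set of contiguous w-token shingles of a text.

    Assumption: tokens are lower-cased and split on whitespace.
    """
    tokens = text.lower().split()
    if len(tokens) < w:
        # Texts shorter than w tokens yield one shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard resemblance between the shingle sets of two texts."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def f_measure(predicted, truth):
    """Pairwise F-measure of a clustering against a reference clustering.

    Both arguments map a record id to a cluster label. A pair of records
    counts as positive when both records share a label. This pairwise
    variant is an assumption, not necessarily the one used in the paper.
    """
    ids = sorted(predicted)
    pred_pairs = {p for p in combinations(ids, 2)
                  if predicted[p[0]] == predicted[p[1]]}
    true_pairs = {p for p in combinations(ids, 2)
                  if truth[p[0]] == truth[p[1]]}
    tp = len(pred_pairs & true_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(true_pairs)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, two records would be placed in the same cluster when their resemblance exceeds a chosen threshold; the threshold is a tuning parameter that the abstract does not specify.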