Social tagging systems produce large numbers of ambiguous, uncontrolled tags, which degrade the user experience and limit how efficiently resources can be retrieved. Tag clustering groups tags with similar semantics and reflects the latent semantic structure of the tags, which helps alleviate these problems. Traditional tag clustering methods usually cluster using only the resources' annotated information (the tags each resource has received); because they ignore the users' annotating information (the tags each user has applied), their clusters do not capture tag semantics accurately. In this paper, we propose a comprehensive social tag clustering method based on the LDA (Latent Dirichlet Allocation) model. We first use the users' annotating information and the resources' annotated information to build two LDA topic learning models: a user-based tag topic model and a resource-based tag topic model. From these we learn the user-based and resource-based latent topics of the tags. We then combine each tag's probability distributions over these two sets of latent topics to build a re-learning (second-stage) tag topic model, which learns the mixed topics of the tags iteratively. Finally, each tag is assigned to the cluster of the topic on which it has the maximum probability. Compared with traditional tag clustering methods, our method makes effective use of the semantic relations between tags and mitigates, to some extent, the high-dimensionality and sparsity problems those methods face. Experimental results show that the proposed method performs well.
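The pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation evaluated in the paper. The abstract does not say how the two sets of topic distributions are combined, so the pseudo-document construction used here (repeating each first-stage topic in proportion to its probability, controlled by a hypothetical `scale` parameter) is an assumption, as are the collapsed Gibbs sampler, the uniform topic prior used to obtain p(topic | tag), the hyperparameters, the toy data, and all function names.

```python
import random

def lda_gibbs(docs, n_topics, n_words, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of lists of word ids. Returns (theta, phi), where
    theta[d][k] is the doc-topic and phi[k][w] the topic-word distribution."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]            # topic counts per document
    nkw = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    nk = [0] * n_topics                             # total words per topic
    z = []
    for d, doc in enumerate(docs):                  # random initial assignment
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                         # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + n_words * beta) for j in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                         # resample and restore counts
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    theta = [[(ndk[d][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(nkw[k][w] + beta) / (nk[k] + n_words * beta)
            for w in range(n_words)] for k in range(n_topics)]
    return theta, phi

def tag_topic_probs(phi, n_tags):
    """p(topic | tag), obtained by normalising each tag's column of phi.
    Assumption: a uniform prior over topics."""
    probs = []
    for t in range(n_tags):
        col = [phi[k][t] for k in range(len(phi))]
        total = sum(col)
        probs.append([c / total for c in col])
    return probs

def cluster_tags(user_docs, resource_docs, n_tags,
                 k_user=2, k_res=2, k_mix=2, scale=20):
    # Stage 1: one LDA over users' tag bags, one over resources' tag bags.
    _, phi_u = lda_gibbs(user_docs, k_user, n_tags, seed=1)
    _, phi_r = lda_gibbs(resource_docs, k_res, n_tags, seed=2)
    pu = tag_topic_probs(phi_u, n_tags)
    pr = tag_topic_probs(phi_r, n_tags)
    # Stage 2 (assumed construction): each tag becomes a pseudo-document
    # whose "words" are first-stage topics, repeated in proportion to p.
    pseudo_docs = []
    for t in range(n_tags):
        doc = []
        for feature, p in enumerate(pu[t] + pr[t]):
            doc += [feature] * round(scale * p)
        pseudo_docs.append(doc)
    theta, _ = lda_gibbs(pseudo_docs, k_mix, k_user + k_res, seed=3)
    # Final step: assign each tag to its maximum-probability mixed topic.
    labels = [max(range(k_mix), key=lambda k: row[k]) for row in theta]
    return labels, theta

# Toy data (hypothetical): tag ids index into `tags`.
tags = ["python", "code", "java", "recipe", "cooking", "food"]
user_docs = [[0, 1, 2, 1], [0, 2, 1], [3, 4, 5], [4, 5, 3, 5], [0, 1, 3]]
resource_docs = [[0, 1], [1, 2, 0], [3, 4], [4, 5], [5, 3, 4], [2, 1]]
labels, theta = cluster_tags(user_docs, resource_docs, n_tags=len(tags))
for tag, label in zip(tags, labels):
    print(tag, "-> cluster", label)
```

One property of this arrangement is that the second-stage model works on only `k_user + k_res` features per tag rather than on the full user or resource vocabulary, which is one way to read the claim that the method eases high dimensionality and sparsity.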