Rich entity associations are a prerequisite for, and key to, many applications in heterogeneous information spaces, such as data analysis, data mining, knowledge discovery, and semantic query. Unlike in homogeneous information networks, however, entity associations in heterogeneous information spaces are complex, diverse, and heterogeneous, which makes mining them considerably more challenging. Taking an author bibliographic network as an example, this paper proposes CFRQ4A (clustering, filtering, reasoning and quantifying for associations), a general clustering-based algorithm for mining entity associations in heterogeneous information spaces that consists of four steps: clustering, filtering, reasoning, and quantifying. CFRQ4A leverages not only the attribute values of heterogeneous entities but also the structural (path) information of the heterogeneous information network. Association constraints are introduced in the mining process to ensure that the mined associations are semantically and logically correct, and the filtering step further reduces the search space of the mining algorithm. In addition, an association strength quantification model is proposed based on the characteristics of entity associations. Experimental results on the real-world DBLP dataset show that the proposed algorithm is feasible and effective.
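The abstract names the four steps but gives no details of them, so the following is only a toy illustration of how a cluster, filter, reason, quantify pipeline might be arranged on a small author-paper network. It is not the CFRQ4A algorithm itself. Everything in it is an assumption made for illustration: the toy data, the function names, clustering by venue, the Jaccard threshold `min_sim`, the "at least one connecting path" constraint, and the weight `alpha` in the strength score.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: author -> (venue, keyword set), paper -> author list.
AUTHORS = {
    "a1": ("VLDB", {"graph", "mining"}),
    "a2": ("VLDB", {"graph", "query"}),
    "a3": ("VLDB", {"storage"}),
    "a4": ("ICML", {"graph", "mining"}),
}
PAPERS = {"p1": ["a1", "a2"], "p2": ["a1", "a2"], "p3": ["a2", "a3"]}


def jaccard(x, y):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def cluster(authors):
    """Step 1 (assumed form): group entities by an attribute value, here the venue."""
    groups = defaultdict(list)
    for name, (venue, _keywords) in authors.items():
        groups[venue].append(name)
    return list(groups.values())


def filter_pairs(groups, authors, min_sim):
    """Step 2 (assumed form): keep only within-cluster pairs whose attribute
    similarity reaches min_sim, which shrinks the search space."""
    kept = []
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            sim = jaccard(authors[a][1], authors[b][1])
            if sim >= min_sim:
                kept.append((a, b, sim))
    return kept


def count_paths(papers):
    """Count author-paper-author paths, i.e. shared papers, per author pair."""
    paths = defaultdict(int)
    for members in papers.values():
        for a, b in combinations(sorted(set(members)), 2):
            paths[(a, b)] += 1
    return paths


def mine(authors, papers, min_sim=0.2, alpha=0.5):
    """Steps 3 and 4 (assumed form): accept a candidate pair only if it satisfies
    the constraint of having a connecting path, then score its strength as a
    weighted mix of attribute similarity and normalized path count."""
    paths = count_paths(papers)
    max_paths = max(paths.values(), default=1)
    strengths = {}
    for a, b, sim in filter_pairs(cluster(authors), authors, min_sim):
        n = paths.get((a, b), 0)
        if n == 0:  # assumed association constraint: no path, no association
            continue
        strengths[(a, b)] = alpha * sim + (1 - alpha) * n / max_paths
    return strengths


if __name__ == "__main__":
    for pair, strength in mine(AUTHORS, PAPERS).items():
        print(pair, round(strength, 3))
```

In this sketch the attribute values drive the clustering and filtering steps, while the path information is used only in the reasoning and quantifying steps; a real system could combine the two sources differently.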