To address the record linkage problem for unstructured data extracted from network data sources, a clustering algorithm named TrigSigs, based on trigger pairs, is proposed. It uses the trigger-pair model to mine the associations among hidden attributes in unstructured data and treats these associations as signatures for identifying objects. The algorithm groups the combinations of tokens that are key to identifying objects and filters out noise tokens. It then assigns each token a weight according to its power to discriminate between objects, which makes the feature vector of each record more representative for identifying objects. As a result, it yields finer-grained, object-based clusters and supports linking and merging records in unstructured data. Experiments on real datasets show that the algorithm filters out most noise tokens, assigns weights in line with each token's discriminative power, and substantially improves clustering accuracy.
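As a rough illustration of the idea only, and not the TrigSigs implementation (whose details are not given in this abstract), the Python sketch below assumes that a trigger pair is a pair of tokens whose co-occurrence across records has high pointwise mutual information (PMI). The function names `mine_trigger_pairs`, `token_weights`, and `feature_vector`, the thresholds `min_pmi` and `min_count`, and the toy records are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def mine_trigger_pairs(records, min_pmi=0.5, min_count=2):
    """Return {(a, b): pmi} for token pairs that co-occur strongly.

    Assumption: PMI over record-level co-occurrence stands in for the
    trigger-pair score; the paper's actual scoring may differ.
    """
    n = len(records)
    token_df = Counter()  # number of records containing each token
    pair_df = Counter()   # number of records containing each token pair
    for rec in records:
        tokens = sorted(set(rec))
        token_df.update(tokens)
        pair_df.update(combinations(tokens, 2))

    pairs = {}
    for (a, b), count in pair_df.items():
        if count < min_count:
            continue  # too rare to be a reliable signature
        pmi = math.log((count * n) / (token_df[a] * token_df[b]))
        if pmi >= min_pmi:
            pairs[(a, b)] = pmi
    return pairs


def token_weights(pairs):
    """Weight each token by the strongest trigger pair it takes part in."""
    weights = {}
    for (a, b), score in pairs.items():
        weights[a] = max(weights.get(a, 0.0), score)
        weights[b] = max(weights.get(b, 0.0), score)
    return weights


def feature_vector(record, weights):
    """Keep only weighted tokens; unweighted tokens are treated as noise."""
    return {t: weights[t] for t in set(record) if t in weights}


# Hypothetical toy records: product listings scraped as free text.
records = [
    ["canon", "eos", "450d", "sale"],
    ["canon", "eos", "450d", "sale", "free"],
    ["nikon", "d90", "sale"],
    ["nikon", "d90", "sale", "shipping"],
]
weights = token_weights(mine_trigger_pairs(records))
vectors = [feature_vector(rec, weights) for rec in records]
```

In this toy data, a token such as `sale` that appears in every record has a PMI of zero with any partner and so receives no weight, while tokens that consistently appear together, such as `canon` and `eos`, are kept. The resulting sparse vectors could then be passed to any standard clustering routine; that step is omitted here because the abstract does not say which one is used.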