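Extracting large-scale, high-quality social networks from massive, non-standardized Web data sources has broad application prospects and considerable academic value, but it also faces the great challenge posed by massive computation. To address this, taking the Digg news commenting site as the information source and the extraction of the common interest network among the site's users as the main goal, we propose a cloud-platform-based system framework for social network extraction and implement a MapReduce-based method for large-scale social network extraction. Experimental results show that the proposed method has good extensibility and scalability and is capable of handling the computational task of extracting high-quality, large-scale social networks from heterogeneous Web data sources.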
Extracting large-scale social networks from massive heterogeneous Web data is of both theoretical and practical significance. However, a defining feature of this task is large-scale computation, which remains a great challenge. Cloud computing platforms provide a new opportunity to overcome this challenge. Hence, this work investigates methods for extracting large social networks from Web data using cloud computing techniques. Specifically, we propose a MapReduce-based approach to extract a common interest network from Digg. The experimental results show that the proposed method has good scalability and extensibility and is capable of extracting large-scale, high-quality social networks from heterogeneous Web data sources.