As the number of microblog users grows, the volume of microblog information multiplies, and quickly recommending friends of interest to users from this cluttered information has become an unavoidable technical problem. To address it, a hybrid algorithm that combines the Apriori algorithm with Item-based collaborative filtering is proposed for large-scale microblog data, with Hadoop as the platform, HBase as the underlying data store, and MapReduce as the programming framework, and a friend recommendation system is built on it. The system uses the Apriori algorithm to compute frequent itemsets over the cluttered microblog content records and derive tags that express each user's preferences, which improves the system's time performance; it then uses the Item-based algorithm to match these tags and generate recommendations, which reduces recommendation time and resource usage. To verify the effectiveness and reliability of the system, two groups of comparative experiments were conducted. The first compares the time performance of collaborative filtering with the Apriori algorithm added against traditional collaborative filtering. The second compares the Apriori algorithm combined with Item-based collaborative filtering against the hybrid K-means algorithm. The results show that, at large microblog data volumes, the proposed algorithm shortens running time by 24%-44% compared with traditional collaborative filtering, and improves on the hybrid K-means clustering algorithm by a factor of 1.2-1.5 in both running time and CPU usage. The proposed algorithm can therefore significantly shorten recommendation time, reduce resource consumption, and improve recommendation efficiency.
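The two-stage idea summarized above (Apriori to distill tags from post content, then Item-based matching over those tags) can be illustrated with a small in-memory sketch. It is not the system described here: it runs on a single machine rather than on Hadoop, HBase, or MapReduce, and the function names (`apriori`, `extract_tags`, `tag_similarity`, `recommend_friends`), the toy data, the rule that tags come from frequent itemsets of at least two keywords, and the average-tag-similarity scoring of candidate friends are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        candidates = set()
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def extract_tags(posts_by_user, min_support, min_size=2):
    """Assumed rule: a user's tags are the keywords that appear in that
    user's frequent itemsets containing at least min_size keywords."""
    tags = {}
    for user, posts in posts_by_user.items():
        itemsets = apriori(posts, min_support)
        tags[user] = {kw for s in itemsets if len(s) >= min_size for kw in s}
    return tags


def tag_similarity(user_tags):
    """Item-based cosine similarity between tags over binary user-tag data."""
    owners = defaultdict(set)
    for user, tags in user_tags.items():
        for tag in tags:
            owners[tag].add(user)
    sim = {(t, t): 1.0 for t in owners}
    for a, b in combinations(sorted(owners), 2):
        common = len(owners[a] & owners[b])
        if common:
            score = common / sqrt(len(owners[a]) * len(owners[b]))
            sim[(a, b)] = sim[(b, a)] = score
    return sim


def recommend_friends(target, user_tags, sim, top_n=3):
    """Assumed scoring: average tag-to-tag similarity between two users."""
    mine = user_tags[target]
    scores = {}
    for other, theirs in user_tags.items():
        if other == target or not mine or not theirs:
            continue
        total = sum(sim.get((a, b), 0.0) for a in mine for b in theirs)
        scores[other] = total / (len(mine) * len(theirs))
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [user for user, score in ranked[:top_n] if score > 0]


# Toy data (hypothetical): each post is already reduced to a keyword set.
posts_by_user = {
    "alice": [{"nba", "playoffs", "lakers"}, {"nba", "lakers"}, {"coffee"}],
    "bob": [{"nba", "lakers", "trade"}, {"nba", "lakers"}, {"rain"}],
    "carol": [{"baking", "bread"}, {"baking", "bread", "oven"}, {"nba"}],
    "dave": [{"baking", "bread"}, {"bread", "baking", "flour"}],
}
user_tags = extract_tags(posts_by_user, min_support=2)
sim = tag_similarity(user_tags)
print(recommend_friends("alice", user_tags, sim))
```

With this toy data, one-off keywords such as "coffee" or "rain" fall below the support threshold and never become tags, so the matching stage only compares a handful of tags per user instead of every keyword. That reduction is the intuition behind applying Apriori before collaborative filtering; the sketch makes no claim about the running-time or CPU figures reported above.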