Microblogging is a relatively new concept that emerged with Web 2.0. Twitter is its most representative system: it has more than 160 million users worldwide and considerable global influence, with well-known politicians, celebrities, and major companies among its users. Twitter messages are limited to 140 characters and are often informal in syntax and grammar. Moreover, because Twitter lets users retweet messages in many free-form formats rather than strictly defining a retweet syntax, the system contains a large number of duplicate or near-duplicate messages. These messages waste storage resources and make it harder for users to read, understand, and analyze message content. In this paper, we analyze the syntax of retweeted messages and use its characteristics to derive rules that strip the retweet markers and turn retweets into ordinary messages. We also propose two string-distance methods, character-type statistics and minimum edit distance, for identifying near-duplicate messages on Twitter. In addition, we analyze how Twitter messages are sent and the characteristics of messages posted through different log-in methods. Experimental results show that both methods are extensible, simple to implement, and efficient, and that they can effectively detect and filter near-duplicate messages in microblogs.
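The pipeline described above (strip retweet markers, then compare messages by a string distance) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the retweet patterns (`RT @user:` prefixes and `via @user` suffixes), the interpretation of character statistics as a per-character count difference, the function names, and the threshold `max_edits`.

```python
import re
from collections import Counter

# Hypothetical retweet markers; the paper's actual rule set is not reproduced here.
_RT_PREFIX = re.compile(r"^\s*(?:RT|retweet)\s*:?\s*@\w+\s*:?\s*", re.IGNORECASE)
_VIA_SUFFIX = re.compile(r"\s*\(?\bvia\s+@\w+\)?\s*$", re.IGNORECASE)


def strip_retweet(text: str) -> str:
    """Remove leading 'RT @user:' and trailing 'via @user' markers, repeatedly."""
    prev = None
    while prev != text:
        prev = text
        text = _RT_PREFIX.sub("", text)
        text = _VIA_SUFFIX.sub("", text)
    return text.strip()


def char_count_distance(a: str, b: str) -> int:
    """Sum of absolute differences between per-character counts (assumed reading
    of 'character statistics')."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = curr
    return prev[-1]


def is_near_duplicate(a: str, b: str, max_edits: int = 2) -> bool:
    """Compare two messages after retweet stripping; max_edits is a hypothetical threshold."""
    a, b = strip_retweet(a).lower(), strip_retweet(b).lower()
    # One edit changes the character-count distance by at most 2, so this
    # cheap check can reject pairs before the quadratic edit-distance step.
    if char_count_distance(a, b) > 2 * max_edits:
        return False
    return edit_distance(a, b) <= max_edits
```

For example, `is_near_duplicate("RT @alice: Big news today!", "Big news today")` compares the two messages after the `RT @alice:` prefix has been removed. Using the character-count check as a pre-filter for edit distance is a design choice of this sketch; the paper presents the two distances as separate methods.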