To address the inaccurate similarity measurement caused by the sparse features of microblog short texts, a multi-view similarity algorithm for microblog short texts is proposed. First, common blocks between two short texts are identified by matching words that are identical in form or similar in meaning, and a semantic similarity measure based on the common-block sequence is built from the total number of terms contained in the common blocks and the order in which the blocks are combined. Second, the posting time of each microblog, together with its forwarding and comment information, is used to correct this semantic similarity, yielding a new algorithm for measuring the similarity between microblog short texts. Finally, the new similarity algorithm is integrated into the Single-Pass clustering algorithm to detect microblog topics. Experimental results show that, when applied to microblog topic detection, the algorithm effectively reduces the average miss rate and the false detection rate, among other measures, and thus improves the quality of topic detection.
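The final step, integrating a similarity measure into Single-Pass clustering, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formula for the multi-view similarity, so a plain Jaccard token overlap (`jaccard`) is used here purely as a stand-in, and the function names, the threshold value, and the rule of scoring a cluster by its most similar member are all assumptions made for illustration. Any similarity function with the same signature, including one built from common-block sequences and corrected by posting time, forwarding, and comments, could be passed in its place.

```python
from typing import Callable, List, Sequence

Tokens = Sequence[str]


def jaccard(a: Tokens, b: Tokens) -> float:
    """Stand-in similarity: token-set overlap (NOT the multi-view measure)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def single_pass(
    posts: List[Tokens],
    similarity: Callable[[Tokens, Tokens], float] = jaccard,
    threshold: float = 0.3,  # assumed value, chosen only for the example
) -> List[List[int]]:
    """Cluster posts in arrival order; clusters are lists of post indices."""
    clusters: List[List[int]] = []
    for i, post in enumerate(posts):
        best_score, best_cluster = 0.0, None
        for cluster in clusters:
            # Score a cluster by its most similar member (assumed linkage rule).
            score = max(similarity(post, posts[j]) for j in cluster)
            if score > best_score:
                best_score, best_cluster = score, cluster
        if best_cluster is not None and best_score >= threshold:
            best_cluster.append(i)  # close enough: join the existing topic
        else:
            clusters.append([i])  # no topic is close enough: start a new one
    return clusters


# Toy, already-tokenized posts (hypothetical data).
posts = [
    ["earthquake", "hits", "city"],
    ["city", "earthquake", "rescue"],
    ["new", "phone", "release"],
    ["phone", "release", "date"],
]
clusters = single_pass(posts, threshold=0.3)
print(clusters)  # [[0, 1], [2, 3]]
```

Because each post is compared only with the topics formed so far, the result depends on arrival order and on the threshold. This is why a more accurate similarity measure for sparse short texts can be expected to affect miss and false detection rates directly.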