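Each short text document is treated as a string of characters, digits, and punctuation, and similarity is computed directly from the properties of the strings themselves. Short texts are then clustered hierarchically on this basis to discover hotspots in online public opinion. Because this approach dispenses with feature extraction and text representation, it avoids, to some extent, the sparse feature vectors that hamper traditional short-text representations, and it effectively addresses content-based clustering of short texts. Experimental results show that the proposed method is effective.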
The distinctive linguistic characteristics of short texts degrade the performance of traditional natural language processing methods, or even render them unusable. Accurately representing short texts and calculating the similarity between them is of great help to content-based clustering. This paper treats each short text as a string composed of characters, numbers, and punctuation, and proposes a similarity measure based on string similarity. A public opinion hotspot detection and analysis system based on hierarchical clustering of short texts is then built. Because the method calculates similarity directly on the strings, it skips feature extraction and text representation for short texts and, to a certain extent, avoids the use of sparse feature vectors. Experimental results show the effectiveness of the proposed method.
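
The general idea of clustering short texts directly on string similarity can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state which string similarity measure or linkage criterion is used, so the sketch assumes the standard library's difflib.SequenceMatcher ratio as the similarity and average linkage with a stopping threshold. The function names string_similarity and cluster_short_texts, the threshold value, and the sample texts are all hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] computed on raw characters.

    Assumption: SequenceMatcher's ratio stands in for the paper's
    (unspecified) string similarity measure.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_short_texts(texts, threshold=0.6):
    """Agglomerative (bottom-up) clustering with average linkage.

    Repeatedly merges the two most similar clusters and stops once the
    best average similarity falls below `threshold` (an assumed value).
    Returns a list of clusters, each a list of indices into `texts`.
    """
    n = len(texts)
    # Pairwise similarity matrix, filled symmetrically. No feature
    # vectors are built; only the raw strings are compared.
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = string_similarity(texts[i], texts[j])

    def linkage(c1, c2):
        # Average similarity over all cross-cluster pairs.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


if __name__ == "__main__":
    # Hypothetical sample posts, for illustration only.
    posts = [
        "earthquake hits city center today",
        "earthquake hits city centre today!",
        "new phone released this week",
        "new phone released this weekend",
    ]
    for group in cluster_short_texts(posts):
        print([posts[i] for i in group])
```

In this sketch a cluster that grows large would be a candidate hotspot. The quadratic pairwise comparison is only practical for small collections, and a real system would need a more scalable scheme.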