Because short texts have sparse features and high redundancy, pre-processing and learning methods for microblog short text have become key to microblog information mining and its applications, and they are widely used in many areas. This paper analyzes the characteristics of microblog short text and surveys the pre-processing and learning methods and the current state of their applications, including short-text feature representation, feature expansion and selection, classification and clustering, hot event detection, and automatic summarization. Finally, it identifies the limitations of existing research and outlines directions for future work.