The rapid development of the Internet and other information technologies has left networks holding vast amounts of semi-structured and unstructured text data. Extracting the required information from these massive electronic documents and presenting it as efficient, intuitive information graphics has become a primary task for statistical analysts. The word cloud is a new way of displaying text within information graphics. Using word clouds and topic models as text mining methods, we first preprocess the text by removing numbers and stop words, then perform Chinese word segmentation, build a corpus, and construct a document-term matrix, and finally present the mining results as word clouds and topic models. The experiments, carried out mainly in R, statistically analyze several years of rough set conference summaries and compare the results with the Tagxedo word cloud generator. The results show that important information in the text, such as the topic model, is relatively easy to obtain from the word clouds, and that the mining results are good.
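The preprocessing steps named above (removing numbers, removing stop words, building a document-term matrix) can be illustrated with a minimal sketch. The paper's experiments were carried out in R; the version below is written in Python with the standard library only and is not the authors' implementation. The stop-word list, the toy English documents, and all function names are assumptions made for illustration, and simple letter-run tokenization stands in for Chinese word segmentation, which would need a dedicated segmenter.

```python
import re
from collections import Counter

# Hypothetical stop-word list; not taken from the paper.
STOP_WORDS = {"the", "of", "and", "a", "in"}


def preprocess(text, stop_words=STOP_WORDS):
    """Remove digits, tokenize, and drop stop words."""
    text = re.sub(r"\d+", " ", text.lower())
    # Letter-run tokenization is a stand-in for Chinese word segmentation.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if t not in stop_words]


def document_term_matrix(docs):
    """Return (vocabulary, matrix), one row per document, one column per term."""
    token_lists = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    counts = [Counter(tokens) for tokens in token_lists]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def term_frequencies(vocab, matrix):
    """Sum each column; these totals are what a word cloud scales words by."""
    totals = [sum(col) for col in zip(*matrix)]
    return sorted(zip(vocab, totals), key=lambda p: (-p[1], p[0]))


# Toy corpus (invented for this sketch, not the conference data).
docs = [
    "Rough set theory and attribute reduction in 2010",
    "Granular computing and rough set models",
    "Attribute reduction of rough set decision tables 2012",
]
vocab, dtm = document_term_matrix(docs)
top_terms = term_frequencies(vocab, dtm)
```

The sorted `top_terms` list is the kind of input a word cloud renderer consumes, with more frequent terms drawn larger. Fitting a topic model on the same matrix would require an additional library and is left out of this sketch.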