Blogs are an important medium through which individuals express opinions and sentiment on the web. They typically range over a broad set of topics and carry rich public-opinion information. Most existing sentiment analysis research on user-generated content about social events works at the document level, which is too coarse for in-depth sentiment analysis of blog text. This paper proposes a method for multi-aspect topic sentiment analysis of Chinese blogs based on the LDA topic model and the HowNet lexicon. The method first trains an LDA topic model on a corpus. It then uses the trained model to identify and segment the topics of a blog post, taking a sliding window as the basic processing unit. Finally, it computes the sentiment orientation of each resulting topic segment using the HowNet lexicon. In this way the method identifies both the multiple sub-topics a blog post covers and the sentiment orientation on each sub-topic. Experimental results show that the method not only achieves good topic segmentation but also helps improve the accuracy of sentiment analysis.
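The pipeline summarized above (topic assignment over a sliding window, then lexicon-based polarity per topic segment) can be illustrated with a minimal sketch. It is not the authors' implementation. Several things are assumptions made for illustration: the hand-written `TOPIC_WORD` table stands in for the word distributions of a trained LDA model, the tiny `POLARITY` dictionary stands in for HowNet sentiment word lists, the window is taken to be a sentence plus its neighbours within `radius`, and all function names are hypothetical. Whitespace tokenization is used only to keep the example self-contained; Chinese text would need a word segmenter.

```python
import math

# Hypothetical stand-in for a trained LDA model: P(word | topic).
TOPIC_WORD = {
    "economy": {"price": 0.4, "market": 0.4, "rise": 0.2},
    "health": {"hospital": 0.4, "doctor": 0.4, "care": 0.2},
}
# Hypothetical stand-in for HowNet positive/negative word lists.
POLARITY = {"good": 1, "excellent": 1, "bad": -1, "terrible": -1}
SMOOTH = 1e-3  # probability assigned to words a topic has not seen


def window_topic(tokens):
    """Return the topic under which the window's tokens are most likely."""
    def loglik(topic):
        probs = TOPIC_WORD[topic]
        return sum(math.log(probs.get(t, SMOOTH)) for t in tokens)
    return max(TOPIC_WORD, key=loglik)


def segment_by_topic(sentences, radius=1):
    """Label each sentence from the window around it, then merge
    consecutive sentences that share a label into one topic segment."""
    tokenized = [s.split() for s in sentences]
    segments = []  # each entry is [topic, tokens of the segment]
    for i in range(len(tokenized)):
        lo, hi = max(0, i - radius), min(len(tokenized), i + radius + 1)
        window = [t for sent in tokenized[lo:hi] for t in sent]
        topic = window_topic(window)
        if segments and segments[-1][0] == topic:
            segments[-1][1].extend(tokenized[i])
        else:
            segments.append([topic, list(tokenized[i])])
    return segments


def segment_polarity(tokens):
    """Sum lexicon scores; the sign gives the segment's orientation."""
    return sum(POLARITY.get(t, 0) for t in tokens)


def analyze(sentences, radius=1):
    """Return (topic, polarity score) for each topic segment in order."""
    return [(topic, segment_polarity(tokens))
            for topic, tokens in segment_by_topic(sentences, radius)]


post = [
    "market price rise bad",
    "price rise terrible",
    "doctor care good",
    "hospital care excellent",
]
print(analyze(post))  # [('economy', -2), ('health', 2)]
```

On this toy post the sketch yields two segments, a negative one about the economy and a positive one about health, whereas a single document-level score would sum to zero and hide both opinions.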