A blog post typically attracts many comments, and those comments contain a great deal of noise, so summarizing the main content of a post with the help of its comments is a difficult problem for many blog-based applications. Previously proposed summarization methods are mostly general-purpose methods for multi-document summarization; they do not take the particular characteristics of blogs into account and cannot make effective use of comments when processing a post. Based on an analysis of those characteristics, this paper proposes a new blog summarization method that incorporates comment information. The method first computes a weight for each comment from several features of the comment, then combines a graph model with the HITS algorithm to compute the weight of each sentence in the post, and finally selects summary sentences according to those weights. Experiments on an Ifeng blog (凤凰博客) dataset show that the method outperforms previous methods in terms of ROUGE scores.
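The pipeline described above (weight the comments, run a HITS-style iteration over a comment-sentence graph, pick the top-weighted sentences) can be illustrated with a minimal Python sketch. The abstract does not specify the comment features, how the graph is built, or the exact update rule, so everything concrete here is an assumption made for illustration: comment weights are passed in as precomputed numbers, edges are Jaccard word overlap between a comment and a sentence, comments act as hubs and sentences as authorities, and the names `rank_sentences` and `summarize` are hypothetical.

```python
import math


def tokenize(text):
    # Assumption: whitespace tokenization; real blog text would need proper segmentation.
    return set(text.lower().split())


def jaccard(a, b):
    # Word-overlap similarity, used here as an illustrative edge weight.
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def normalize(values):
    # Scale to unit Euclidean length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)


def rank_sentences(sentences, comments, comment_weights, iterations=50):
    """HITS-style scoring on a bipartite comment-sentence graph.

    comment_weights stands in for the feature-based comment weights;
    how they are computed is not covered by this sketch.
    """
    sent_tokens = [tokenize(s) for s in sentences]
    comm_tokens = [tokenize(c) for c in comments]
    # edges[i][j]: similarity between comment i and sentence j.
    edges = [[jaccard(c, s) for s in sent_tokens] for c in comm_tokens]

    hub = list(comment_weights)  # comments start from their prior weights
    auth = [1.0] * len(sentences)
    for _ in range(iterations):
        # Authority update: a sentence is important if strong comments point to it.
        auth = normalize([
            sum(edges[i][j] * hub[i] for i in range(len(comments)))
            for j in range(len(sentences))
        ])
        # Hub update, scaled by the prior so noisy comments stay down-weighted.
        hub = normalize([
            comment_weights[i] * sum(edges[i][j] * auth[j] for j in range(len(sentences)))
            for i in range(len(comments))
        ])
    return auth


def summarize(sentences, comments, comment_weights, k=1):
    scores = rank_sentences(sentences, comments, comment_weights)
    top = sorted(range(len(sentences)), key=lambda j: scores[j], reverse=True)[:k]
    # Return the selected sentences in their original order.
    return [sentences[j] for j in sorted(top)]
```

Multiplying the hub update by the prior comment weight is one simple way to keep low-quality comments from influencing sentence scores; the paper's actual way of combining the two may differ.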