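Improving the accuracy of automatic summarization helps people obtain valuable information quickly and effectively. Exploiting the strongly structured nature of government documents, this paper proposes an automatic summarization algorithm for government documents based on sentence weight and discourse structure. First, a cursor-based character-extraction sentence-splitting algorithm collects precise statistics on the sentences and words in a document, giving a basic understanding of its content and discourse structure. On this basis, methods for computing word weights and sentence weights from the discourse structure are proposed, and the sentences are ranked by the resulting weights. Then, a number of candidate summary sentences are selected according to the target size of the summary. Finally, the candidate sentences undergo post-processing and the summary sentences are output. Experimental results show that, compared with automatic summarization algorithms of the same type and with the automatic summarization tool provided by Word 2003, the proposed algorithm substantially improves both precision and recall.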
Improving the accuracy of automatic text summarization can help people obtain valuable information more simply and efficiently. Based on the structural characteristics of government documents, this paper proposes an automatic summarization algorithm based on sentence weight and chapter structure. First, accurate statistics on the sentences and words in a document provide a basic understanding of its content and textual structure. Then, word weights and sentence weights are calculated, and the sentences are sorted by weight. Candidate summary sentences are chosen according to the size of the summary. Finally, after some post-processing, the final summary sentences are output. Experimental results show that, compared with similar algorithms, the proposed algorithm considerably improves both precision and recall.
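The pipeline outlined in the abstract (sentence splitting, word and sentence weighting, ranking, selection by summary size, post-processing) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration: word weight is plain term frequency, the structural factor is a fixed boost for the first and last sentence of each paragraph, post-processing is reduced to restoring document order, and the names `split_sentences`, `tokenize`, `summarize`, `lead_boost`, and `tail_boost` are hypothetical. The tokenizer is a crude stand-in for a real Chinese word segmenter.

```python
import re
from collections import Counter

# Assumed sentence-ending marks (full-width and ASCII); not taken from the paper.
SENTENCE_END = "。！？；!?;"


def split_sentences(text):
    """Scan with a cursor and cut a sentence at every end mark."""
    sentences, start = [], 0
    for cursor, ch in enumerate(text):
        if ch in SENTENCE_END:
            piece = text[start:cursor + 1].strip()
            if piece:
                sentences.append(piece)
            start = cursor + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Stand-in tokenizer: Latin/digit runs or single CJK characters."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", sentence.lower())


def summarize(text, ratio=0.3, lead_boost=1.5, tail_boost=1.2):
    """Return the highest-weighted sentences, in document order."""
    records = []  # (document order, sentence, structural factor)
    for paragraph in text.split("\n"):
        sents = split_sentences(paragraph)
        for i, sent in enumerate(sents):
            factor = 1.0
            if i == 0:
                factor = lead_boost      # assumed: paragraph openers matter most
            elif i == len(sents) - 1:
                factor = tail_boost      # assumed: closers matter somewhat
            records.append((len(records), sent, factor))

    # Word weight: term frequency over the whole document (an assumption).
    freq = Counter(tok for _, sent, _ in records for tok in tokenize(sent))

    scored = []
    for order, sent, factor in records:
        tokens = tokenize(sent)
        if not tokens:
            continue
        # Sentence weight: mean word weight scaled by the structural factor.
        weight = factor * sum(freq[t] for t in tokens) / len(tokens)
        scored.append((weight, order, sent))

    # Summary size: a fixed fraction of the sentences, at least one.
    k = max(1, int(len(scored) * ratio))
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:k]

    # Post-processing (simplified): restore original order for readability.
    return [sent for _, _, sent in sorted(top, key=lambda item: item[1])]
```

Calling `summarize(document_text, ratio=0.2)` would return roughly one fifth of the sentences. A weighting scheme closer to the paper's would replace the fixed boosts with factors derived from the actual structure of government documents, such as titles and section headings.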