This paper proposes a bridge-connection pattern filtering algorithm (BPFA) for extracting high-frequency feature words from text without relying on a dictionary. The algorithm first counts the co-occurrence patterns of Chinese characters in a document, together with their frequencies. It then subtracts the bridge-connection frequency of each pattern to obtain its support frequency, and identifies and extracts words according to the support frequency rather than the raw occurrence frequency. Experimental results show that BPFA improves both the precision and the recall of the extracted word set. The algorithm is suited to Chinese information processing applications that are sensitive to word frequency, such as text categorization and automatic summarization.
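
The abstract does not define how a bridge-connection occurrence is detected, so the following is only a minimal sketch of the counting-and-filtering idea under an assumed rule, not the paper's actual procedure. It restricts patterns to two-character pairs and assumes that an occurrence of a pair is a "bridge" when the overlapping pairs on its left and on its right are both strictly more frequent than it. The function names, the bridging test, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter


def count_bigrams(text):
    """Count every adjacent character pair (the raw co-occurrence patterns)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def support_frequencies(text):
    """Return (raw, support) frequency counters for all character pairs.

    ASSUMPTION (not taken from the paper): an occurrence of text[i:i+2] is
    treated as a bridge when the pair overlapping it on the left and the pair
    overlapping it on the right are both strictly more frequent than it.
    Each such occurrence is subtracted from the pair's raw frequency.
    """
    raw = count_bigrams(text)
    support = Counter(raw)
    # Only interior positions have both a left and a right overlapping pair.
    for i in range(1, len(text) - 2):
        left = text[i - 1:i + 1]
        mid = text[i:i + 2]
        right = text[i + 1:i + 3]
        if raw[left] > raw[mid] and raw[right] > raw[mid]:
            support[mid] -= 1
    return raw, support


def extract_words(text, min_support=2):
    """Keep the pairs whose support frequency reaches the threshold."""
    _, support = support_frequencies(text)
    return {pattern for pattern, freq in support.items() if freq >= min_support}
```

On the toy string `"中国人民和中国人民和中国和人民"`, the pair `国人` has a raw frequency of 2, but both occurrences straddle `中国` and `人民` and are removed by the assumed rule, so `extract_words(text, min_support=3)` keeps only `中国` and `人民`. A fuller implementation would also need to handle longer patterns and split the input at punctuation, which this sketch omits.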