A Spelling Correction Method Based on Discriminative Reranking

Journal: Journal of Software (软件学报), 2008.2
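Classification: TP391 [Automation and Computer Technology - Computer Application Technology; Automation and Computer Technology - Computer Science and Technology]
Author affiliations: [1] School of Computer Science and Technology, Tianjin University, Tianjin 300072; [2] Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong; [3] Microsoft Research Asia, Beijing 100080
Funding: Supported by the National Natural Science Foundation of China under Grant No.60603027; the Science-Technology Development Project of Tianjin of China under Grant No.04310941R; the Applied Basic Research Project of Tianjin of China under Grant No.05YFJMJC 11700.
Acknowledgements: We thank the colleagues who supported this work and offered suggestions, in particular Bao Shenghua (包胜华) and Yuan Wei (袁伟) of Shanghai Jiao Tong University and Chen Yi (陈议) of Chongqing University.
Related project: Research on dimensionality reduction and information abstraction models based on information geometry methods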

Keywords: spelling correction, discriminative model, reranking, log mining, query chain

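Chinese abstract (translated):

This paper proposes a spelling correction method based on a discriminative model. It reranks the output of an existing spelling correction system, Aspell, using the discriminative model Ranking SVM to improve its performance. Mature spelling correction techniques (edit distance, letter-based n-grams, phonetic similarity, and the noisy channel model) are integrated into the model as features. This significantly improves the initial ranking quality of the baseline system Aspell, and the resulting performance also exceeds that of the spelling correction modules of some commercial systems (e.g., Microsoft Word 2003). In addition, a method is proposed for automatically extracting spelling correction training pairs from query chains in search engine query logs. A model trained on pairs extracted this way performs close to a model trained on manually annotated data: the two reduce the baseline system's error rate by 32.2% and 32.6%, respectively.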
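The abstract says that training pairs are extracted automatically from query chains in search logs, but it does not state the extraction criteria. The following is only a minimal sketch of the general idea, under an assumption that is not taken from the paper: two consecutive, non-identical queries in one session are treated as a candidate (misspelling, correction) pair when they are highly similar as strings. The function name `extract_pairs`, the threshold value, and the sample session are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical cutoff; the paper's actual filtering rules are not given here.
SIMILARITY_THRESHOLD = 0.8


def extract_pairs(session: List[str]) -> List[Tuple[str, str]]:
    """Return (earlier, later) query pairs that look like a reformulated spelling.

    Assumption: a user who retypes a nearly identical query right away
    is probably correcting a typo in the earlier query.
    """
    pairs = []
    for earlier, later in zip(session, session[1:]):
        if earlier == later:
            continue  # a repeated query carries no correction signal
        similarity = SequenceMatcher(None, earlier, later).ratio()
        if similarity >= SIMILARITY_THRESHOLD:
            pairs.append((earlier, later))
    return pairs


# Made-up session, for illustration only.
session = ["britny spears", "britney spears", "cheap flights"]
print(extract_pairs(session))
```

A similarity filter of this kind will also pick up legitimate reformulations (for example singular to plural), so a real pipeline would need further filtering before the pairs are used as training data.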

English abstract:

This paper proposes an approach to spelling correction that reranks the output of an existing spelling corrector, Aspell. A discriminative model (Ranking SVM) is employed to improve upon the initial ranking, using additional features as evidence. These features are derived from state-of-the-art techniques in spelling correction, including edit distance, letter-based n-grams, phonetic similarity, and the noisy channel model. This paper also presents a method to automatically extract training samples from query chains in search logs. The system greatly outperforms the baseline Aspell, as well as previous models and several off-the-shelf systems (e.g., the spelling corrector in Microsoft Word 2003). The experimental results based on query chain pairs are comparable to those based on manually annotated pairs, with error-rate reductions of 32.2% and 32.6%, respectively.
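To make the reranking idea concrete, here is a minimal sketch of scoring a corrector's candidate list with a linear model over a few features. It is not the paper's implementation: only two of the named feature families are approximated (edit distance and a letter-bigram overlap), the weights in `WEIGHTS` are hand-picked placeholders rather than values learned by a Ranking SVM, and the candidate list is made up rather than real Aspell output.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over sets of letter bigrams."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# Placeholder weights (assumption); a Ranking SVM would learn these
# from preference pairs of the form "candidate A should rank above B".
WEIGHTS: Dict[str, float] = {"edit_distance": -1.0, "bigram_dice": 2.0, "base_rank": -0.1}


def score(misspelling: str, candidate: str, base_rank: int) -> float:
    """Linear score w . x over the feature vector of one candidate."""
    feats = {
        "edit_distance": float(edit_distance(misspelling, candidate)),
        "bigram_dice": bigram_dice(misspelling, candidate),
        "base_rank": float(base_rank),  # position in the base corrector's list
    }
    return sum(WEIGHTS[name] * value for name, value in feats.items())


def rerank(misspelling: str, candidates: List[str]) -> List[str]:
    """Reorder the base corrector's candidates by descending linear score."""
    scored = [(score(misspelling, c, r), c) for r, c in enumerate(candidates)]
    return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)]


# Made-up candidate list standing in for a base corrector's output.
print(rerank("speling", ["spewing", "spieling", "spelling"]))
```

Keeping the base corrector's rank as a feature lets the reranker fall back on the initial ordering when the other features do not separate the candidates.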
