This paper proposes a discriminative approach to spelling correction that reranks the output of an existing spelling corrector, Aspell. A Ranking SVM is used to improve on Aspell's initial ranking, with established spelling-correction techniques (edit distance, letter-based n-grams, phonetic similarity, and the noisy channel model) integrated into the model as features. The resulting system substantially improves on the Aspell baseline and also outperforms previous models and several off-the-shelf systems (e.g., the spelling corrector in Microsoft Word 2003). The paper also presents a method for automatically extracting spelling-correction training pairs from query chains in search engine query logs. A model trained on these automatically extracted pairs performs comparably to one trained on manually annotated pairs: the two reduce the baseline's error rate by 32.2% and 32.6%, respectively.
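As a rough illustration of feature-based reranking, the sketch below scores each candidate correction with a linear function over two of the feature types named above (edit distance and letter n-gram overlap) and sorts the candidates by that score. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the choice of bigrams with a Dice coefficient, and the weight values are all hypothetical, and in the proposed approach a Ranking SVM would learn the weights from preference pairs rather than have them fixed by hand.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bigram_overlap(a: str, b: str) -> float:
    """Dice coefficient over letter bigrams, with word-boundary markers."""
    def grams(s: str) -> Counter:
        padded = f"^{s}$"
        return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

    ga, gb = grams(a), grams(b)
    shared = sum((ga & gb).values())  # multiset intersection
    return 2 * shared / (sum(ga.values()) + sum(gb.values()))


# Hypothetical, hand-picked weights (assumption for illustration only).
# A Ranking SVM would learn a weight vector from training pairs instead.
WEIGHTS = {"edit_distance": -1.0, "bigram_overlap": 2.0}


def features(misspelling: str, candidate: str) -> dict:
    """Feature vector for one (misspelling, candidate) pair."""
    return {
        "edit_distance": float(edit_distance(misspelling, candidate)),
        "bigram_overlap": bigram_overlap(misspelling, candidate),
    }


def rerank(misspelling: str, candidates: list) -> list:
    """Sort candidates by linear score, best first."""
    def score(candidate: str) -> float:
        feats = features(misspelling, candidate)
        return sum(WEIGHTS[name] * value for name, value in feats.items())

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    # The candidate list stands in for the output of a base corrector.
    print(rerank("speling", ["sapling", "spieling", "spelling"]))
```

Phonetic similarity and noisy channel scores would enter the same way, as additional entries in the feature dictionary with their own learned weights.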