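The many-to-many mapping between Mongolian character codes and glyphs, together with inconsistent text entry and other factors, leaves raw corpora with severe spelling variation and glyph-level misspellings, which has become a bottleneck for large-scale data processing. Taking a Mongolian input method as an example, this paper uses a large lexicon and a glyph-code generator to recast the best-path search over a word lattice built on correct pronunciations as a path search over a word lattice built on glyph codes, which effectively addresses the problem of statistical modeling on raw text. Experimental results show that this method, together with a model optimization based on glyph merging, significantly improves input efficiency, and that it is a broadly useful reference for research on Mongolian "pronunciation-to-word" and "spelling-to-word" conversion.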
Language modeling of Mongolian text is challenged by characters that look the same but carry different codes, owing to the different pronunciations of a character in various contexts. To address this issue for spelling input, this paper adopts a large dictionary with correct pronunciations and trains a statistical spelling model that finds the most probable pronunciation sequence directly from the candidate code sequence. Experiments indicate that a more efficient spelling input method is achieved, which is also informative for "pronunciation-to-word" conversion and "spelling-to-word" conversion.
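
The idea of training on glyph codes and then searching a word lattice can be illustrated with a minimal sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: the Latin strings are toy stand-ins for Mongolian words, `SHAPE_CLASSES` is a hypothetical table of letters treated as sharing one written shape, the model is an add-one smoothed bigram, and `glyph_code`, `train_bigrams`, and `best_path` are hypothetical names.

```python
import math
from collections import Counter

# Toy assumption: Latin letters stand in for Mongolian letters. Letters listed
# here are treated as sharing one written shape and collapse to one symbol.
SHAPE_CLASSES = str.maketrans({"u": "o", "d": "t"})


def glyph_code(word):
    """Map a spelling to a code that describes only how the word looks."""
    return word.translate(SHAPE_CLASSES)


def train_bigrams(corpus):
    """Count unigrams and bigrams over glyph codes, so spelling variants merge."""
    uni, bi = Counter(), Counter()
    for sentence in corpus:
        codes = ["<s>"] + [glyph_code(w) for w in sentence]
        uni.update(codes)
        bi.update(zip(codes, codes[1:]))
    return uni, bi


def log_prob(prev, cur, uni, bi):
    """Add-one smoothed bigram log-probability over glyph codes."""
    return math.log((bi[(prev, cur)] + 1) / (uni[prev] + len(uni)))


def best_path(lattice, uni, bi):
    """Viterbi search; lattice is a list of non-empty columns of candidate words."""
    # paths maps the last chosen word to (best score, words chosen so far).
    paths = {"<s>": (0.0, [])}
    for column in lattice:
        new_paths = {}
        for word in column:
            code = glyph_code(word)
            # "<s>" contains no mapped letters, so glyph_code leaves it unchanged.
            new_paths[word] = max(
                (score + log_prob(glyph_code(prev), code, uni, bi), words + [word])
                for prev, (score, words) in paths.items()
            )
        paths = new_paths
    return max(paths.values())[1]


# Raw corpus with three spellings ("udur", "odor", "otor") of one written form.
corpus = [["sain", "udur"], ["sain", "odor"], ["sain", "otor"], ["san", "bain"]]
uni, bi = train_bigrams(corpus)

# One column of candidate words per typed unit.
lattice = [["san", "sain"], ["bain", "udur"]]
print(best_path(lattice, uni, bi))  # ['sain', 'udur']
```

Because the three variant spellings collapse to the single code `otor`, their counts pool into one bigram instead of being split three ways, which is the effect the glyph-merging step is meant to achieve on noisy raw text. Candidates that share a glyph code score identically under this sketch, so a real system would need the lexicon to pick the correctly encoded form.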