To address the difficulty of incorporating additional features into N-gram models for Pinyin-to-character conversion, this paper proposes a conversion model based on support vector machines (SVM), which provides a framework for integrating multiple knowledge sources. The strong generalization ability of SVM mitigates the overfitting to which traditional models are prone, and soft-margin classification overcomes, to some extent, the noise problem in small samples. In addition, rough set theory is used to extract complex features and long-distance features, which are fused into the SVM model, overcoming the difficulty traditional models have in capturing long-distance constraints. Experimental results show that the SVM-based Pinyin-to-character conversion model achieves 1.2% higher precision than a traditional trigram model with absolute smoothing, and that the SVM model with added long-distance features achieves 1.6% higher precision.
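As an illustration of what fusing several knowledge sources into one classifier input could look like, the following minimal sketch builds a sparse feature vector for a single conversion decision. It is not the feature set used in this paper: the function name `extract_features`, the feature naming scheme, and the `triggers` table (a stand-in for long-distance rules such as those mined with rough sets) are all assumptions made for the example. A soft-margin SVM would then be trained on vectors of this kind to choose among the candidate characters for a syllable.

```python
from typing import Dict, List, Optional


def extract_features(
    pinyin: List[str],
    left_chars: List[str],
    i: int,
    triggers: Optional[Dict[str, str]] = None,
) -> Dict[str, float]:
    """Build a sparse feature vector for choosing the character at position i.

    pinyin     -- syllables of the whole sentence
    left_chars -- characters already chosen for positions 0..i-1
    triggers   -- hypothetical map from a trigger character to a feature name,
                  standing in for long-distance rules (an assumption here)
    """
    # The syllable being converted.
    feats: Dict[str, float] = {"py0=" + pinyin[i]: 1.0}

    # Local context comparable to what a trigram model sees.
    if i >= 1:
        feats["c-1=" + left_chars[i - 1]] = 1.0
    if i >= 2:
        feats["c-2,c-1=" + left_chars[i - 2] + left_chars[i - 1]] = 1.0

    # Right-hand pinyin context, unavailable to a left-to-right trigram.
    if i + 1 < len(pinyin):
        feats["py+1=" + pinyin[i + 1]] = 1.0

    # Long-distance features: fire when a trigger character occurs
    # to the left of the two-character trigram window.
    for ch in left_chars[: max(i - 2, 0)]:
        name = (triggers or {}).get(ch)
        if name:
            feats["far=" + name] = 1.0

    return feats


# Example: deciding the character for "qian" in "ta zai yin hang qu qian".
example = extract_features(
    ["ta", "zai", "yin", "hang", "qu", "qian"],
    ["他", "在", "银", "行", "取"],
    5,
    {"银": "finance"},
)
```

In this example the character 银 lies outside the trigram window of the target position, so only the long-distance feature can carry its evidence to the classifier.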