Uyghur and Arabic are both written in cursive, right-to-left scripts based on the Arabic alphabet. Research on methods for recognizing them is important for making use of the content of multilingual text images. Using the HTK toolkit, we build separate recognition systems for printed Uyghur and printed Arabic text based on hidden Markov models (HMMs), with distribution density features and local directional features used in the feature extraction stage. We also use HTK to build statistical language models for Uyghur and Arabic, and apply them to improve the performance of the recognition systems. Experimental results show that statistical language models effectively improve recognition performance: on a printed Uyghur test set of 24 000 words, the recognition rate rises from 78.28% to 97.45% with the language model, and on a printed Arabic test set of 759 words, it rises from 79.07% to 85.80%.