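To exploit the image characteristics specific to text images, a text image classification method based on a combination of low-level image features is proposed. The method uses a two-layer C4.5 decision tree classifier and can effectively classify text images into caption text images, document images, and scene text images. First, each sample image is converted to a grayscale image and features of its gray-level histogram are extracted; document images can be separated out first on the basis of differences in these histogram features. The remaining images are then converted to binary images and their GLCM (gray-level co-occurrence matrix) texture features are extracted; scene text images and caption text images are distinguished according to these GLCM features. Simulation experiments in the open-source WEKA data mining environment show that the method is feasible and achieves high recall and precision.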
This paper proposes a text image classification method based on a combination of low-level image features. Using a two-layer C4.5 decision tree classifier, the method divides text images into caption text images, document images, and scene text images. Classification is a two-step process. First, the sample image is converted to a grayscale image for histogram feature extraction, and document images are distinguished according to the differing characteristics of their gray-level histograms. Second, the remaining images are converted into binary images to extract their GLCM features, according to which scene text images and caption text images are distinguished. Simulation experiments were carried out in the open-source WEKA data mining software; the results show that the method is feasible and achieves favorable recall and precision.
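The two-step procedure described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not say which histogram or GLCM statistics were used, so the features below (histogram mean, variance, and entropy; GLCM contrast, energy, and homogeneity), the fixed binarization threshold of 128, and the horizontal pixel offset are all assumptions. The function names are hypothetical, and the two stage classifiers are passed in as plain callables standing in for the C4.5 trees trained in WEKA.

```python
import math


def gray_histogram_features(gray, levels=256):
    """Mean, variance, and entropy of the normalized gray-level histogram.

    `gray` is a list of rows of integer intensities in [0, levels).
    The choice of these three statistics is an assumption.
    """
    hist = [0] * levels
    count = 0
    for row in gray:
        for value in row:
            hist[value] += 1
            count += 1
    probs = [h / count for h in hist]
    mean = sum(i * p for i, p in enumerate(probs))
    variance = sum((i - mean) ** 2 * p for i, p in enumerate(probs))
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return {"mean": mean, "variance": variance, "entropy": entropy}


def binarize(gray, threshold=128):
    """Fixed-threshold binarization (the threshold value is an assumption)."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray]


def glcm_features(image, levels=2, dx=1, dy=0):
    """Contrast, energy, and homogeneity of a normalized co-occurrence matrix.

    Counts pairs (image[y][x], image[y + dy][x + dx]); the default offset
    pairs each pixel with its right-hand neighbour (an assumption).
    """
    glcm = [[0.0] * levels for _ in range(levels)]
    total = 0
    height = len(image)
    for y in range(height):
        for x in range(len(image[y])):
            x2, y2 = x + dx, y + dy
            if 0 <= y2 < height and 0 <= x2 < len(image[y2]):
                glcm[image[y][x]][image[y2][x2]] += 1
                total += 1
    if total == 0:
        raise ValueError("image too small for the chosen offset")
    contrast = energy = homogeneity = 0.0
    for i in range(levels):
        for j in range(levels):
            p = glcm[i][j] / total
            contrast += (i - j) ** 2 * p
            energy += p * p
            homogeneity += p / (1 + abs(i - j))
    return {"contrast": contrast, "energy": energy, "homogeneity": homogeneity}


def classify_text_image(gray, is_document, is_scene_text):
    """Two-stage cascade: histogram features first, GLCM features second.

    `is_document` and `is_scene_text` are placeholder callables that stand in
    for the two trained decision trees.
    """
    if is_document(gray_histogram_features(gray)):
        return "document"
    features = glcm_features(binarize(gray))
    return "scene_text" if is_scene_text(features) else "caption_text"
```

In practice the two callables would be replaced by decision trees trained on labeled feature vectors; the cascade structure itself only fixes the order in which the two feature sets are consulted.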