以大规模网络维吾尔文文本的自动分类技术研究为背景,设计模块化结构的维吾尔文本分类系统,在深入调研基础上选择Naive Bayes算法为分类引擎,用C#实现分类系统。预处理中,结合维吾尔语的词法特征,通过引入词干提取方法大大降低特征维数。在包含10大类共计3000多个文本的较大规模语料库基础上给出分类实验结果,再通过χ²统计方法选择不同数目的特征,也分别给出分类实验结果。结果表明,预处理后的维吾尔文特征空间中只有1%-3%特征是最佳的,因而进一步确定哪些是最佳特征或降低特征空间维数是有可能的。
This paper addresses the automatic classification of large-scale Uyghur text collected from the web. We design a modular Uyghur text classification system, select the Naive Bayes algorithm as the classification engine after a thorough survey, and implement the system in C#. In the preprocessing stage, we exploit the lexical characteristics of the Uyghur language and introduce stemming, which greatly reduces the feature dimensionality. We report classification results on a relatively large corpus of more than 3000 documents in 10 categories, and we also report results for different numbers of features selected with the χ² (chi-square) statistic. The results show that only 1% to 3% of the features in the preprocessed Uyghur feature space are the best ones, so it is feasible to identify the best features or to reduce the dimensionality of the feature space further.
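The pipeline summarized above (stemmed tokens, χ² feature selection, Naive Bayes classification) can be illustrated with a minimal sketch. It is written in Python rather than the C# of the described system and is not the authors' implementation. The function names, the rule of scoring a term by its maximum χ² over categories, the multinomial model with Laplace smoothing, and the toy English tokens standing in for Uyghur stems are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score.

    Assumption: a term's score is its maximum chi-square over all categories.
    """
    n_docs = len(docs)
    cat_count = Counter(labels)
    df = Counter()                 # document frequency of each term
    df_cat = defaultdict(Counter)  # document frequency per category and term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[label][term] += 1
    scores = {}
    for term in df:
        best = 0.0
        for cat, n_cat in cat_count.items():
            a = df_cat[cat][term]
            b = df[term] - a
            c = n_cat - a
            d = n_docs - n_cat - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Break ties alphabetically so the selection is deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing over vocab."""
    prior = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for cat in prior:
        total = sum(term_counts[cat].values())
        log_prior = math.log(prior[cat] / len(labels))
        log_like = {
            t: math.log((term_counts[cat][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[cat] = (log_prior, log_like)
    return model


def classify(model, tokens):
    """Return the category with the highest log posterior; unseen terms are ignored."""
    def score(cat):
        log_prior, log_like = model[cat]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(model, key=score)


# Toy corpus: tokens are assumed to be already stemmed.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["market", "stock", "price"],
    ["price", "bank", "market"],
]
labels = ["sport", "sport", "economy", "economy"]

vocab = select_features(docs, labels, k=4)
model = train_nb(docs, labels, vocab)
prediction = classify(model, ["goal", "team", "bank"])
```

In this toy corpus, setting `k=4` keeps half of the eight distinct terms. The experiments summarized above vary the number of selected features in the same way, which in this sketch corresponds to changing `k`.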