针对维吾尔文书写的数字文本的犯罪取证,提出一种基于文本分类的维吾尔文数字取证方案。首先,对维吾尔文文本进行预处理,滤除文本中非维吾尔文字符和停用词;然后,提出一种多特征空间正则化互信息(M-FNMI)算法,使用输入特征组合与类之间的互信息(MI)来代替单个特征与类之间的MI,从而提取出更准确的特征词;最后,利用支持向量机(SVM)算法来对特征进行分类。实验结果表明,该方案具有较高的分类精度,能够为犯罪取证提供判断依据。
For the crime forensics of digital texts written in Uighur,a Uyghur digital forensic scheme based on text categorization is proposed. The Uyghur texts are preprocessed to filter the non Uyghur characters and stop words. A multi-feature space normalized mutual information(M-FNMI)algorithm is proposed. The mutual information(MI)between input feature combination and class is used to replace the MI between the single feature and class,so as to extract more accurate feature words. The support vector machine(SVM)algorithm is used to classify those features. Experimental results show that the proposed scheme has higher classification accuracy,and can provide a basis for criminal evidence collection.