In the era of big data, text mining faces the "high-dimensional, sparse" feature problem: the mismatch between the enormous vocabulary of a corpus and the small number of key features leads to high time and space complexity and severely limits the efficiency of text mining. Effective data preprocessing before mining is therefore essential. Traditional text mining algorithms perform only word segmentation and stop-word removal in the preprocessing stage. To improve performance, a data preprocessing algorithm based on term frequency statistics rules (DPTFSR) is proposed. First, an expression for the number of terms that share the same frequency is derived from Zipf's law and the maximum value method. Second, this expression is used to study how terms of each frequency are distributed in a text; the results show that terms with a frequency of 1 or 2 are only weakly related to the document, yet account for as much as 2/3 of the terms. Finally, the data are preprocessed according to these term frequency statistics rules: low-frequency terms are removed in the preprocessing stage, which reduces the feature dimension. Experiments on the public data sets Reuters-21578 and 20-Newsgroups confirm the distribution rule for terms of each frequency and show that DPTFSR improves classification accuracy, precision, recall, and F1 measure while markedly reducing running time, so the efficiency of text mining is significantly improved.
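The low-frequency filtering step can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions. The function names `expected_share` and `remove_low_frequency_terms` and the parameter `min_count` are hypothetical. Term frequencies are counted over whatever token list the caller passes in, which may be a single document or a whole corpus. The share formula 1/(n(n+1)) is the standard consequence of an idealized Zipf's law for the fraction of distinct terms that occur exactly n times; the derivation in the paper may differ in detail. Under that formula, terms with frequency 1 and 2 together make up 1/2 + 1/6 = 2/3 of the distinct terms, which agrees with the proportion reported above.

```python
from collections import Counter


def expected_share(n: int) -> float:
    """Idealized Zipf estimate of the fraction of distinct terms
    that occur exactly n times: 1 / (n * (n + 1)).
    (Assumption: a textbook result, not necessarily the paper's exact expression.)
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1.0 / (n * (n + 1))


def remove_low_frequency_terms(tokens, min_count=3):
    """Drop every token whose total count in `tokens` is below `min_count`.

    The default of 3 removes terms with frequency 1 or 2, mirroring the
    low-frequency terms discussed in the abstract. The threshold is a
    hypothetical parameter introduced for this illustration.
    """
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


if __name__ == "__main__":
    # Tokens are assumed to be already segmented and stop-word filtered.
    sample = "a a a b b c a d d d".split()
    print(remove_low_frequency_terms(sample))
    print(expected_share(1) + expected_share(2))
```

Filtering by raw count keeps the remaining tokens in their original order, so the result can be passed directly to any later feature extraction step.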