When high-dimensional data are processed, the number of samples required grows exponentially with the dimension, while distances between samples become progressively less informative; together these effects give rise to the curse of dimensionality. Text tag data are typically very high-dimensional, which hampers the detection of spam tags (social spam). This paper uses the mathematical model of the support vector machine (SVM) to construct a large-scale spam tag detection model for Folksonomy. To reduce the impact of high dimensionality on spam tag detection, and inspired by kernel principal component analysis (KPCA) theory, the idea of dimensionality reduction is introduced into data reduction, and a KPCA-based reduction model for large-scale SVM data sets is proposed. Instantiating this model yields a new spam tag detection method: a large-scale spam tag detection model based on kernel principal component analysis and the support vector machine (KPCA-SVM). In spam tag detection, the model shortens test time without compromising the characteristics of the data, while maintaining good detection performance.
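The general idea of chaining kernel PCA with an SVM can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy are available and uses synthetic data in place of real Folksonomy tag features. The helper names `build_kpca_svm` and `make_toy_tag_data`, the RBF kernel, and the choice of 10 components are illustrative assumptions, not the configuration reported in the paper, whose data set reduction model is not specified in the abstract.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_kpca_svm(n_components=10, gamma=None):
    """Hypothetical helper: scale features, reduce them with kernel PCA, classify with an SVM."""
    return make_pipeline(
        StandardScaler(),
        # RBF kernel is an assumption; gamma=None lets scikit-learn use 1 / n_features.
        KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma),
        # A linear SVM operates on the reduced kernel-PCA features.
        SVC(kernel="linear"),
    )


def make_toy_tag_data(n_samples=200, n_features=300, seed=0):
    """Synthetic stand-in for high-dimensional tag vectors (label 1 = spam, 0 = legitimate)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)
    X = rng.normal(size=(n_samples, n_features))
    X[:, :5] += 3.0 * y[:, None]  # only the first five features carry class signal
    return X, y


X, y = make_toy_tag_data()
model = build_kpca_svm(n_components=10)
model.fit(X[:150], y[:150])              # train on the first 150 samples

reduced = model[:-1].transform(X[150:])  # features after scaling and kernel PCA
accuracy = model.score(X[150:], y[150:]) # held-out accuracy on the remaining 50
print(reduced.shape, accuracy)
```

Here the SVM is trained and tested on 10 kernel principal components rather than the 300 original features, which is the mechanism by which a reduction step of this kind can shorten test time; how much accuracy is retained depends on the kernel and the number of components kept.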