This paper proposes a new text feature reduction method that combines probabilistic latent semantic analysis (PLSA) with adaptive general particle swarm optimization (AGPSO), and implements a webpage classifier based on PLSA and AGPSO. PLSA is used to embody semantic relationships in the vector space model (VSM), and the EM algorithm effectively reduces the dimension of the vector space. A crossover operation is designed to simulate the change in particle flying velocity, and a mutation operation maintains the diversity of the population; in addition, an adaptive strategy dynamically adjusts the mutation probability in order to obtain the optimal feature subset. Before AGPSO is applied, PLSA first reduces the original feature space to produce an intermediate feature subset, which AGPSO then reduces further, so that the strengths of both methods are fully exploited. Experimental results indicate that the algorithm effectively reduces text dimensionality and improves classification precision.
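The crossover, mutation, and adaptive mutation-probability steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rules, so everything concrete here is an assumption made for illustration rather than the authors' formulation: particles are binary feature masks, "flight" is modeled as uniform crossover with the personal and global bests, and the adaptive rule raises the mutation probability as swarm diversity falls. The names `agpso_select`, `adaptive_mutation_prob`, `p_min`, and `p_max`, and the toy fitness function, are hypothetical. The PLSA stage is omitted; the masks are assumed to index an intermediate feature subset that has already been reduced.

```python
import random


def crossover(a, b, rng):
    """Uniform crossover: each bit is taken from a or b with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(mask, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in mask]


def adaptive_mutation_prob(swarm, p_min=0.01, p_max=0.2):
    """Assumed adaptive rule: lower swarm diversity gives a higher mutation probability."""
    n, dim = len(swarm), len(swarm[0])
    diversity = 0.0
    for j in range(dim):
        ones = sum(p[j] for p in swarm) / n
        # 0 when all particles agree on bit j, 1 when they are split evenly.
        diversity += 1.0 - abs(2.0 * ones - 1.0)
    diversity /= dim
    return p_max - (p_max - p_min) * diversity


def agpso_select(fitness, dim, n_particles=20, n_iter=50, seed=0):
    """Search for a binary feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    swarm = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in swarm]
    pbest_fit = [fitness(p) for p in swarm]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        p_mut = adaptive_mutation_prob(swarm)
        for i in range(n_particles):
            # "Flight": cross with the personal best, then with the global best.
            cand = crossover(swarm[i], pbest[i], rng)
            cand = crossover(cand, gbest, rng)
            # Mutation keeps the population diverse.
            cand = mutate(cand, p_mut, rng)
            swarm[i] = cand
            f = fitness(cand)
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = cand[:], f
                if f > gbest_fit:
                    gbest, gbest_fit = cand[:], f
    return gbest, gbest_fit


if __name__ == "__main__":
    # Toy fitness (hypothetical): reward three "informative" features,
    # penalize the size of the selected subset.
    informative = (0, 3, 7)

    def toy_fitness(mask):
        return sum(mask[j] for j in informative) - 0.1 * sum(mask)

    best_mask, best_score = agpso_select(toy_fitness, dim=12)
    print(best_mask, best_score)
```

In a real classifier the fitness would more plausibly be a validation accuracy measured on the features selected by the mask, possibly with a penalty on subset size; the toy function above only stands in for that evaluation.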