Stemming and lemmatization are important steps in English text processing. Using three clustering algorithms, this paper presents a fairly comprehensive experimental comparison of two stemming algorithms and one lemmatization algorithm. The results show that both stemming and lemmatization can improve the effectiveness and efficiency of English text clustering, although their effect on the clustering results is not significant. Compared with the Snowball stemmer and the Stanford lemmatizer, the Porter stemmer performs better and more stably in terms of Entropy and Purity.
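For readers unfamiliar with the two evaluation measures named above, the following is a minimal sketch of how Purity and Entropy are commonly computed for a clustering against reference class labels. It assumes the usual textbook definitions (size-weighted, unnormalized entropy with base-2 logarithms); the abstract does not state which variant the experiments use, and the function name `purity_and_entropy` is illustrative only.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) of a clustering against reference classes.

    Assumed definitions (not confirmed by the abstract):
      purity  = (1/n) * sum over clusters of the largest class count
      entropy = sum over clusters of (cluster size / n) * H(class
                distribution inside the cluster), with log base 2
    Higher purity and lower entropy indicate a better clustering.
    """
    if len(cluster_ids) != len(class_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(cluster_ids)

    # Count how many documents of each reference class fall in each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1

    purity = 0.0
    entropy = 0.0
    for counts in members.values():
        size = sum(counts.values())
        # Contribution of the dominant class in this cluster.
        purity += max(counts.values()) / n
        # Shannon entropy of the class distribution within this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / n) * h
    return purity, entropy


# Toy example: six documents, two clusters, two reference classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
p, e = purity_and_entropy(clusters, labels)
```

In this toy example one cluster mixes two classes and the other is pure, so purity falls below 1 and entropy rises above 0. Under these definitions, a preprocessing method would count as better on a given dataset when it yields higher purity and lower entropy.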