As the Internet grows and the volume of online data keeps increasing, more users rely on search engines to find the information they need, yet it becomes harder for search engines to return that information precisely. To cluster search results more effectively and help users locate information quickly among otherwise unstructured result lists, this paper introduces a label-based clustering algorithm. The key idea is to use dependency syntax parsing from natural language processing (NLP), together with dictionary resources, to mine semantic structure in depth and extract the ontologies related to the query. The extracted ontologies then guide the choice of initial centroids in K-means clustering. The characteristics of the K-means algorithm are analyzed in depth, and cluster labels are used to improve it. Experiments show that the proposed algorithm yields better clusters than general-purpose clustering algorithms, provides more informative descriptions of the results, and substantially improves efficiency.
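The idea of letting labels guide the choice of initial centroids can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `label_seeded_kmeans`, the `label_terms` mapping, and the bag-of-words vectors are all assumptions made for illustration, and the hand-supplied seed words merely stand in for the ontologies that the paper extracts with dependency parsing and dictionary resources.

```python
from collections import Counter
import math


def vectorize(text, vocab):
    """Return an L2-normalised term-frequency vector over vocab."""
    counts = Counter(text.lower().split())
    vec = [float(counts[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def label_seeded_kmeans(snippets, label_terms, max_iter=20):
    """K-means whose initial centroids are built from label seed words.

    label_terms (hypothetical) maps a cluster label to a list of seed
    words; here they are supplied by hand rather than extracted.
    """
    labels = list(label_terms)
    vocab = sorted({w for s in snippets for w in s.lower().split()}
                   | {w for ws in label_terms.values() for w in ws})
    docs = [vectorize(s, vocab) for s in snippets]
    # One initial centroid per label, instead of a random choice.
    centroids = [vectorize(" ".join(label_terms[l]), vocab) for l in labels]

    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each snippet to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(d, centroids[k]))
            for d in docs
        ]
        if new_assignment == assignment:
            break  # converged: no snippet changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [d for d, a in zip(docs, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    # Each cluster keeps the label that seeded it, which doubles as its
    # description.
    return {l: [s for s, a in zip(snippets, assignment) if a == i]
            for i, l in enumerate(labels)}
```

Calling `label_seeded_kmeans(snippets, {"animal": ["cat", "jungle"], "vehicle": ["car", "engine"]})` on a handful of result snippets for an ambiguous query returns a dictionary from label to snippets. Because every cluster starts from a labelled seed, the result is deterministic and each cluster already has a readable description, which is the property the abstract attributes to the label-based approach.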