The traditional K-means clustering algorithm measures the similarity between samples with Euclidean distance. Because it ignores how much each attribute contributes to distinguishing samples, the similarity measure is inaccurate and clustering performance is poor. To address this problem, an improved K-means clustering algorithm is proposed. The information gain ratio of each attribute with respect to the cluster class is computed and used as that attribute's weight in a weighted Euclidean distance, so that attributes contributing more to class discrimination receive larger weights and the similarity between samples is measured more accurately. Experiments on the classic UCI KDD CUP intrusion detection dataset show that, compared with the traditional K-means-based intrusion detection method, the proposed method effectively improves detection accuracy.
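The weighting idea can be illustrated with a minimal sketch. It is not the paper's implementation, and several details are assumptions because the abstract does not specify them: class labels are taken as given (for example from labelled training data or an initial clustering pass), numeric attributes are discretized by equal-width binning before the gain ratio is computed, the gain ratios are normalized to sum to 1, and the distance takes the form d(x, y) = sqrt(sum_j w_j (x_j - y_j)^2). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, bins=4):
    """Information gain ratio of one numeric attribute w.r.t. the labels.

    Equal-width binning is an assumption of this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant attribute carries no information
    width = (hi - lo) / bins
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(values)
    conditional = 0.0
    for b in set(binned):
        subset = [y for x, y in zip(binned, labels) if x == b]
        conditional += len(subset) / n * entropy(subset)
    gain = entropy(labels) - conditional
    split_info = entropy(binned)
    return gain / split_info if split_info > 0 else 0.0


def attribute_weights(data, labels, bins=4):
    """One weight per attribute: gain ratios normalized to sum to 1 (assumed)."""
    ratios = [gain_ratio([row[j] for row in data], labels, bins)
              for j in range(len(data[0]))]
    total = sum(ratios)
    if total == 0:
        return [1.0 / len(ratios)] * len(ratios)  # fall back to equal weights
    return [r / total for r in ratios]


def weighted_distance(a, b, w):
    """Weighted Euclidean distance sqrt(sum_j w_j * (a_j - b_j)^2)."""
    return math.sqrt(sum(wj * (x - y) ** 2 for wj, x, y in zip(w, a, b)))


def weighted_kmeans(data, init_centers, w, iters=20):
    """Lloyd-style K-means that assigns points using the weighted distance."""
    centers = [list(c) for c in init_centers]
    assign = []
    for _ in range(iters):
        assign = [min(range(len(centers)),
                      key=lambda c: weighted_distance(p, centers[c], w))
                  for p in data]
        for c in range(len(centers)):
            members = [p for p, a in zip(data, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers


# Toy data: attribute 0 separates the two classes, attribute 1 is
# large-scale noise that is unrelated to the class.
data = [[0.0, 5.0], [0.1, 90.0], [0.2, 40.0],
        [1.0, 10.0], [1.1, 80.0], [0.9, 45.0]]
labels = [0, 0, 0, 1, 1, 1]

weights = attribute_weights(data, labels)
init = [data[0], data[3]]
weighted_assign, _ = weighted_kmeans(data, init, weights)
plain_assign, _ = weighted_kmeans(data, init, [1.0, 1.0])  # unweighted baseline

print("weights:", weights)
print("weighted assignment:", weighted_assign)
print("plain assignment:   ", plain_assign)
```

On this toy data the noisy attribute receives a weight close to zero, so the weighted variant clusters on the informative attribute alone, whereas the unweighted baseline is dominated by the noise dimension. The example only illustrates the mechanism; it says nothing about the accuracy figures reported for the intrusion detection dataset.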