Clustering bitstream data sets whose protocols are unknown is the foundation for subsequent unknown-protocol identification. Starting from the statistical characteristics of bitstreams, three protocol-independent feature parameters are proposed: compression ratio, Hamming weight, and run frequency. To address the sensitivity of the k-means algorithm to its initial cluster centers, an initial-center selection method based on accumulated distance sums is proposed, and k-means is then used to cluster bitstream data sets collected from a real network environment. Experimental results show that the defined feature parameters can be used effectively for clustering unknown-protocol bitstreams, and that the proposed initial-center selection method improves the stability and execution efficiency of the k-means algorithm.
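As an illustration only, the three feature parameters and a distance-sum seeding step might be sketched as follows. The abstract does not give the exact definitions, so everything here is an assumption: the compression ratio is taken as zlib-compressed length over original length, the Hamming weight as the fraction of 1 bits, the run frequency as the number of runs per bit, and the seeding rule (first center has the smallest accumulated distance to all points, each later center has the largest accumulated distance to the centers already chosen) is one plausible reading rather than the authors' method. The names `features` and `pick_initial_centers` are hypothetical.

```python
import zlib
from math import dist


def bits_of(data):
    """Render a byte string as a string of '0'/'1' characters."""
    return "".join(f"{b:08b}" for b in data)


def features(data):
    """Return (compression ratio, Hamming weight, run frequency).

    Assumed definitions, not taken from the paper:
      compression ratio = zlib-compressed length / original length
      Hamming weight    = number of 1 bits / total bits
      run frequency     = number of runs of identical bits / total bits
    """
    if not data:
        raise ValueError("data must be non-empty")
    bits = bits_of(data)
    n = len(bits)
    compression_ratio = len(zlib.compress(data, 9)) / len(data)
    hamming_weight = bits.count("1") / n
    # A new run starts at the first bit and at every bit that differs
    # from its predecessor.
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return compression_ratio, hamming_weight, runs / n


def pick_initial_centers(points, k):
    """Choose k initial centers from `points` using accumulated distances.

    Assumed rule: the first center is the point whose summed distance to
    all points is smallest; each later center is the unchosen point whose
    summed distance to the already chosen centers is largest.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    sums = [sum(dist(p, q) for q in points) for p in points]
    chosen = [min(range(len(points)), key=sums.__getitem__)]
    while len(chosen) < k:
        candidates = [i for i in range(len(points)) if i not in chosen]
        nxt = max(
            candidates,
            key=lambda i: sum(dist(points[i], points[c]) for c in chosen),
        )
        chosen.append(nxt)
    return [points[i] for i in chosen]
```

Because the seeding is deterministic, repeated runs of k-means start from the same centers, which is one way the stability claim in the abstract could be realized. The feature vectors returned by `features` could be passed directly as `points`.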