To address the excessive volume of network logs and the problems it causes, we propose a grid-based TCP two-step clustering algorithm (GTTC) that exploits the characteristics of the TCP transport protocol. Based on an analysis of the TCP connection process, the algorithm partitions the logs into grid cells, first clusters the log entry of each TCP packet within its cell, then clusters the resulting first-step clusters across cells, and finally produces a single log record that represents the entire TCP connection. Combined with database technology, the algorithm does not require the number of clusters k to be preset and determines the clusters on its own. It also handles real, dynamically arriving data: it clusters incrementally, deleting data that has already been clustered and processing newly arrived log entries. Tests in a real network environment show that the algorithm greatly reduces the storage required for TCP logs while preserving the completeness and accuracy of the log records, and that it does not affect users' normal network communication.
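The two-step idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the grid is defined, so the sketch assumes that a grid cell is a fixed time window and that packet logs belong together when they share a direction-independent address and port pair. The names `PacketLog`, `conn_key`, `first_step`, `second_step`, and `cell_seconds` are hypothetical, and an in-memory dictionary stands in for the database. The number of clusters is never preset; it follows from the connections found in the data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketLog:
    """One per-packet log entry (field names are assumptions)."""
    ts: float     # capture time in seconds
    src: str      # source IP address
    sport: int    # source port
    dst: str      # destination IP address
    dport: int    # destination port
    flags: str    # TCP flags, e.g. "S", "SA", "A", "FA", "R"
    length: int   # payload bytes


def conn_key(p):
    """Direction-independent key so both halves of a connection match."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (a, b) if a <= b else (b, a)


def first_step(packets, cell_seconds=1.0):
    """Step 1: cluster packet logs inside each grid cell.

    Assumption: a grid cell is a time window of `cell_seconds` seconds.
    Returns {cell_id: {conn_key: [PacketLog, ...]}}.
    """
    cells = defaultdict(lambda: defaultdict(list))
    for p in packets:
        cells[int(p.ts // cell_seconds)][conn_key(p)].append(p)
    return cells


def second_step(cells):
    """Step 2: merge per-cell clusters into one record per connection."""
    records = {}
    for cell_id in sorted(cells):
        for key, pkts in cells[cell_id].items():
            rec = records.setdefault(
                key,
                {"start": None, "end": None, "packets": 0,
                 "bytes": 0, "closed": False},
            )
            first_ts = min(p.ts for p in pkts)
            last_ts = max(p.ts for p in pkts)
            rec["start"] = first_ts if rec["start"] is None else min(rec["start"], first_ts)
            rec["end"] = last_ts if rec["end"] is None else max(rec["end"], last_ts)
            rec["packets"] += len(pkts)
            rec["bytes"] += sum(p.length for p in pkts)
            # A FIN or RST flag marks the connection as finished.
            rec["closed"] = rec["closed"] or any(
                "F" in p.flags or "R" in p.flags for p in pkts
            )
    return records


packets = [
    PacketLog(0.1, "10.0.0.1", 5000, "10.0.0.2", 80, "S", 0),
    PacketLog(0.2, "10.0.0.2", 80, "10.0.0.1", 5000, "SA", 0),
    PacketLog(1.5, "10.0.0.1", 5000, "10.0.0.2", 80, "A", 100),
    PacketLog(2.7, "10.0.0.1", 5000, "10.0.0.2", 80, "FA", 0),
    PacketLog(0.3, "10.0.0.3", 6000, "10.0.0.2", 80, "S", 0),
]
records = second_step(first_step(packets))
```

In this sketch the five packet logs collapse into two connection records. An incremental variant could write out and delete the records marked `closed` and keep only open ones between batches, which is one possible reading of the incremental clustering described above. The sketch does not split records when an address and port pair is reused by a later connection.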