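Using big data technology to broaden and deepen the condition evaluation of power transmission and transformation equipment, and to solve practical application problems, has become a new challenge for the power industry. To address two core big data processing problems, reliable storage and fast access of condition monitoring data for power transmission and transformation equipment, we study storage optimization in three respects (data distribution strategy, data block size tuning, and cluster network topology planning) as well as parallel big data analysis, on an experimental cloud computing platform built on open-source Hadoop. We propose a multi-replica consistent hashing data storage algorithm that accounts for data correlation; it gathers correlated data together in the cluster and improves data processing efficiency. Based on this correlation-aware multi-replica consistent hashing data distribution, we apply the MapReduce parallel programming model to design and implement a parallel join query algorithm over multiple data sources and a parallel feature extraction algorithm for multi-channel data fusion. Both algorithms were tested on a cluster built in the laboratory. The results show that the multi-source parallel join query takes only 32% of the execution time of the standard Hadoop scheme, and the multi-channel data fusion parallel feature extraction algorithm takes only 35%.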
Applying big data technology to improve the condition evaluation of power transmission and transformation equipment, and to solve the associated practical problems, has become a new challenge for the power industry. To achieve highly reliable storage and rapid access of data, the data distribution strategy, data block size tuning, and cluster network topology are studied on Hadoop. A multi-copy consistent hashing algorithm based on data correlation (CMCH) is proposed. The algorithm gathers correlated data together in the cluster and improves data processing speed. Based on the CMCH algorithm and the MapReduce model, a map-join query algorithm for multiple data sources and a feature extraction algorithm for multi-channel data fusion are designed. The two algorithms are executed on a laboratory-built cluster, and the results show that CMCH improves the efficiency of both: their execution times are only 32% and 35%, respectively, of those under standard Hadoop.
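The idea of placing correlated data together through multi-replica consistent hashing can be illustrated with a minimal sketch. This is not the CMCH algorithm itself, whose details the abstract does not give. It assumes that correlation is expressed by a shared key (here a hypothetical device ID), that each key is stored on several distinct nodes, and that virtual nodes are used to balance load. The class name `CorrelatedRing`, the node names, and the data source names are all illustrative.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the hash ring (stable across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class CorrelatedRing:
    """Consistent-hash ring that places records by a correlation key.

    Assumption: records sharing a correlation key (e.g. a device ID)
    should land on the same set of replica nodes.
    """

    def __init__(self, nodes, replicas=3, vnodes=64):
        nodes = list(nodes)
        # Cannot keep more distinct replicas than there are nodes.
        self.replicas = min(replicas, len(set(nodes)))
        # Each physical node owns several virtual points to even out load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def locate(self, correlation_key: str):
        """Return the distinct nodes that hold the replicas for this key."""
        start = bisect.bisect_right(self._points, _hash(correlation_key))
        chosen = []
        # Walk clockwise from the key's position, collecting distinct nodes.
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == self.replicas:
                    break
        return chosen


nodes = ["node1", "node2", "node3", "node4", "node5"]
ring = CorrelatedRing(nodes, replicas=3)

# Hypothetical records from two monitoring sources: (source, device_id).
records = [
    ("oil_chromatography", "T-001"),
    ("partial_discharge", "T-001"),
    ("oil_chromatography", "T-002"),
]
for source, device_id in records:
    # Hashing on device_id, not on source, co-locates correlated records.
    print(source, device_id, ring.locate(device_id))
```

Because both `T-001` records hash on the same key, they resolve to the same replica nodes, so a join or fusion step over them could in principle run locally on any of those nodes without moving data across the network.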