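The rapid growth of big data systems has spurred research on big data benchmarking; how to evaluate different big data systems fairly and how to select a suitable system for a given set of requirements have become prominent questions. However, the breadth of application domains, the diversity of data types, and the complexity of data operations make the design of a big data benchmark suite very challenging. Existing benchmarking efforts either target a specific class of applications or software stacks, or select big data workloads subjectively according to popularity, and so struggle to cover the diversity and complexity of big data comprehensively. To address these shortcomings, this paper discusses the requirements that a big data benchmark must satisfy and presents BigDataBench, an open source big data benchmark suite spanning three fields: systems, architecture, and data management. It covers five typical application domains (search engines, e-commerce, social networks, multimedia, and bioinformatics), includes structured, semi-structured, and unstructured data types, and covers four workload types: offline analytics, interactive analytics, online services, and NoSQL. It currently contains 14 real-world data sets, 3 types of data generation tools, and 33 workloads implemented on different software stacks. BigDataBench has been widely used in academia and industry, with use cases including workload characterization, architecture design, and system optimization. Based on BigDataBench, the China Academy of Information and Communications Technology, together with the Institute of Computing Technology of the Chinese Academy of Sciences, Huawei, and other well-known domestic and international companies and research institutions, jointly developed China's first industry-standard performance benchmark for big data platforms.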
The rapid growth of big data has sparked great interest in storing and processing these data; consequently, a variety of big data systems have emerged, creating strong demand for big data benchmarking. However, the complexity and diversity of big data pose great challenges for benchmarking. Most related benchmarking efforts either target specific application domains and software stacks or choose workloads subjectively according to popularity, and thus fail to cover the diversity and complexity of big data. In this paper, we discuss the requirements for big data benchmarking and present our open source big data benchmark suite, BigDataBench, a multidisciplinary research and engineering effort spanning systems, architecture, and data management. BigDataBench adopts an iterative and incremental methodology; it covers five representative application domains and includes diverse data models and workload types. Currently, it includes 14 real-world data sets, scalable data generation tools for 3 types of data, and 33 workloads implemented using competitive technologies. BigDataBench has been used in both academia and industry, with typical use cases including workload characterization, architecture design, and system optimization. Based on BigDataBench, the China Academy of Information and Communications Technology, together with the Institute of Computing Technology, Chinese Academy of Sciences (ICT, CAS), Huawei, and other well-known companies and research institutions, released China's first industry-standard big data benchmark suite.