科学研究在经历了实验科学、理论科学、计算科学阶段后,进入了数据密集型科学阶段,与之相伴的是大数据时代的到来.大数据泛指规模达到几百TB,甚至PB级的数据①,其典型的特征是分布、异构、低质量等.尽管传统数据库管理技术(特别是商业关系型数据库)在过去40年间取得了巨大成功,但是这些技术和系统无法有效管理支持数据密集型科学与工程(Data-Intensive Science and Engineering,DISE)的大数据.文中探讨数据密集型科学与工程的具体需求和现实挑战.它涵盖的内容表现在4个层面,包括数据存储与组织、计算方法、数据分析以及用户接口技术等.同时,数据质量、数据安全、数据监护等内容也需要在各层面得到重视.文中尝试梳理了数据密集型科学与工程的整体架构,回顾了相关领域的新近发展,分析了面临的挑战,探讨了未来的研究方向.
Having passed through the phases of experimental science, theoretical science, and computational science, scientific research has entered the phase of data-intensive science, accompanied by the arrival of the big data era. Big data generally refers to data sets of hundreds of TB, several PB, or more, and such data are typically distributed, heterogeneous, and of low quality. Although traditional database management techniques, especially commercial relational DBMSs, have achieved great success over the past 40 years, they cannot manage the big data behind Data-Intensive Science and Engineering (DISE) efficiently and effectively, so novel methods are needed. This paper discusses the concrete requirements and practical challenges of DISE, which span four layers: data storage and organization, computational methods, data analysis, and user interfaces. Data quality, data security, and data curation also deserve attention at every layer. We outline the overall architecture of DISE, review recent progress in related areas, analyze the challenges, and discuss future research directions.