In the era of big data, the fourth paradigm of scientific research has become a fundamental research paradigm. Cloud computing can address the storage, management, annotation, and sharing of data in data-intensive scientific research, but a number of new challenges remain. This paper proposes a basic data provenance framework for scientific workflows in cloud computing environments and describes in detail the design of provenance data collection, storage, and query in the framework model. The framework is minimally intrusive: it has no significant effect on the performance of the scientific workflow itself. It also allows users to specify collection and querying of provenance information at three different levels, which preserves the fidelity of provenance and improves the flexibility of data provenance.