Data cleaning is a central problem in big data. This paper presents the design and implementation of a Hadoop-based cloud cleaning system for big data. Using the MapReduce computing model, the system detects and repairs a variety of data quality problems. The system has the following features: (1) support for cleaning multiple kinds of data quality problems; (2) visualization of the cleaning progress, together with tuning of the cleaning parameters; (3) user-friendly interfaces for dataset input and for output of the cleaned dataset. The system provides effective and efficient data cleaning for both text data and database data.