以Hadoop分布式系统架构中最核心的HDFS和MapReduce为基础,提出了一种面向海量地质文档的表格信息快速抽取的方法。为了提高地质文档表格信息抽取速度,首先利用关键词查找文档在HDFS中存储的根目录,其次利用Hadoop分布式集群中Map函数和Reduce函数实现单元格信息的抽取和信息还原显示,最后对重庆市矿产资源潜力评价成果数据中WORD文档进行表格快速抽取实验。实验证明,本文提出的地质文档表格信息快速抽取方法可以大幅缩减传统单机串行地质文档表格信息抽取所需的时间。
Based on the most core HDFS and MapReduce in Hadoop distributed system architecture,a rapid extraction method of table information for massive geological documents is proposed.In order to improve the extraction speed of geological information document form,first of all,using the key WORDs to find documents stored in the HDFS root directory,then,using the Hadoop distributed cluster Map function and a Reduce function reduction cell information extraction and information,according to the mineral resources potential evaluation result data in Chongqing in WORD document form rapid extraction experiments.It is proved that the method of rapid extraction of geological document table information in this paper can greatly reduce the time needed to extract the information of the traditional single-machine serial geological document form.