The World Wide Web has become a valuable source of information owing to its explosive growth and spread over the past two decades. The sheer volume of web data has opened a new era for data analysis and mining systems, and more and more web applications need to extract, mine, and integrate structured data from many different data sources. However, because web pages are semi-structured documents, the data they present usually cannot be extracted and understood by machines automatically. Web information extraction therefore aims to extract structured data from web pages, a very challenging problem due to the large scale and high heterogeneity of web data. This paper surveys state-of-the-art web information extraction studies from recent years, analyzes the advantages and limitations of each approach, and provides a comprehensive categorization and comparison of existing approaches.