In the big data era, missing data are pervasive and bias many data-driven decisions. Traditional database imputation methods mainly rely on the local database to repair numerical values, and they are not suited to repairing both numerical and non-numerical values using data from the Web. Web-based imputation of missing values generally consists of four steps: query formulation, document retrieval, entity extraction, and entity ranking. Among these, the ranking of candidate entities determines which information is ultimately used to repair the database. Existing work on Web-based imputation falls into two categories. The first improves the quality of query formulation and extraction and then ranks the extracted candidate entities by frequency. The second analyzes the features a target entity should have, computes feature values for each candidate, and ranks candidates by a weighted combination of those values. Both approaches consider only factors intrinsic to each entity and ignore the influence between entities. In this paper, we build a graph model for candidate entity ranking and, based on it, propose a context-aware entity ranking algorithm, CER (Context-aware Entity Ranking). CER makes full use of the contextual features of candidate entities in Web pages and uses the influence between entities to infer new information, thereby producing a more accurate ranking. Experiments on real-world datasets show that, compared with frequency-based and weighting-based entity ranking algorithms, CER uses the massive data on the Web to repair missing values in relational databases more effectively.
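To make the contrast between frequency-based ranking and ranking that accounts for influence between entities concrete, the following is a minimal sketch and not the CER algorithm itself, whose graph construction and inference are not specified in this abstract. It assumes a toy list of extracted candidate mentions and a hypothetical set of weighted edges between candidates that appear in related contexts, and it substitutes a personalized-PageRank-style propagation for the graph model. The function names, the edge weights, the damping factor, and the sample data are all illustrative assumptions.

```python
from collections import Counter


def rank_by_frequency(mentions):
    """Baseline: rank candidate entities by extraction frequency."""
    counts = Counter(mentions)
    return sorted(counts, key=lambda e: (-counts[e], e))


def rank_by_graph(mentions, edges, damping=0.85, iterations=100):
    """Illustrative stand-in for a graph-based ranker (NOT the paper's CER).

    mentions: list of extracted candidate strings (every edge endpoint
              must occur here at least once).
    edges:    dict mapping (u, v) pairs to a positive weight; the weight is
              a hypothetical measure of contextual relatedness.
    """
    counts = Counter(mentions)
    nodes = sorted(counts)
    total = sum(counts.values())
    # The normalized frequency serves as each candidate's prior score.
    prior = {e: counts[e] / total for e in nodes}

    # Build a symmetric weighted adjacency structure.
    neighbors = {e: {} for e in nodes}
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight
    out_weight = {e: sum(neighbors[e].values()) for e in nodes}

    score = dict(prior)
    for _ in range(iterations):
        updated = {}
        for e in nodes:
            if neighbors[e]:
                # Each neighbor passes on its score in proportion to edge weight.
                received = sum(score[n] * w / out_weight[n]
                               for n, w in neighbors[e].items())
            else:
                # An isolated candidate simply keeps its own score.
                received = score[e]
            updated[e] = (1 - damping) * prior[e] + damping * received
        score = updated

    ranking = sorted(nodes, key=lambda e: (-score[e], e))
    return ranking, score


# Made-up extraction results for one missing cell (illustrative only).
mentions = (["Sydney"] * 3 + ["Canberra"] * 2 + ["Canberra ACT"] * 2
            + ["Canberra, Australia"] * 2 + ["City of Canberra"] * 2)
# Hypothetical edges: surface variants that share context with "Canberra".
edges = {
    ("Canberra", "Canberra ACT"): 1.0,
    ("Canberra", "Canberra, Australia"): 1.0,
    ("Canberra", "City of Canberra"): 1.0,
}

freq_order = rank_by_frequency(mentions)
graph_order, scores = rank_by_graph(mentions, edges)
print(freq_order[0], graph_order[0])
```

In this toy input the most frequent single string ranks first under frequency counting, whereas the propagation step lets several mutually related candidates reinforce one another, which is the kind of inter-entity influence the abstract says the two existing families of methods ignore.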