In Deep Web integrated retrieval, users usually want to obtain high-quality search results by submitting queries to only a few data sources, so data source selection becomes a key issue. To improve the efficiency of integrated retrieval of entity information, this paper proposes a data source selection method that considers both relevance and content duplication. First, it presents a method for constructing Deep Web data source summaries based on topic words and sentiment words: user feedback is used to identify the topic category of entity information, and sentiment words are used to measure the degree of content duplication between data sources. Then, it designs a scoring strategy for Deep Web data sources that combines query-topic relevance with content duplication. Experimental results show that the proposed method achieves relatively high accuracy and recall even with a small data summary, and can provide effective support for integrated retrieval of entity information.
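The abstract does not give the scoring formula, so the following is only a minimal sketch of how relevance and content duplication might be combined when selecting sources. Everything in it is an assumption made for illustration: the summary layout (a topic-word vector and a sentiment-word vector per source), the use of cosine similarity for both measures, the greedy rule that subtracts a duplication penalty (in the style of maximal marginal relevance, not the paper's own strategy), and the names `cosine`, `select_sources`, and `penalty`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_sources(query_topics, summaries, k, penalty=0.5):
    """Greedily pick up to k sources (hypothetical scoring, not the paper's).

    summaries maps a source name to a pair
    (topic-word weights, sentiment-word weights).
    Score = topic relevance to the query
            - penalty * highest sentiment-word similarity
              to any source already selected.
    """
    selected = []
    candidates = set(summaries)

    def score(name):
        topics, sentiments = summaries[name]
        relevance = cosine(query_topics, topics)
        duplication = max(
            (cosine(sentiments, summaries[s][1]) for s in selected),
            default=0.0,
        )
        return relevance - penalty * duplication

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy summaries: site_a and site_b carry identical content, site_c differs.
summaries = {
    "site_a": ({"phone": 3}, {"good": 2, "cheap": 1}),
    "site_b": ({"phone": 3}, {"good": 2, "cheap": 1}),
    "site_c": ({"phone": 2, "camera": 1}, {"slow": 2, "bad": 1}),
}
print(select_sources({"phone": 1}, summaries, k=2))
```

In this toy setup the second pick skips `site_b`, which duplicates the already selected `site_a`, in favor of the slightly less relevant but non-overlapping `site_c`. The `penalty` weight controls that trade-off and would need tuning on real data.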