To estimate the size of a Web database, this paper proposes a Capture-Recapture based method that filters out cases of two-character affinity and two-character repulsion. High-frequency characters of an attribute are submitted through the text box of the query interface, and the returned result sets are intersected pairwise. The independence of each pair of samples is analyzed from the distribution of the two characters in the intersection, dependent pairs are filtered out, and the Capture-Recapture method is then applied to estimate the size of the Web database. Experiments in simulated and real environments show that both the bias and the fluctuation of the method are small.
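The Capture-Recapture step can be illustrated with a minimal sketch. It assumes each query's result set is available as a Python set of record identifiers and uses the classical two-sample (Lincoln-Petersen) estimate, N ≈ |A|·|B| / |A∩B|. The function names, the `is_independent` hook, and the use of the median to combine pairwise estimates are illustrative assumptions, not details taken from the paper; in particular, the paper's affinity/repulsion test is not reproduced here and is left as a caller-supplied predicate.

```python
from itertools import combinations
from statistics import median


def lincoln_petersen(sample_a, sample_b):
    """Two-sample estimate |A| * |B| / |A ∩ B|; None if the samples do not overlap."""
    overlap = len(sample_a & sample_b)
    if overlap == 0:
        return None  # the estimate is undefined without recaptured records
    return len(sample_a) * len(sample_b) / overlap


def estimate_size(samples, is_independent=lambda a, b: True):
    """Combine pairwise estimates over all sample pairs that pass the filter.

    `is_independent` is a hypothetical placeholder for an independence test
    (for example, one that rejects affine or repulsive character pairs).
    The median is an assumed way of combining the pairwise estimates.
    """
    estimates = []
    for a, b in combinations(samples, 2):
        if not is_independent(a, b):
            continue  # drop pairs judged to be dependent
        est = lincoln_petersen(a, b)
        if est is not None:
            estimates.append(est)
    return median(estimates) if estimates else None
```

For example, two result sets of 50 and 60 records that share 10 records give an estimate of 50 × 60 / 10 = 300 records. The estimate is only meaningful when the two samples are roughly independent, which is why dependent pairs are filtered before estimation.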