The discovery of Deep Web data sources and their domain relevance has attracted growing attention. To address the low extraction precision and the neglect of domain relevance in existing approaches to identifying query interfaces, this paper proposes a method that uses multiple classifiers to classify and identify Deep Web data sources automatically. The idea is as follows: first, a Naive Bayes classifier classifies the pages fetched by the crawler according to their domain relevance; second, an improved C4.5 decision tree classifier determines which pages in the specific domain are data sources. Experimental results show that, compared with a single decision tree classifier, the method performs better, with improvements in both recall and precision.
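The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the C4.5 algorithm was improved, so the second stage here is a one-level tree chosen with the standard C4.5 gain-ratio criterion. The training pages, the form features (`has_password`, `many_fields`), and all function and class names are hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing over word tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in doc.lower().split():
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Standard C4.5 criterion: information gain divided by split information."""
    n = len(rows)
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return (entropy(labels) - remainder) / split_info if split_info > 0 else 0.0


def train_stump(rows, labels):
    """One-level tree: split on the best attribute, majority label per value."""
    best = max(rows[0], key=lambda a: gain_ratio(rows, labels, a))
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[row[best]][label] += 1
    return best, {v: c.most_common(1)[0][0] for v, c in votes.items()}


def is_domain_query_interface(page_text, form_features, nb, stump, domain):
    """Stage 1: domain filter (Naive Bayes). Stage 2: interface test (tree)."""
    if nb.predict(page_text) != domain:
        return False
    attr, table = stump
    return table.get(form_features[attr], False)


# Hypothetical toy training data, made up for this sketch.
nb = NaiveBayes().fit(
    ["cheap flights airline tickets departure",
     "book hotel flight airport travel",
     "novel author publisher isbn book",
     "paperback fiction author bestseller"],
    ["travel", "travel", "books", "books"],
)
stump = train_stump(
    [{"has_password": False, "many_fields": True},
     {"has_password": False, "many_fields": False},
     {"has_password": True, "many_fields": False},
     {"has_password": True, "many_fields": True}],
    [True, True, False, False],  # True = form is a searchable query interface
)

print(is_domain_query_interface(
    "cheap airline tickets",
    {"has_password": False, "many_fields": True},
    nb, stump, "travel"))
```

Running the domain filter first means the tree only has to separate query interfaces from other forms (for example login forms) within one domain, which is the division of labor the abstract describes.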