This paper presents an extensible framework for extracting key information from web pages. The framework integrates template-independent, fully automatic information extraction algorithms with template-based extraction algorithms, fundamentally improving both the precision and the efficiency of extraction. Several key components of the framework can be replaced as required, which makes the framework highly extensible. The paper also proposes an orthogonal filtering algorithm for templates. Incorporating this algorithm into template-based extraction improves the accuracy of the generated templates. Experimental results support these conclusions.
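As an illustration of the replaceable-component idea only, the following minimal Python sketch wires a template-based stage and a template-independent stage behind one common interface. The abstract does not specify how the two kinds of algorithm are combined, so the fallback order, the `ExtractionPipeline` class, and both toy extractors are assumptions made for this sketch; the orthogonal filtering algorithm is not described in enough detail here to be shown.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical interface: an extractor maps raw HTML to the key text,
# or returns None when it cannot produce a result.
Extractor = Callable[[str], Optional[str]]


def template_extractor(html: str) -> Optional[str]:
    """Toy template-based stage: the 'template' is a single fixed pattern."""
    match = re.search(r'<div id="content">(.*?)</div>', html, re.S)
    return match.group(1).strip() if match else None


def automatic_extractor(html: str) -> Optional[str]:
    """Toy template-independent stage: return the longest text run between tags."""
    chunks = [c.strip() for c in re.split(r"<[^>]+>", html)]
    chunks = [c for c in chunks if c]
    return max(chunks, key=len) if chunks else None


class ExtractionPipeline:
    """Tries the template stage first and falls back to the automatic stage.

    Both stages are injected, so either can be swapped for another
    implementation without touching the pipeline itself.
    """

    def __init__(self, template_stage: Extractor, automatic_stage: Extractor) -> None:
        self.template_stage = template_stage
        self.automatic_stage = automatic_stage

    def extract(self, html: str) -> Tuple[Optional[str], str]:
        text = self.template_stage(html)
        if text is not None:
            return text, "template"
        return self.automatic_stage(html), "automatic"


pipeline = ExtractionPipeline(template_extractor, automatic_extractor)
```

Because the pipeline depends only on the `Extractor` signature, replacing either stage amounts to passing a different function to the constructor, which is one plausible reading of the extensibility the abstract describes.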