商品属性及其对应值的自动挖掘,对于基于Web的商品市场需求分析、商品推荐、售后服务等诸多领域有重要的应用价值。该文提出一种基于网页标题的模板构建方法,从结构化网页中抽取完整的商品“属性值”关系。该方法包含四个关键技术:1)利用商品网页标题构建领域相关的属性词包;2)基于预设分隔符细化文本节点;3)结合领域商品属性词包获取种子“属性值”关系;4)结合网页布局信息和字符信息来筛选与构建模板。该文的实验基于相机和手机两个领域展开,获得94.68%的准确率和90.57%的召回率。
If we represent the products as attributes and attribute values, it will improve the effectiveness of many applications, such as demand forecasting, product recommendations, and product supplier selection. In this paper, we propose a novel pattern based method to extract the "attribute-value" pair of product from structured or semistructured Web pages. This approach contains four key components: 1) acquire domain-specific attributes from tities of Web pages in the same domain. 2) refine text nodes based on some default delimiters. 3) coIlect seed "attribute-value" pairs based on the domain-specific attributes. 4) construct high-quality patterns by combining page-specific layout information and character information. The experimental corpus is collected from two domains: digital camera and mobile phone. Experiments show the proposed method can schieve 94.68%in precision and 90.57% in recall.