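The growing number and scale of products on e-commerce websites pose new challenges for data management. One important basic task is product normalization, that is, identifying all products that refer to the same real-world entity. Product normalization helps improve the accuracy of product search and the user experience. However, on e-commerce websites, especially under the C2C (Customer-to-Customer) model, product data is of low quality and lacks a unified schema definition, which makes existing product normalization methods hard to apply. To address this problem, this paper designs a hybrid framework that combines data integration, data cleaning, and product normalization. The framework first performs schema integration using a graph-based method and then cleans the data using the products' descriptive information, yielding product data of higher quality with a unified schema. After data integration and data cleaning, a logistic regression model is used to train a classifier, from which a similarity matrix between products is obtained; finally, the similarity matrix is clustered to achieve product normalization. Comparative experiments against existing methods on real data verify the effectiveness of the proposed method.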
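The boom of E-commerce in terms of product variety and quantity brings new challenges to data management, one of which is product normalization. Product normalization determines whether products refer to the same underlying entity. It is a fundamental data management task in E-commerce, especially for the C2C (Customer-to-Customer) model, and it can improve search functionality and the user's shopping experience. However, product normalization in the E-market is difficult because the data is noisy and lacks a uniform schema, making existing normalization methods ineffective. In this paper, we propose a hybrid framework that combines data integration, data cleaning, and product normalization. The framework first performs graph-based schema integration and then cleans the data using product descriptions, yielding product data of higher quality with a uniform schema. A classifier is then trained with a logistic regression model to obtain a similarity matrix between products, and the similarity matrix is clustered to normalize the products. We conduct experiments on real-world data, and the experimental results confirm the effectiveness of our design in comparison with existing methods.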
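The final stage of the framework (a logistic regression classifier, then a similarity matrix, then clustering) can be illustrated with a minimal sketch. This is not the paper's implementation. The toy product records, the two pairwise features (title token overlap and price ratio), the gradient-descent settings, the 0.5 threshold, and the use of connected components as the clustering step are all assumptions made for illustration; the abstract does not specify any of them. For brevity the toy classifier is trained and applied on the same pairs, whereas a real setup would train on separately labeled pairs.

```python
import math
from itertools import combinations

# Toy product records (hypothetical data, not from the paper).
products = [
    {"title": "apple iphone 4 16gb black", "price": 500},
    {"title": "iphone 4 16gb black apple", "price": 510},
    {"title": "nokia n8 silver", "price": 300},
    {"title": "nokia n8 silver phone", "price": 310},
    {"title": "canon eos 550d camera", "price": 700},
]
# Assumed ground truth: index pairs that refer to the same entity.
same_entity = {(0, 1), (2, 3)}

def jaccard(a, b):
    """Token-set Jaccard similarity between two titles."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def pair_features(p, q):
    """Assumed pairwise features: bias term, title overlap, price ratio."""
    ratio = min(p["price"], q["price"]) / max(p["price"], q["price"])
    return [1.0, jaccard(p["title"], q["title"]), ratio]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    """Predicted probability that a pair refers to the same entity."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def train_logreg(X, y, lr=1.0, epochs=3000):
    """Batch gradient descent on the average logistic loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = score(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def similarity_matrix(items, w):
    """Symmetric matrix of classifier scores for every product pair."""
    n = len(items)
    S = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        S[i][j] = S[j][i] = score(w, pair_features(items[i], items[j]))
    return S

def cluster(S, threshold=0.5):
    """Connected components of the graph with an edge where S >= threshold."""
    parent = list(range(len(S)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in combinations(range(len(S)), 2):
        if S[i][j] >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(S)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

pairs = list(combinations(range(len(products)), 2))
X = [pair_features(products[i], products[j]) for i, j in pairs]
y = [1.0 if pair in same_entity else 0.0 for pair in pairs]
weights = train_logreg(X, y)
S = similarity_matrix(products, weights)
clusters = cluster(S)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Each resulting cluster groups the indices of products treated as one entity. Thresholded connected components are the simplest way to turn a similarity matrix into groups; any other clustering method that accepts a similarity matrix could be substituted without changing the rest of the sketch.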