E-commerce websites use different web page coding technologies and page layouts, which makes it difficult to acquire information for comparison shopping. Information extraction models based on the Hidden Markov Model (HMM) are easy to build and highly adaptable, and are therefore regarded as an effective approach to information extraction. However, the HMM algorithm has drawbacks: computing the state sequence is complex, and the extraction model is difficult to train and optimize. This paper uses the monotonicity of fuzzy integrals to build a Hidden Markov Model based on the Choquet integral (CI-HMM), which removes the need for the conditional independence assumption in computing HMM observation sequence probabilities and optimizes the computation of HMM observation sequences. Experiments on product data from online bookstores show that CI-HMM offers better applicability and precision than HMM.
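As a rough illustration of the building block named above, the following Python sketch computes a discrete Choquet integral with respect to a monotone fuzzy measure. It is not the CI-HMM formulation of this paper, which the abstract does not spell out. The feature names `tag` and `text`, the scores, and the measure values are all hypothetical and chosen only to show how a non-additive measure differs from an additive one.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of non-negative scores.

    values:  dict mapping a source name to its score (assumed >= 0).
    measure: dict mapping every frozenset of source names to a weight,
             with measure[frozenset()] == 0 and the full set mapped to 1.
    """
    ordered = sorted(values, key=values.get)  # sources by ascending score
    total = 0.0
    previous = 0.0
    for i, name in enumerate(ordered):
        # Coalition of all sources whose score is at least values[name].
        coalition = frozenset(ordered[i:])
        total += (values[name] - previous) * measure[coalition]
        previous = values[name]
    return total


def is_monotone(measure):
    """Check that A being a subset of B implies measure[A] <= measure[B]."""
    return all(
        measure[a] <= measure[b]
        for a in measure
        for b in measure
        if a <= b
    )


# Hypothetical scores from two evidence sources for one observation.
scores = {"tag": 0.2, "text": 0.6}

# Additive measure: the integral reduces to an ordinary weighted sum.
additive = {
    frozenset(): 0.0,
    frozenset({"tag"}): 0.5,
    frozenset({"text"}): 0.5,
    frozenset({"tag", "text"}): 1.0,
}

# Non-additive (but still monotone) measure: the sources interact.
non_additive = {
    frozenset(): 0.0,
    frozenset({"tag"}): 0.3,
    frozenset({"text"}): 0.4,
    frozenset({"tag", "text"}): 1.0,
}

additive_result = choquet_integral(scores, additive)
non_additive_result = choquet_integral(scores, non_additive)
```

With the additive measure the result equals the weighted mean of the two scores; with the non-additive measure the result changes because the weight of the pair is not the sum of the individual weights. Monotonicity is the only property the measure has to satisfy here, which is why it is checked separately in `is_monotone`.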