信息抽取是从大量的数据中准确、快速地获取目标信息,提高信息的利用率。考虑网页数据的特点,提出一种适用于网页信息抽取改进的隐马尔科夫模型(HMM),即结合最大熵模型(ME)在特征知识表示方面的优势,在HMM模型中加入后向依赖,利用发射单元特征来调整模型参数。改进后的HMM状态转移概率和观察输出概率不仅依赖于模型的当前状态值,而且可以以模型的前向状态值和后向特征值加以修正。实验结果表明,使用改进后的HMM模型应用到网页信息抽取中,可以有效地提高网页信息抽取的质量。
The task of information extraction is to obtain the objective information precisely and quickly from a large scale of data and improve the utilization of information. According to the characteristics of web data,an improved hidden Markov model(HMM) for web information extraction is proposed,which means combining the advantage of maximum entropy(ME) model in the representation of feature knowledge. The backward dependency assumption in the HMM is added and the model parameters are adjusted by using the characteristic of the emission unit. The state transition probability and the output probability of the improved HMM are not only dependent on the current state of the model,but also be corrected by the forward and backward state values of the historical state of the model. The experimental results show that applying the improved HMM model to web information extraction can effectively improve the quality of web information extraction.