The Internet has become one of the major sources from which enterprises and organizations acquire competitor intelligence, and many enterprises urgently need Web-based systems that acquire such intelligence automatically. In such a system, recognizing business organization names is a fundamental task, because it is the basis for identifying competitors and extracting further intelligence from the Web. In this paper, we present a new approach to recognizing business organization names on the Internet. The approach takes into account the semantic relationship between business organization names and their context in Web pages, and recognizes the names by integrating semantic annotation with the Hidden Markov Model (HMM). We evaluate the approach on a real dataset consisting of a large number of Chinese Web pages and compare it, with respect to recall, precision, and F-measure, with three competitor algorithms: CHMM (organization name recognition based on the cascaded hidden Markov model), MEM (organization name recognition based on the maximum entropy model), and SVM (organization name recognition based on support vector machines). The results show that our approach improves the effectiveness of business organization name recognition. It is also a general-purpose algorithm that suits different types of business organization recognition tasks.
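The three evaluation measures named above (recall, precision, and F-measure) can be illustrated with a minimal sketch. It assumes that recognized names are scored by exact match against a gold-standard set; the function name `precision_recall_f`, the `beta` parameter, and the placeholder names in the usage note are hypothetical and are not taken from the paper, whose exact scoring protocol is not stated here.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Score predicted organization names against a gold-standard set.

    Assumption: a prediction counts as correct only if it exactly
    matches a gold name (the paper's matching rule may differ).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Precision: share of predicted names that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold names that were found.
    recall = true_positives / len(gold) if gold else 0.0

    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    # F-measure: weighted harmonic mean of precision and recall
    # (beta = 1 gives the usual balanced F1).
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure
```

For example, with four predicted names of which two appear among three gold names, `precision_recall_f({"A", "B", "C", "D"}, {"A", "B", "E"})` gives a precision of 0.5, a recall of 2/3, and an F-measure of 4/7.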