Search engines have traditionally relied on manually added mappings between full names and abbreviations, which leads to missing abbreviations and low recall in search results. To address these problems, this paper proposes a method for automatically extracting the full names and abbreviations of organizations, based on Web pages and word segmentation. The method first retrieves the source code of an organization's website homepage from its domain name and extracts the organization's full name from it. It then extracts candidate abbreviations from the source code using a set of contextual feature words for organization names. Finally, it computes the similarity between each candidate abbreviation and the full name to determine the final abbreviations. In experiments on 1,287 organization websites, full-name extraction reached an accuracy of 93.9%, and abbreviation extraction reached a recall of 85.3% and a precision of 90.8%, indicating that the method performs well.
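
The pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how the full name is located in the page source or which similarity measure is used. The sketch assumes, purely for illustration, that the full name is taken from the homepage `<title>` element and that similarity is the fraction of a candidate's characters that appear, in order, in the full name (a longest-common-subsequence ratio). The names `extract_title`, `similarity`, and `select_abbreviations`, and the `threshold` parameter, are hypothetical. Fetching the page and collecting candidates from contextual feature words are omitted.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Assumption: the homepage <title> stands in for the full name."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(candidate, full_name):
    """Assumed measure: share of candidate characters found in order in full_name."""
    if not candidate:
        return 0.0
    return lcs_length(candidate, full_name) / len(candidate)


def select_abbreviations(candidates, full_name, threshold=1.0):
    """Keep candidates shorter than the full name whose similarity meets the threshold."""
    return [
        c for c in candidates
        if len(c) < len(full_name) and similarity(c, full_name) >= threshold
    ]
```

For example, with the full name "北京大学" and the candidates "北大", "清华", and "北京大学", only "北大" is kept under the default threshold: every character of "北大" occurs in order in the full name, "清华" shares no characters with it, and "北京大学" is not shorter than the full name. A real system would likely need a lower threshold and additional rules to handle abbreviations that are not character subsequences.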