Identifying and locating domain-specific bilingual websites is key to automatically building domain-specific bilingual corpora from the Web. However, the quality of sentence pairs varies considerably across such websites. In contrast to existing methods, which identify and filter low-quality sentence pairs using textual features of the pairs themselves, this paper starts from the origin of the sentence pairs, namely the domain-specific bilingual websites. Based on the hypothesis that websites with high authority in the target domain tend to contain high-quality parallel sentence pairs, we propose a HITS-based optimization method for mining bilingual sentence pairs. The method first builds a directed graph from the link information among websites, then uses the HITS algorithm to measure the authority of each website, and finally extracts sentence pairs only from the highly authoritative websites, using them as training data for a domain-specific machine translation system. Taking the education domain as the target, we test the feasibility of the hypothesis that websites with high domain authority contain high-quality sentence pairs. Experimental results show that a translation system trained on sentence pairs mined with the proposed method improves over the baseline system by 0.44 BLEU points on average. In addition, to address the "topic drift" problem of the HITS algorithm, we propose an improved algorithm based on GHITS, which yields a further improvement of 0.40 BLEU points.
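The authority-scoring step can be illustrated with a minimal Python sketch of the standard HITS iteration over a website link graph. It is not the authors' implementation: the abstract does not specify how the graph is built, how many websites are kept, or how GHITS modifies the update, so the function names, the iteration count, and the top-k selection below are assumptions made for illustration only.

```python
import math


def hits(edges, iterations=50):
    """Standard HITS on a directed graph given as (source_site, target_site) pairs.

    Returns (authority, hub) dictionaries, each L2-normalized.
    The fixed iteration count is an assumption; a convergence test could be used instead.
    """
    edges = list(edges)
    nodes = sorted({site for edge in edges for site in edge})
    in_links = {n: set() for n in nodes}
    out_links = {n: set() for n in nodes}
    for src, dst in edges:
        if src != dst:  # ignore self-links
            out_links[src].add(dst)
            in_links[dst].add(src)

    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the sites linking in.
        auth = {n: sum(hub[m] for m in in_links[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum of authority scores of the sites linked to.
        hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return auth, hub


def top_authorities(auth, k):
    """Return the k sites with the highest authority score (hypothetical selection rule)."""
    return sorted(auth, key=auth.get, reverse=True)[:k]
```

Under these assumptions, one would call `hits` on the crawled link pairs and pass the resulting authority scores to `top_authorities` to choose the websites from which sentence pairs are extracted. How the cutoff is chosen in the paper is not stated in the abstract.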