Websites with a strong domain focus often contain large amounts of parallel or comparable bilingual text. Building on this observation, we address the automatic identification of domain-specific bilingual websites and propose a method based on global search and local classification, which optimizes the identification process with respect to both recall and precision. Taking the electronic devices domain as the target, global search yields 18,944 bilingual websites for the domain, from which 3,000 are randomly sampled and manually annotated. On this annotated corpus, local classification identifies bilingual websites of the domain with an F-measure (F1) of 85.19%. We then use the bilingual sentence pairs extracted from the identified websites to expand the training set of a domain-specific machine translation system. Experiments show that, on the same test set, the performance of the translation system improves significantly, which verifies the effectiveness of the proposed method for automatically identifying domain-specific bilingual websites.
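To make the two-stage idea and the reported metric concrete, the following is a minimal sketch in Python, not the method used in the paper. Everything in it is an assumption made for illustration: the `Site` record, the seed terms in `DOMAIN_TERMS`, the toy data in `SITES`, and the rule-based `local_classify`, which stands in for whatever classifier and features the authors actually used. Reading the global stage as recall-oriented and the local stage as precision-oriented is likewise an interpretation. Only the F-measure formula, F1 = 2PR / (P + R), is standard.

```python
from dataclasses import dataclass


@dataclass
class Site:
    url: str
    text: str        # toy stand-in for the crawled page content
    is_target: bool  # manual annotation: domain-specific bilingual site?


# Hypothetical seed terms for the electronic devices domain (not from the paper).
DOMAIN_TERMS = {"capacitor", "resistor", "电容", "电阻"}


def has_cjk(text: str) -> bool:
    """True if the string contains at least one CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def global_retrieval(sites: list[Site]) -> list[Site]:
    """Broad first pass: keep every site that mentions any domain term."""
    return [s for s in sites if any(t in s.text.lower() for t in DOMAIN_TERMS)]


def local_classify(site: Site) -> bool:
    """Stricter second pass: require a domain term in both languages."""
    text = site.text.lower()
    zh_hit = any(t in text for t in DOMAIN_TERMS if has_cjk(t))
    en_hit = any(t in text for t in DOMAIN_TERMS if not has_cjk(t))
    return zh_hit and en_hit


def f_measure(predicted: list[bool], gold: list[bool]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for binary predictions."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy annotated sample (invented data, for illustration only).
SITES = [
    Site("a.example", "Ceramic capacitor 陶瓷电容 datasheet", True),
    Site("b.example", "Resistor catalog, English only", False),
    Site("c.example", "旅游 travel guide", False),
    Site("d.example", "贴片电阻 chip resistor 规格", True),
    Site("e.example", "电容 capacitor 论坛闲聊", False),
]

if __name__ == "__main__":
    candidates = global_retrieval(SITES)
    predicted = [local_classify(s) for s in candidates]
    gold = [s.is_target for s in candidates]
    precision, recall, f1 = f_measure(predicted, gold)
    print(f"candidates={len(candidates)} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this sketch the F-measure is computed only over the sites that survive the first pass, mirroring the abstract's setup in which the annotated sample is drawn from the globally retrieved websites.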