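Extracting semantic correspondence information between languages from a bilingual parallel corpus is of great importance for improving the performance of cross-lingual information retrieval. Bilingual parallel documents share the same topics, and in a concrete model these bilingual topics appear as semantic correlation. This paper first treats bilingual parallel documents as two language representations of the same semantic content and constructs a latent semantic space for each language from the bilingual parallel corpus, thereby proposing a new bilingual topic model, the bilingual partial least squares topic correlation model. The new model overcomes the failure of the cross-lingual latent semantic indexing model to fully consider bilingual semantic relationships. Experimental results on a Chinese-English bilingual news corpus show that the new model clearly outperforms the cross-lingual latent semantic indexing model in document mate search and pseudo-query cross-lingual retrieval; on a TREC-9 bilingual parallel corpus obtained with Google Translate, the new model also achieves good retrieval performance.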
Extracting cross-language semantic information from bilingual parallel documents is important for improving cross-lingual information retrieval. Bilingual parallel documents share the same topics, which are semantically correlated. This paper proposes a new bilingual partial least squares topic correlation model (BiPLS). The model views parallel documents as two language representations of the same semantic content and builds a single topic space for each language from a bilingual parallel corpus. Cross-lingual information retrieval is then conducted in these new topic spaces. The new model overcomes a deficiency of cross-lingual latent semantic indexing (CL-LSI), which does not fully take bilingual semantic relationships into account. Experimental results on an aligned Chinese-English news collection show that BiPLS significantly outperforms CL-LSI in mate search and cross-lingual pseudo-query retrieval, and that it also achieves good retrieval performance on a TREC-9 bilingual parallel corpus produced with Google Translate.
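
The abstract does not spell out the BiPLS algorithm, so the following is only a minimal sketch of the general idea, assuming a PLS-SVD style variant: the cross-covariance between the two document-term matrices is decomposed by SVD, each language gets its own projection into a k-dimensional topic space, and mate search is done by cosine similarity in those spaces. The function names (`fit_pls_svd`, `project`, `mate_search`), the dictionary keys, and the choice of PLS-SVD itself are illustrative assumptions and may differ from the model actually proposed in the paper.

```python
import numpy as np


def fit_pls_svd(X, Y, k):
    """Fit paired topic directions from aligned document-term matrices.

    X: (n_docs, n_terms_lang1), Y: (n_docs, n_terms_lang2); row i of X and
    row i of Y are assumed to be a parallel document pair.
    NOTE: PLS-SVD is an assumed stand-in, not necessarily the paper's BiPLS.
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    # Singular vectors of the cross-covariance matrix give pairs of
    # directions (one per language) with maximal covariance.
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    return {
        "x_mean": x_mean,
        "y_mean": y_mean,
        "Wx": U[:, :k],      # projection for language 1
        "Wy": Vt[:k].T,      # projection for language 2
        "s": s[:k],          # covariance captured by each topic pair
    }


def project(D, mean, W):
    """Map documents of one language into its k-dimensional topic space."""
    return (D - mean) @ W


def mate_search(Zx, Zy):
    """For each language-1 document, return the index of the most similar
    language-2 document by cosine similarity in the topic spaces."""
    Zx_n = Zx / (np.linalg.norm(Zx, axis=1, keepdims=True) + 1e-12)
    Zy_n = Zy / (np.linalg.norm(Zy, axis=1, keepdims=True) + 1e-12)
    return (Zx_n @ Zy_n.T).argmax(axis=1)
```

In this sketch a query in one language would be projected with its own language's matrix (`Wx` or `Wy`) and compared against documents projected with the other, which mirrors the abstract's description of retrieval being carried out in the per-language topic spaces.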