This paper studies how machine translation techniques can be leveraged for question retrieval on community question answering (CQA) sites. Using questions, question descriptions, best answers, and other answers from large-scale Yahoo and Baidu data, we train seven variants of translation-based retrieval models and compare them on two manually labeled data sets. The experimental results show that all of the translation-based models improve on the traditional language model for question retrieval, but their relative performance differs between the Yahoo data and the Baidu data. On the Yahoo data, which has more answers per question on average, the translation model trained on questions, descriptions, and the concatenation of all answers performs best, whereas on the Baidu data the best-performing model is learned from questions and their descriptions alone.
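To make the idea of translation-based retrieval concrete, the following is a minimal sketch of how a query question might be scored against an archived question. The abstract does not state the scoring formula, so the sketch assumes a common formulation in which the document model mixes a maximum-likelihood estimate with word-to-word translation probabilities and is then smoothed with a collection model. The function name `translation_lm_score`, the parameters `beta` and `lam`, and the toy translation table are hypothetical and chosen only for illustration; in practice the translation probabilities would be learned from parallel text such as question/description or question/answer pairs.

```python
import math
from collections import Counter


def translation_lm_score(query, doc, trans_prob, collection_tf, collection_len,
                         beta=0.5, lam=0.2):
    """Log-likelihood of `query` given `doc` under a translation-smoothed LM.

    Assumed formulation (not taken from the paper):
      P(w|D) = (1 - lam) * [(1 - beta) * P_ml(w|D)
                            + beta * sum_t P(w|t) * P_ml(t|D)]
               + lam * P_ml(w|C)
    `trans_prob` maps (query_word, doc_word) pairs to P(query_word | doc_word).
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        # Direct match: maximum-likelihood estimate from the document.
        p_ml = doc_tf[w] / doc_len
        # Translation match: each document word t may "translate" into w.
        p_tr = sum(trans_prob.get((w, t), 0.0) * count / doc_len
                   for t, count in doc_tf.items())
        # Background estimate from the whole collection.
        p_bg = collection_tf.get(w, 0) / collection_len
        p = (1 - lam) * ((1 - beta) * p_ml + beta * p_tr) + lam * p_bg
        # Floor the probability so unseen words do not produce log(0).
        score += math.log(max(p, 1e-12))
    return score


# Toy data (hypothetical): two archived questions and one query question.
archive = {
    "q1": ["repair", "notebook", "screen"],
    "q2": ["bake", "bread", "oven"],
}
query = ["fix", "laptop"]
toy_trans_prob = {("fix", "repair"): 0.6, ("laptop", "notebook"): 0.7}

collection_tf = Counter(w for words in archive.values() for w in words)
collection_len = sum(collection_tf.values())

# Rank archived questions by descending score.
ranking = sorted(
    archive,
    key=lambda qid: translation_lm_score(
        query, archive[qid], toy_trans_prob, collection_tf, collection_len),
    reverse=True,
)
print(ranking)
```

In this toy case the query shares no words with either archived question, so a plain language model cannot separate them; the translation component is what lets the lexically different but related question receive a higher score. Training the translation table on different parallel sources (for example, questions paired with descriptions versus questions paired with answers) is the kind of variation the compared models explore.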