This paper studies word embedding methods based on matrix factorization, presents a unified model that describes them, and applies the model to Chinese-English cross-lingual word embeddings. Using a parallel (bilingual aligned) corpus as the knowledge source, it proposes a method for identifying cross-lingual relevant words and two pointwise relevance measures: cross-lingual co-occurrence counts and cross-lingual pointwise mutual information. An objective function is designed for each measure to learn Chinese-English cross-lingual word embeddings. Experiments vary the objective function, the corpus, and the vector dimension. In Chinese-English cross-lingual document classification, the best model, which uses cross-lingual co-occurrence counts as the relevance measure, reaches an accuracy of 87.04%. In contrast, models that use cross-lingual pointwise mutual information perform better in Chinese-English cross-lingual word similarity calculation. Meanwhile, in English-English word similarity calculation, the proposed methods perform slightly better than English word embeddings trained with state-of-the-art methods.