Web-based translation mining for out-of-vocabulary (OOV) terms consists of three main steps: retrieving bilingual summaries, extracting candidate multi-word units, and selecting the best translation. The candidate multi-word unit extraction method and the best-translation selection method are improved to achieve higher translation mining accuracy. For candidate extraction, building on the hierarchically iterated log-likelihood ratio (LLR), an LLR method with hierarchical filtering based on internal information is proposed; compared with the LLR method, it lowers the noise ratio and improves accuracy by 5%. For best-translation selection, the candidate multi-word set is filtered using adjacency information based on left-right entropy (LRE), and the frequency-distance (F-D) model is combined with an LLR-based word-pair correlation model, which improves the recall of translation mining by 5% to 10%.
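To make the two statistics named above concrete, the following is a minimal sketch of the standard definitions only: Dunning's log-likelihood ratio over a 2x2 co-occurrence table, and the left and right neighbour entropy of a candidate unit. It is not the hierarchical filtering method itself. The function names (`llr`, `entropy`, `left_right_entropy`), the whitespace tokenisation, and the toy sentence are assumptions made for illustration; the thresholds and the iteration scheme of the proposed method are not given here and are not reproduced.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of (w1, w2) together, k12: w1 without w2,
    k21: w2 without w1, k22: neither.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing to the sum
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbour counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def left_right_entropy(tokens, unit):
    """Return (left_entropy, right_entropy) of the neighbours of `unit`."""
    unit = tuple(unit)
    n = len(unit)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == unit:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + n < len(tokens):
                right[tokens[i + n]] += 1
    return entropy(left), entropy(right)


# Toy data (hypothetical): "new" is always followed by "york",
# so its right entropy is zero, which suggests an incomplete unit.
tokens = "a new york b new york c".split()
full_unit = left_right_entropy(tokens, ("new", "york"))
fragment = left_right_entropy(tokens, ("new",))
```

In this toy sentence the complete unit `("new", "york")` has varied neighbours on both sides, while the fragment `("new",)` has a right entropy of zero. An entropy-based filter would typically discard candidates of the second kind, although the exact criterion used in this work is not stated in the abstract.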