该文针对大规模汉语语音检索任务提出汉语语音检索中的集外词问题和针对集外查询词的两阶段检索方法。汉语语音识别和检索中,集外词可以以词表词序列的形式被识别和检索到,因此被认为不存在集外词问题;该文发现集外查询词性能远远低于集内查询词,将此问题定义为汉语语音检索任务的集外词问题,并提出两阶段的检索方法,第一阶段通过模糊音素匹配的方法提高查全率,第二阶段通过词格修正的方法提高查准率。实验表明,两阶段的检索方法极大的提高了典型集外查询词的检索性能,FOM指标相对基线系统提高了24.1%。
While the Out-of-Vocabulary (OOV) problem remains a challenge for English spoken term detection tasks, it is underestimated for Chinese. This is because a Chinese OOV query term can still be matched as a sequence of Chinese characters, with each character itself being a word in the vocabulary. However, our experiments show that search accuracy levels differ significantly when a query is or is not in the vocabulary. We examine this problem with a word-lattice-based spoken term detection task. We propose a two-stage method by first locating candidates by partial phonetic matching and then refining the matching score with word lattice rescoring. Experiments show that the proposed method achieves a 24. 1% relative improvement for OOV queries on a large-scale Chinese spoken term detection task.