Sentence boundary identification is a fundamental task in Tibetan information processing. This paper proposes an approach that combines rules with a maximum entropy model to identify Tibetan sentence boundaries. First, a detector based on a Tibetan boundary word list identifies ambiguous sentence boundaries. Second, a maximum entropy model identifies the ambiguous sentence boundaries that the rule-based detector cannot resolve. By applying Tibetan sentence boundary rules first, the approach reduces the number of boundaries that the maximum entropy model misclassifies because of sparse or low-quality training data. Experiments show that the approach performs well, reaching an F1 score of 97.78%.
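The two-stage cascade described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. It assumes that candidate boundaries are occurrences of the shad mark (U+0F0D) in a pre-tokenized sequence and that the rules only inspect the token before the candidate. The word-list entries `W_END` and `W_MID` are hypothetical placeholders, and `stub_model` merely stands in for a trained maximum entropy model.

```python
from typing import Callable, List, Optional, Sequence

SHAD = "\u0f0d"  # Tibetan mark shad; assumed here to be the candidate boundary marker

# Hypothetical placeholder entries. A real boundary word list would hold Tibetan tokens.
BOUNDARY_WORDS = {"W_END"}      # assumed: tokens that signal a sentence end before a shad
NON_BOUNDARY_WORDS = {"W_MID"}  # assumed: tokens that signal no sentence end before a shad


def rule_decision(tokens: Sequence[str], i: int) -> Optional[bool]:
    """Return True/False when the word list decides, or None when it cannot."""
    prev = tokens[i - 1] if i > 0 else ""
    if prev in BOUNDARY_WORDS:
        return True
    if prev in NON_BOUNDARY_WORDS:
        return False
    return None


def detect_boundaries(
    tokens: Sequence[str],
    fallback: Callable[[Sequence[str], int], float],
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of shad tokens judged to be sentence boundaries."""
    boundaries = []
    for i, tok in enumerate(tokens):
        if tok != SHAD:
            continue
        decision = rule_decision(tokens, i)
        if decision is None:
            # Rules could not decide: defer to the statistical model's probability.
            decision = fallback(tokens, i) >= threshold
        if decision:
            boundaries.append(i)
    return boundaries


def stub_model(tokens: Sequence[str], i: int) -> float:
    """Stand-in for a trained maximum entropy model (NOT a real classifier).

    It returns 1.0 only for a shad at the very end of the input.
    """
    return 1.0 if i == len(tokens) - 1 else 0.0


tokens = ["a", "W_END", SHAD, "b", "W_MID", SHAD, "c", SHAD]
result = detect_boundaries(tokens, stub_model)
```

In this toy input the first shad is accepted by the word list, the second is rejected by it, and the third is passed to the fallback model. Running the rules first means the statistical model only sees the cases the word list leaves open, which is the division of labor the abstract describes.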