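Open-domain question answering (QA) systems can give users relatively concise and accurate results, and they are attracting growing attention. Question classification, which assigns questions to a number of semantic types, is an important module of a QA system, and its accuracy directly affects the system's performance. To improve classifier performance, ensemble learning is applied to the question classification task, and experiments compare different classification features, including lexical features, syntactic features, and synsets, as well as classifier ensemble methods such as error-driven learning, voting, and BP neural networks. By adopting an error-driven ensemble classifier in which the rule-based method TBL supplements the statistical method SVM, and by using linguistic knowledge such as synsets and noun hypernyms from WordNet and dependency relations from Minipar as classification features, higher classification accuracy is achieved on a public test set.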
As a very active branch of natural language processing, open-domain question answering (QA) has attracted increasing attention, because a QA system can understand questions posed in natural language and return compact, exact results to its users. Question classification (QC), i.e., assigning each question to one of several semantic categories, is an important component of a QA system and directly affects the system's performance in selecting correct answers. Its main task is to understand what the user is asking for. In this paper, to investigate automatic question classification, different classification features are compared, such as bag-of-words, bigrams, synsets from WordNet, and dependency structures from Minipar. Experiments are conducted with support vector machines (SVM) and with machine learning ensemble approaches such as transformation-based error-driven learning (TBL), voting, and back-propagation artificial neural networks (BP). The question classification algorithm presented in this paper is compared with single-feature SVM classifiers, multi-feature SVM classifiers, and the BP and voting ensemble methods. The method, which combines multiple SVM classifiers through a TBL algorithm and uses linguistic knowledge such as synsets from WordNet and dependency structures from Minipar as question representations, proves more accurate on an open question classification corpus. Using dependency structure yields a 1.8% improvement over not using it.
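The two ensemble ideas named above, voting and error-driven (TBL-style) correction of a base classifier, can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the algorithm evaluated in the paper: the base classifier's outputs are hard-coded stand-ins for SVM predictions, the single rule template "if feature f is present and the current label is a, relabel as b" is hypothetical, and the function names, questions, and labels are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def apply_rules(rules, feats, label):
    """Apply learned rules in order to one question's current label."""
    for feature, old, new in rules:
        if feature in feats and label == old:
            label = new
    return label


def learn_tbl_rules(samples, initial, gold, max_rules=10):
    """Greedy error-driven rule learning (hypothetical rule template).

    samples: list of feature sets; initial: base-classifier labels;
    gold: reference labels. Each round keeps the rule with the best
    net gain (errors fixed minus correct labels broken).
    """
    current = list(initial)
    rules = []
    for _ in range(max_rules):
        # Count how many current errors each candidate rule would fix.
        fixed = Counter()
        for feats, cur, ref in zip(samples, current, gold):
            if cur != ref:
                for f in feats:
                    fixed[(f, cur, ref)] += 1
        best, best_gain = None, 0
        for rule, n_fixed in sorted(fixed.items()):  # sorted for determinism
            f, old, _new = rule
            # Correct labels the rule would overwrite count against it.
            broken = sum(
                1 for feats, cur, ref in zip(samples, current, gold)
                if f in feats and cur == old and ref == old
            )
            if n_fixed - broken > best_gain:
                best, best_gain = rule, n_fixed - broken
        if best is None:  # no rule improves accuracy any further
            break
        rules.append(best)
        current = [apply_rules([best], s, c) for s, c in zip(samples, current)]
    return rules


# Toy data: bag-of-words features; `initial` stands in for SVM output.
samples = [
    {"who", "wrote", "hamlet"},
    {"who", "is", "president"},
    {"where", "is", "paris"},
    {"who", "founded", "rome"},
    {"where", "is", "rome"},
]
gold = ["HUM", "HUM", "LOC", "HUM", "LOC"]
initial = ["HUM", "LOC", "LOC", "LOC", "LOC"]

rules = learn_tbl_rules(samples, initial, gold)
corrected = [apply_rules(rules, s, c) for s, c in zip(samples, initial)]
```

In this sketch the learned rules act as a post-processing layer: the statistical classifier labels every question first, and the rules only overwrite labels where the training data shows a consistent, correctable error pattern. In a fuller setup the features could also include synsets or dependency relations, and the initial labels could come from several SVM classifiers.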