贝叶斯网络是用概率方法解决分类问题的有效工具,但学习贝叶斯网络是一个non-deterministic polynomial-time(NP)难题。以往的限制性学习算法大都假设网络结构中的结点具有基本相同的父结点数目,这往往与现实不相符的。为了学习更符合实际数据分布的限制性网络结构,进一步提高分类器的性能,本文对网络中每一个结点单独限制其父结点的数目,各个结点间是否存在父子关系是由它们之间的依赖强度所决定的.本文采用条件互信息方法度量依赖关系,这是因为条件互信息方法不但能够度量网络中各个结点之间的依赖关系,而且能够从整体上对网络结构性能进行打分.条件互信息的分解属性可以将这两者联系起来,通过对每一个结点局部限制的策略,可实现整体网络结构优化.基于这些思想,本文提出了一种学习限制性贝叶斯网络结构的局部打分搜索算法,通过此算法在20个加州大学欧文分校(University of California,Ⅳ Vine,UCI)的标准数据挖掘数据集合上与BDeu打分算法,基于最小描述长度的打分算法(minimum description length,MDL)打分算法,基于条件互信息的打分算法(conditional mutual information,CMI)打分算法和tree augmented naive bayes(TAN)算法等的比较,充分表明了本文所提出的策略具有较低的平均误分类率。
A Bayesian network is an efficient tool to deal with classification problem using probabilistic method. However, automatically identifying high-scoring Bayesian network structures from data is an non-deterministic polynomial-time hard(NP-hard) problem. Fortunately, a restricted Bayesian network serves as a bridge between Bayesian theory and practical applications. Recent work in supervised learning has shown that a surprisingly restricted Bayesian network classifier with strong assumptions of the same number of each conditional variable's parents, such as tree augmented naive bayes(TAN) model, is competitive with state of the art classifiers. This fact raises the question of whether a classifier with less restrictive assumptions could perform even better. In this paper,it tries to set different number of parents for different nodes in the Bayesian network structure. The strength of dependence related to two nodes decides whether there is a link between them. It adopts conditional mutual information testing as the measure method for dependence relationship. This is because conditional mutual information testing can not only measure the dependent strength between two nodes, but also score the performance of a Bayesian network structure. Crucially, this kind of measure has a property of additive decomposition of itself. Using this property, each node can be restricted individually, and the learned Bayesian network structure will be more suitable to the distribution of a dataset. According to the analysis above, a local scoring-search algorithm is proposed, which still based on conditional mutual information theory to build this kind of restricted networks. Experimental results have shown that it is more efficient than most common scoring methods, such as BDeu scoring algorithm, minimum description length (MDL) scoring algorithm, conditional mutual information(CMI) scoring algorithm and TAN algorithm, on 20 datasets obtained from the university of california IV vine(UCI) repository of machine le