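This paper proposes a graph-based semi-supervised learning algorithm for Web page classification. A weighted graph is built with the k-nearest-neighbor algorithm; its nodes are labeled or unlabeled Web pages, and each edge weight represents the probability of propagating a class, so that Web page classification is formalized as the probabilistic propagation of classes over the graph. To make effective use of the unlabeled nodes in the graph, the edge weights between pages are computed from both the content information and the link information of the pages, and class information is pushed from the labeled nodes to the unlabeled nodes with a certain probability. Experiments show that the proposed algorithm effectively improves Web page classification results.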
This paper proposes a graph-based semi-supervised learning method and applies it to Web document classification. A weighted graph is constructed with the k-nearest-neighbor algorithm; its nodes are labeled and unlabeled Web pages, and its edge weights represent the similarity between nodes. To exploit unlabeled data for higher classification accuracy, the edge weights are computed by combining content-based weighting schemes with the link information of Web pages. Using probabilistic matrix methods and belief propagation, the labeled nodes push their labels out to the unlabeled nodes, so that the learning problem is formulated as label propagation in a graph. Experiments on the WebKB dataset indicate that the graph-based semi-supervised learning method can improve the effectiveness of Web document classification.
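The pipeline described above (k-nearest-neighbor graph construction, edge weights that mix content and link information, and propagation of labels from labeled to unlabeled nodes) can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: it uses the standard iterative label propagation scheme with clamped labeled nodes rather than belief propagation, and the function names, the linear mixing weight `alpha`, and the toy similarity values are all assumptions made for illustration.

```python
import numpy as np


def knn_graph(content_sim, link_adj, k=2, alpha=0.5):
    """Build a symmetric k-NN weighted graph.

    content_sim: (n, n) content similarity between pages (assumed given).
    link_adj:    (n, n) 0/1 matrix, 1 where two pages are linked.
    alpha:       hypothetical mixing weight between content and links.
    """
    # Assumed combination rule: a simple convex mix of the two signals.
    sim = alpha * content_sim + (1.0 - alpha) * link_adj
    np.fill_diagonal(sim, 0.0)  # no self-loops
    weights = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        neighbors = np.argsort(sim[i])[::-1][:k]  # k most similar pages
        weights[i, neighbors] = sim[i, neighbors]
    # Keep an edge if either endpoint selected the other as a neighbor.
    return np.maximum(weights, weights.T)


def propagate_labels(weights, labels, n_classes, n_iter=100):
    """Iterative label propagation; labels uses -1 for unlabeled pages."""
    row_sums = weights.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for isolated nodes
    transition = weights / row_sums  # row-stochastic propagation probabilities
    labeled = labels >= 0
    seed = np.zeros((weights.shape[0], n_classes))
    seed[labeled, labels[labeled]] = 1.0
    scores = seed.copy()
    for _ in range(n_iter):
        scores = transition @ scores      # push class mass along edges
        scores[labeled] = seed[labeled]   # clamp the known labels
    return scores.argmax(axis=1), scores


# Toy example: six pages forming two topical groups, one labeled page each.
content_sim = np.full((6, 6), 0.1)
content_sim[:3, :3] = 0.9
content_sim[3:, 3:] = 0.9
link_adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (3, 4), (4, 5)]:
    link_adj[a, b] = link_adj[b, a] = 1.0
labels = np.array([0, -1, -1, 1, -1, -1])

graph = knn_graph(content_sim, link_adj, k=2, alpha=0.5)
predicted, _ = propagate_labels(graph, labels, n_classes=2)
print(predicted)
```

In this toy graph the two groups end up disconnected, so each unlabeled page receives class mass only from the labeled page in its own group. On real data, the choice of `k` and of the content/link mix determines how much label information crosses group boundaries.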