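Graph-based semi-supervised learning has been studied extensively in recent years; however, most existing semi-supervised learning algorithms apply only to homogeneous networks. A query-document heterogeneous information network is constructed from the content features of queries and documents and the click relationships between them, and the discriminative information of samples is introduced to strengthen the network structure. A regularization framework and an iterative algorithm for semi-supervised clustering on the query-document heterogeneous information network are proposed. Within the regularization framework, a cost function on the heterogeneous information network is constructed under the manifold assumption, and its closed-form solution is derived and used to predict the class labels of unlabeled queries and documents. Experiments on the query log of a large-scale commercial search engine show that the proposed method outperforms traditional semi-supervised learning methods.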
Various graph-based algorithms for semi-supervised learning have been proposed in the recent literature. However, although classification on homogeneous networks has been studied for decades, classification on heterogeneous networks has been explored only recently. The semi-supervised classification problem is considered on a query-document heterogeneous information network, which combines the query-document bipartite graph with content information from both sides. To strengthen the network structure, class information of sample nodes is introduced. A semi-supervised learning algorithm is investigated that builds on two frameworks: a novel graph-based regularization framework and an iterative framework. In the regularization framework, a new cost function is developed that accounts for both the direct relationship between the two entity sets and the content information from both sides, which leads to a significant improvement over the baseline methods. Experimental results demonstrate that the proposed method achieves the best performance among the compared methods, with consistent improvements.
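As a rough illustration of the iterative framework, the sketch below runs the standard normalized label propagation scheme (F ← αSF + (1 − α)Y, with S = D^(-1/2) W D^(-1/2)) on a toy query-document graph. It is not the cost function or algorithm of this paper, whose exact form is not given in the abstract. The six-node graph, the edge weights, the value of `alpha`, and the helper name `propagate_labels` are all assumptions made for illustration.

```python
import math


def propagate_labels(weights, seeds, num_classes, alpha=0.8, iterations=200):
    """Generic normalized label propagation (illustrative, not the paper's method).

    weights: symmetric non-negative adjacency matrix as a list of lists.
    seeds:   dict mapping a labeled node index to its class index.
    Returns the predicted class index for every node.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]

    # Symmetric normalization S = D^{-1/2} W D^{-1/2}; zero entries stay zero.
    s = [
        [
            weights[i][j] / math.sqrt(degree[i] * degree[j]) if weights[i][j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]

    # Y holds the one-hot seed labels; unlabeled rows are all zero.
    y = [[0.0] * num_classes for _ in range(n)]
    for node, label in seeds.items():
        y[node][label] = 1.0

    # Iterate F <- alpha * S F + (1 - alpha) * Y.
    f = [row[:] for row in y]
    for _ in range(iterations):
        f = [
            [
                alpha * sum(s[i][j] * f[j][c] for j in range(n)) + (1 - alpha) * y[i][c]
                for c in range(num_classes)
            ]
            for i in range(n)
        ]

    return [max(range(num_classes), key=lambda c: f[i][c]) for i in range(n)]


# Hypothetical toy network: nodes 0-2 are queries, nodes 3-5 are documents.
# Query-document edges stand in for clicks; query-query and document-document
# edges stand in for content similarity. All weights are made up.
edges = [
    (0, 3, 1.0),  # click: q0 -> d0
    (1, 3, 1.0),  # click: q1 -> d0
    (0, 1, 0.5),  # content similarity: q0 ~ q1
    (2, 4, 1.0),  # click: q2 -> d1
    (2, 5, 1.0),  # click: q2 -> d2
    (4, 5, 0.5),  # content similarity: d1 ~ d2
    (1, 4, 0.1),  # weak click link between the two groups
]
weights = [[0.0] * 6 for _ in range(6)]
for a, b, w in edges:
    weights[a][b] = w
    weights[b][a] = w

# One labeled query (class 0) and one labeled document (class 1).
predictions = propagate_labels(weights, seeds={0: 0, 5: 1}, num_classes=2)
print(predictions)
```

In this generic scheme the iteration converges, for 0 < α < 1, to the closed form F* = (1 − α)(I − αS)^(-1) Y, which mirrors the abstract's pairing of a closed-form solution with an iterative algorithm. How the paper's own cost function weights click edges against content edges is not specified here.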