This paper proposes a semi-supervised approach to gender classification that exploits the multiple types of text a user produces on a micro-blog platform (e.g., original posts and reposts). The approach is based on co-training and aims to reduce the classifier's dependence on large amounts of labeled data. First, each text type is treated as a separate, independent view. Second, within each view, an LSTM classifier selects the unlabeled samples that it labels with the highest confidence. Finally, the selected samples are added to the training set, and the model is retrained iteratively. Experimental results show that the approach makes effective use of unlabeled data and clearly outperforms other existing semi-supervised approaches to gender classification.
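The three steps of the co-training procedure can be illustrated with a minimal sketch. It is not the authors' implementation: to stay self-contained it replaces the per-view LSTM classifier with a small naive Bayes model, and the names `NaiveBayesView` and `co_train`, the parameters `rounds` and `per_round`, the shared labeled set, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayesView:
    """Stand-in for the per-view LSTM classifier (an assumption for brevity)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict_proba(self, text):
        n_docs = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.class_counts[c] / n_docs)
            for w in text.split():
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[c][w] + 1) / total)
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def train_views(labeled, n_views):
    """Train one classifier per view on the current labeled set."""
    labels = [y for _, y in labeled]
    return [NaiveBayesView().fit([v[i] for v, _ in labeled], labels)
            for i in range(n_views)]


def co_train(labeled, unlabeled, n_views=2, rounds=5, per_round=1):
    """labeled: list of (views, label); unlabeled: list of views.
    `views` is a tuple holding one text per view (e.g. original, repost)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        models = train_views(labeled, n_views)
        picked = {}  # pool index -> (predicted label, confidence)
        for i, model in enumerate(models):
            scored = []
            for idx, views in enumerate(pool):
                proba = model.predict_proba(views[i])
                label = max(proba, key=proba.get)
                scored.append((proba[label], idx, label))
            scored.sort(reverse=True)  # most confident first
            for conf, idx, label in scored[:per_round]:
                if idx not in picked or conf > picked[idx][1]:
                    picked[idx] = (label, conf)
        # Move the selected samples, with their predicted labels,
        # from the unlabeled pool into the labeled set.
        for idx in sorted(picked, reverse=True):
            labeled.append((pool.pop(idx), picked[idx][0]))
    return train_views(labeled, n_views), labeled


# Toy data (hypothetical): view 0 = original post, view 1 = repost.
seed = [(("football beer game", "rt match score"), "male"),
        (("makeup dress shopping", "rt fashion sale"), "female")]
pool = [("football game tonight", "rt score update"),
        ("dress shopping today", "rt fashion week")]
models, final = co_train(seed, pool)
```

In this sketch each view nominates its most confident unlabeled users, and the nominated users join a labeled set shared by all views before the next round. Swapping `NaiveBayesView` for an LSTM would only require an object with the same `fit` and `predict_proba` methods.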