Interactive texts have short sentences with missing constituents, and their class distributions are imbalanced across multiple domains, which leads to high dimensionality, sparse feature values, and scarce positive instances. To address these difficulties, a data-level sampling method based on target-dataset-oriented instance transfer is proposed. First, a function is proposed to select the features used to evaluate instance similarity between the source and target datasets; it computes the sum of the information gains of the Top-N features common to the two datasets and the proportion those gains take in the sum. Second, a consistency (homogenization) processing method for the feature spaces of the target and source datasets is presented to overcome the inconsistency between them. Third, a domain-wise method that selects instances from a domain of the source dataset and transfers them to the corresponding domain of the target dataset is adopted to address the imbalanced class distribution across multiple domains. Experimental results show that the proposed method effectively alleviates the class imbalance of interactive texts: across four classic classification algorithms, namely support vector machine, random forest, naive Bayes, and random committee, the weighted average receiver operating characteristic (ROC) metric improves by 11.3%.
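The feature-selection step described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the Top-N function, so everything below is an assumption made for illustration: documents are modelled as sets of tokens, each feature is the binary event "token occurs in the document", a common feature is scored by adding its information gain in the source dataset to its information gain in the target dataset, and the "proportion" is taken as the share of the total score held by the N best features. The names `entropy`, `information_gain`, and `top_n_share` are hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def top_n_share(src_docs, src_labels, tgt_docs, tgt_labels, n):
    """Return the n best common features and their share of the total score.

    Assumption: a feature's score is its information gain in the source
    dataset plus its information gain in the target dataset.
    """
    # Only features present in both datasets are candidates.
    common = set().union(*src_docs) & set().union(*tgt_docs)
    scores = {
        t: information_gain(src_docs, src_labels, t)
           + information_gain(tgt_docs, tgt_labels, t)
        for t in common
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    top = ranked[:n]
    total = sum(scores.values())
    share = sum(scores[t] for t in top) / total if total > 0 else 0.0
    return top, share
```

Under these assumptions, one plausible use is to increase `n` until the returned share passes a chosen threshold, then project both datasets onto the selected features so that instance similarity is computed in a shared feature space.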