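Filter-based feature selection for multi-label data selects features by means of feature evaluation and obtains candidate features quickly and effectively. However, most existing algorithms oversimplify the label set, treating it as a collection of independent labels and ignoring the relationships within it. In recent years, the MRMR algorithm has become one of the most popular filter-based feature selection algorithms because of its simple, fast, and efficient feature selection on single-label data. This paper proposes a filter-based multi-label feature selection algorithm based on MRMR (Max-Relevance Min-Redundancy), called ML-MRMR, which computes feature weights directly to capture the relationship between features and the multi-label set and thereby obtain a better candidate feature subset. In addition, the feature evaluation considers not only the interactions among features and between features and the labels, but also the relationships that may exist among the labels themselves; label correlation is incorporated into feature evaluation, yielding a metric suited to multi-label data. Finally, experimental results on real multi-label datasets show that the proposed algorithm can greatly reduce data dimensionality and stably and effectively improve classification performance on the reduced data.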
In machine learning and data mining, feature selection is one of the most important problems and has become the focus of much research in application areas for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. Feature selection seeks, from the entire set of features, the subset of most useful features that best represents the information hidden in the data. It can reduce the dimensionality of the original data, speed up the learning process, and yield comprehensible learning models with good generalization performance. Filter feature selection algorithms for multi-label data select features quickly and effectively by means of evaluation measures. However, many existing algorithms simplify the label set by treating the labels as independent, and thus neglect the interrelations among the labels. Recently, the MRMR (Max-Relevance Min-Redundancy) algorithm has become one of the most popular filter algorithms because it selects features effectively and efficiently for single-label data. This article therefore proposes a filter feature selection algorithm based on MRMR for multi-label data (ML-MRMR). By computing feature weights directly, it captures the relationship between features and the label set and thereby obtains a better candidate feature subset. Moreover, during feature evaluation the algorithm considers not only the interactions among features and between features and the labels, but also the possible correlation among the labels themselves. In this way, it integrates label correlation into feature evaluation and puts forward a new evaluation metric suited to multi-label data. Finally, experimental results on real multi-label datasets show that the algorithm can reduce the data dimensionality substantially and consistently improve classification accuracy.
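For orientation, the following is a minimal sketch of the standard greedy MRMR criterion that the proposed method builds on. It is not the ML-MRMR metric itself, which the abstract does not specify. The function names `mutual_information` and `mrmr_select` are hypothetical, features and labels are assumed to be discrete columns, and relevance is assumed to be the mean mutual information over the label columns. That assumption treats the labels as independent, which is exactly the simplification the abstract criticizes.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k feature indices by relevance minus redundancy.

    features, labels: lists of equally long discrete columns.
    Relevance: mean MI between a feature and each label column
    (an assumption; labels are treated independently here, so this
    is NOT the label-correlation-aware metric of ML-MRMR).
    Redundancy: mean MI between a feature and the selected features.
    """
    relevance = [
        sum(mutual_information(f, y) for y in labels) / len(labels)
        for f in features
    ]
    selected = []
    remaining = list(range(len(features)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(features[j], features[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Calling `mrmr_select(features, labels, k)` returns the chosen feature indices in selection order. A label-correlation-aware variant in the spirit of the abstract would replace only the relevance term, leaving the greedy loop unchanged.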