Feature selection in high-dimensional sparse feature spaces is an open problem in machine learning research. The prevailing 1-norm regularized approaches share some theoretical drawbacks: they lack the ability to select grouped features, and they cannot select more features than the sample size. This paper considers the sparse modeling problem from the perspective of stochastic complexity theory and, from a minimax lower bound on model redundancy, derives an easily computable feature-selection model based on a zero-norm constraint. The proposed approach is provably optimizable and performs automatic feature selection like its 1-norm penalized alternatives, while overcoming their drawbacks. Furthermore, it does not rely on any parametric assumptions about the true data-generating mechanism, which makes it broadly applicable. Experiments on both synthetic data and real gene-expression datasets show that the proposed approach performs comparably to the popular 1-norm penalized counterparts in ordinary experimental setups, and outperforms several recently published methods in robustness and predictive accuracy on extremely sparse, high-dimensional problems.
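To make the zero-norm idea concrete, the following is a minimal illustrative sketch of greedy forward selection under a zero-norm budget (an orthogonal-matching-pursuit-style stand-in, not the paper's derived algorithm). The function name `l0_greedy_select` and the synthetic data are assumptions for illustration; note that the sketch operates in the p > n regime, where a 1-norm (lasso) penalty could select at most n features.

```python
import numpy as np

def l0_greedy_select(X, y, k):
    """Greedy forward selection under a zero-norm budget of k features.

    Illustrative OMP-style stand-in for zero-norm constrained selection;
    NOT the paper's exact algorithm.
    """
    n, p = X.shape
    support = []
    residual = y.copy()
    for _ in range(k):
        scores = np.abs(X.T @ residual)   # correlation with current residual
        scores[support] = -np.inf         # exclude already-selected features
        support.append(int(np.argmax(scores)))
        Xs = X[:, support]                # refit least squares on the support
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        residual = y - Xs @ coef
    return sorted(support)

# Synthetic p >> n setup: 200 candidate features, 100 samples,
# with only 3 truly relevant features.
rng = np.random.default_rng(0)
n, p, k = 100, 200, 3
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[5, 17, 42]] = [4.0, -3.0, 5.0]      # true sparse support
y = X @ beta + 0.01 * rng.standard_normal(n)
selected = l0_greedy_select(X, y, k)
print(selected)
```

With this strong signal-to-noise setup, the greedy zero-norm budget recovers the true support even though the feature dimension exceeds the sample size, which is the regime where 1-norm methods hit their selection ceiling.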