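Microarray data has been widely and successfully applied to cancer classification in biomedical research. A typical microarray dataset contains a large number of genes (usually thousands to tens of thousands, sometimes even hundreds of thousands) and a relatively small number of samples (often fewer than one hundred). Of these thousands of genes, only a small fraction contribute to cancer classification. Consequently, one of the most important problems in cancer classification is to identify the genes that contribute most to it. This identification process is called gene selection. Gene selection has been widely studied in statistical pattern recognition, machine learning, and data mining. This paper introduces the background and basic concepts of the gene selection problem; comprehensively reviews the approaches to gene selection from statistics, machine learning, and data mining; demonstrates through experiments the performance of several representative algorithms on microarray data; and identifies open problems and future research directions.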
Microarray data has been widely and successfully applied to cancer classification, where the purpose is to classify and predict the diagnostic category of a sample from its gene expression profile. A typical microarray dataset consists of expression levels for a large number of genes (usually thousands or tens of thousands) measured on a relatively small number of samples (often fewer than one hundred). Of these tens of thousands of genes, only a small number contribute to cancer classification. As a consequence, a basic and important question in cancer classification is how to identify a small subset of informative genes that contribute the most to the classification task. This procedure is usually called gene selection. Gene selection has been widely studied in statistical pattern recognition, machine learning, and data mining. Building on their earlier work, the authors review the field of gene selection: they introduce the background and the two basic concepts of gene selection (gene relevance and relevance measure), categorize the existing gene selection methods from statistics, machine learning, and data mining, demonstrate the performance of several representative gene selection algorithms through an empirical study on public microarray data, identify the existing problems of gene selection, and point out current trends and future directions.