Metabolomics data are inevitably affected by a variety of stimuli, and reducing the influence of interfering factors is an important task in metabolomics data preprocessing. This paper analyzes in detail how the variance of metabolomics data is composed and how it is distributed in feature space, and on that basis proposes a new method that filters out unknown interfering factors and thereby increases the significance of the factors of interest. The effectiveness of the new filtering algorithm is verified on real metabolomics data and compared with the orthogonal signal correction (OSC) method. The experimental results show that the new filter suppresses the influence of unknown interfering factors while retaining well both the information on the factors of interest and the intrinsic individual differences among organisms. It also lowers the risk of model overfitting and makes the results of subsequent statistical analysis more reliable.
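For readers unfamiliar with the family of filters that OSC belongs to, the following is a minimal sketch of the general idea: remove from the data matrix a component of variation that is unrelated to the factor of interest. It is not the new method proposed in this paper, and it is not the original OSC algorithm either; it is a simplified filter in the spirit of direct orthogonalization, written under the assumption of a single numeric response `y`. The function name `orthogonal_filter` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def orthogonal_filter(X, y, n_components=1):
    """Illustrative filter: remove leading y-orthogonal variation from X.

    X: (n_samples, n_features) data matrix.
    y: (n_samples,) numeric factor of interest (assumption: one response).
    """
    Xc = X - X.mean(axis=0)                      # column-center the data
    yc = (y - y.mean()).reshape(-1, 1)           # center the response
    # Part of X that cannot be explained by y (projection onto y removed)
    X_orth = Xc - yc @ (np.linalg.pinv(yc) @ Xc)
    # Leading principal directions of the y-orthogonal part
    _, _, Vt = np.linalg.svd(X_orth, full_matrices=False)
    W = Vt[:n_components].T                      # (n_features, k) loadings
    T = Xc @ W                                   # scores of the full data
    return Xc - T @ W.T                          # filtered, centered data
```

Because such a filter deletes the largest source of variation that is uncorrelated with `y`, it can also discard genuine individual differences, which is the kind of trade-off the comparison with OSC in this paper addresses.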