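In nuclear magnetic resonance (NMR)-based metabolomic data analysis, scaling is one of the key preprocessing steps; its main purpose is to adjust the variance structure of the data so as to improve the results of subsequent multivariate statistical analysis. From the perspective of information entropy, Kullback-Leibler (K-L) divergence is used to measure the degree of difference between the ¹H NMR spectral data of biological samples from different experimental groups and, combined with unit variance scaling, a K-L divergence-based scaling method is proposed. The method first uses unit variance scaling to bring the standard deviations of all variables to the same level, and then applies supervised weighting to each variable using K-L divergence, enhancing important variables and attenuating irrelevant ones. Because K-L divergence measures the difference between datasets in terms of their probability distributions and is applicable to both Gaussian and non-Gaussian data, it can measure the differences between the ¹H NMR spectral data of samples from different experimental groups more accurately, and can therefore identify and weight the important variables of the spectral data more effectively. Analysis of ¹H NMR spectral data from human urine shows that the K-L divergence-based scaling method effectively suppresses noise variables while clearly separating characteristic variables from non-characteristic ones; improves the discriminative ability of the principal component regression (PCR) model; and improves the interpretability, the predictive ability, and the ability to identify characteristic metabolites of the partial least squares discriminant analysis (PLS-DA) model.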
A new scaling method based on Kullback-Leibler (K-L) divergence is proposed for NMR metabolomic data. The proposed method (called K-L scaling) is a supervised scaling method, as group information is incorporated in the scaling procedure. Notably, because K-L divergence measures the difference between two datasets through their probability distributions, it can be used to analyze data that follow either Gaussian or non-Gaussian distributions. In K-L scaling, all variables are first standardized to unit variance, and their variance is then adjusted using K-L divergence to highlight the significant variables. K-L scaling can effectively detect differences in spectral data points between two experimental groups; it thereby increases the weights of biologically relevant variables while reducing the weights of noise and uninformative variables. The developed method was applied to a ¹H NMR metabolomic dataset acquired from human urine. Analysis of the dataset showed that the new scaling method is efficient in suppressing the contribution of noise to the resulting multivariate model. In addition, it increases the weights of important variables and improves the interpretability and predictability of subsequent principal component regression (PCR) and partial least squares discriminant analysis (PLS-DA) models. Furthermore, the scaling method facilitated the identification of metabolic signatures. These results suggest that the developed K-L scaling method may become a useful alternative for the preprocessing of NMR-based metabolomic data.
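
The two-step procedure described in the abstract (unit variance scaling followed by supervised K-L weighting) can be illustrated with a minimal sketch. The abstract does not say how the K-L divergence is estimated or how the weight is applied, so the following choices are assumptions made for illustration only: the divergence is estimated per variable from histograms with a small pseudo-count, it is symmetrised so that the two groups are treated alike, and the unit-variance-scaled variable is simply multiplied by it. The names `kl_divergence_hist`, `kl_scale`, `bins` and `pseudo` are hypothetical and do not come from the paper.

```python
import numpy as np


def kl_divergence_hist(a, b, bins=10, pseudo=0.5):
    """Histogram estimate of a symmetrised K-L divergence between two samples.

    Assumption: shared bin edges and a pseudo-count so that empty bins do not
    produce infinite values. The paper's estimator may differ.
    """
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    if hi == lo:  # constant variable: the groups are indistinguishable
        return 0.0
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    p = (p + pseudo) / (p + pseudo).sum()
    q = (q + pseudo) / (q + pseudo).sum()
    # D(p||q) + D(q||p)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def kl_scale(X, y, bins=10):
    """Unit variance scaling followed by per-variable K-L weighting.

    X: (n_samples, n_variables) spectral data matrix.
    y: group labels with exactly two distinct values.
    Returns the scaled matrix and the vector of K-L weights.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    groups = np.unique(y)
    if groups.size != 2:
        raise ValueError("this sketch assumes exactly two experimental groups")

    # Step 1: unit variance scaling (mean-centre, divide by standard deviation).
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant variables at zero after centring
    Z = (X - X.mean(axis=0)) / std

    # Step 2: supervised weighting of each variable by its between-group divergence.
    in_a, in_b = y == groups[0], y == groups[1]
    w = np.array([kl_divergence_hist(Z[in_a, j], Z[in_b, j], bins)
                  for j in range(Z.shape[1])])
    return Z * w, w


# Toy data: variable 0 differs between groups, variable 1 is pure noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 40)
data = rng.normal(size=(80, 2))
data[labels == 1, 0] += 4.0
scaled, weights = kl_scale(data, labels)
```

In this toy example the group-separating variable receives a larger weight than the noise variable, which is the qualitative behaviour the abstract attributes to K-L scaling. The histogram estimator is only one of several ways to approximate the divergence, and its result depends on the number of bins.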