近红外(NIR)光谱一般具有较多的波长变量数,对其直接或间接地进行变量选择是提高模型稳定性能及预测性能的关键。最小角回归(LAR)是一种相对较新和有效的机器学习算法,常用于进行回归分析和变量选择。面向光谱建模应用,提出一种LAR结合遗传偏最小二乘法(GA-PLS)的变量选择方法,可有效筛选出少数特征波长点。首先在全光谱区利用LAR消除变量间的共线性得到初筛波长点,然后用GA-PLS对LAR筛选出的波长点进一步优选从而得到最终建模用的特征波长点。为验证本文方法的有效性,以药片和汽油的近红外光谱回归分析作为应用案例,对原光谱进行预处理后,采用该方法进行变量筛选,然后分别建模其中的活性成分含量和C10含量。结果显示,在这两个应用中,最终优化得到的特征波长点数均只需七个,而两者的预测决定系数R2p分别达到0.933 9和0.951 9,与全光谱、无信息变量消除法(UVE)和连续投影算法(SPA)等方法相比,特征波长点更少,同时R2p和预测均方根误差RMSEP值更优。因此,LAR结合GA-PLS,能有效地从近红外光谱中选择出信息变量从而减少建模波数,提高预测精度,拥有较好的模型解释性。该方法可为特定领域的专用光谱仪设计提供有效的波长筛选工具。
Near-infrared (NIR) spectra usually contain a large number of wavelength variables, so direct or indirect variable selection is key to improving the stability and predictive performance of a model. Least angle regression (LAR) is a relatively new and effective machine learning algorithm that is commonly used for regression analysis and variable selection. For spectral modeling applications, this paper proposes a variable selection method that combines LAR with the genetic algorithm-partial least squares (GA-PLS) method and can effectively screen out a small number of characteristic wavelength points. First, LAR is applied over the full spectral region to eliminate multicollinearity among the variables and obtain a pre-screened set of wavelength points; GA-PLS is then used to further optimize the points selected by LAR, yielding the final characteristic wavelength points used for modeling. To verify the effectiveness of the method, regression analysis of the NIR spectra of tablets and gasoline is used as a pair of application cases. After the original spectra are preprocessed, the method is used to select variables, and models are then built for the active ingredient content (tablets) and the C10 content (gasoline). The results show that only 7 characteristic wavelength points are needed in each of the two applications, and the coefficients of determination of prediction (R2p) reach 0.933 9 and 0.951 9, respectively. Compared with full-spectrum modeling, uninformative variable elimination (UVE), and the successive projections algorithm (SPA), the proposed method uses fewer characteristic wavelength points and gives better R2p and root mean square error of prediction (RMSEP) values. Therefore, LAR combined with GA-PLS can effectively select informative variables from NIR spectra, which reduces the number of wavelengths used for modeling, improves prediction accuracy, and gives good model interpretability. The method can serve as an effective wavelength selection tool for the design of dedicated spectrometers for specific fields.