In linear regression, when the number of candidate predictors (p) exceeds the sample size (n), and especially when p ≫ n, many classical statistical inference procedures may fail. Theoretical and empirical study of high-dimensional data analysis techniques is therefore necessary. This paper discusses three new problems that arise in high-dimensional data analysis and introduces six high-dimensional variable selection methods, including SIS and LASSO. In the simulation study, five evaluation criteria are used to compare the variable selection performance of the six methods. The comparison shows that performance is related to the p/n ratio: when p/n is high, SIS performs best; as the ratio decreases, and particularly once p < n, the five methods other than the square-root LASSO perform increasingly alike. In tax assessment, finer industry segmentation generally improves assessment quality, but it can leave the number of candidate predictors larger than the sample size, in which case high-dimensional variable selection techniques are needed. This paper applies the SIS method to model the VAT input tax of 13 subdivided industries in one city, and the results indicate that SIS selects variables effectively.
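For readers unfamiliar with SIS (sure independence screening), the core idea can be illustrated as ranking predictors by the absolute value of their marginal correlation with the response and retaining only the top d. The sketch below is a minimal illustration under stated assumptions, not the implementation used in this paper: the function names `marginal_correlations` and `sis_screen`, the toy data, and the cutoff d = ⌊n / log n⌋ (a commonly cited default, not taken from this abstract) are all hypothetical choices.

```python
import math
import random


def marginal_correlations(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    n = len(y)
    p = len(X[0])
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(v * v for v in y_dev))
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        c_dev = [v - c_mean for v in col]
        c_norm = math.sqrt(sum(v * v for v in c_dev))
        if c_norm == 0.0 or y_norm == 0.0:
            # A constant column (or constant response) carries no marginal signal.
            scores.append(0.0)
        else:
            dot = sum(a * b for a, b in zip(c_dev, y_dev))
            scores.append(abs(dot) / (c_norm * y_norm))
    return scores


def sis_screen(X, y, d):
    """Return the sorted indices of the d predictors with the largest scores."""
    scores = marginal_correlations(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Toy p >> n setting (assumed for illustration): only columns 0 and 1 matter.
random.seed(0)
n, p = 40, 400
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.5 * row[1] + random.gauss(0, 0.5) for row in X]

d = int(n / math.log(n))  # assumed cutoff, floor(n / log n)
selected = sis_screen(X, y, d)
print(len(selected), selected)
```

After screening, the retained d < n predictors can be passed to a conventional method such as least squares or LASSO, which is why a screening step remains usable even when p is far larger than n.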