This paper studies outlier diagnostics for leverage-based sampling from large data sets. First, we discuss the statistical properties of the parameter estimators under the case-deletion model and construct four diagnostic statistics for outliers. Second, we construct three test statistics for the hypothesis test on the shift parameter of the mean-shift model. Finally, simulations and a real-data analysis support our conclusion: outlier diagnosis has an important effect on leverage-based sampling estimation for large data sets.
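To make the setting concrete, the following is a minimal sketch of leverage-based subsampling followed by a case-deletion diagnostic on the subsample. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the four diagnostic statistics, so the standard Cook's distance is used here as a stand-in, and the function names (`leverage_scores`, `leverage_subsample_fit`, `cooks_distance`) are hypothetical.

```python
import numpy as np


def leverage_scores(X):
    """Diagonal of the hat matrix H = X (X^T X)^{-1} X^T, computed via thin QR."""
    Q, _ = np.linalg.qr(X)          # Q has orthonormal columns spanning col(X)
    return np.sum(Q ** 2, axis=1)   # h_ii = squared norm of the i-th row of Q


def leverage_subsample_fit(X, y, r, rng):
    """Draw r rows with probability proportional to leverage, then fit
    inverse-probability-weighted least squares on the subsample."""
    n, _ = X.shape
    h = leverage_scores(X)
    prob = h / h.sum()              # h.sum() equals p when X has full column rank
    idx = rng.choice(n, size=r, replace=True, p=prob)
    sw = np.sqrt(1.0 / (r * prob[idx]))   # square roots of the sampling weights
    Xs = X[idx] * sw[:, None]       # rescaled rows, so OLS on (Xs, ys) is WLS
    ys = y[idx] * sw
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return idx, Xs, ys, beta


def cooks_distance(Xs, ys, beta):
    """Cook's distance for each row of the (rescaled) subsample.
    Assumption: used as a stand-in for the paper's diagnostic statistics."""
    r, p = Xs.shape
    h = leverage_scores(Xs)
    resid = ys - Xs @ beta
    s2 = resid @ resid / (r - p)    # residual variance estimate
    return resid ** 2 * h / (p * s2 * (1.0 - h) ** 2)
```

In use, one would call `leverage_subsample_fit` with a seeded `np.random.default_rng`, pass its outputs to `cooks_distance`, and inspect the sampled rows with the largest values. Because outliers in the design space tend to have high leverage, they are sampled with high probability, which is why a diagnostic step after sampling is of interest.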