Time series are an important class of objects in data mining. When time series are analyzed without accounting for time lags between them, correlation can be misjudged, so correlation and time lag constrain each other. This paper studies the correlation and synchrony of time series and gives a correlation identification method and a curve registration method for pairs of series. First, the causes and characteristics of two types of correlation error are analyzed from the viewpoint of time warping. Next, bounds on the correlation coefficient at a given significance level are derived from its asymptotic distribution, and the two results are combined into a correlation identification method based on the characteristics of the correlation coefficients of time-shifted series. Finally, a curve registration model that maximizes the correlation coefficient is proposed; it applies to a wider range of cases than the AISE criterion. The time warping function of the model is solved with a smoothing generalized expectation maximization (S-GEM) algorithm. Numerical experiments on simulated and real data show that, in identifying spurious regression, the proposed correlation identification method is more effective than the three conventional correlation coefficients and the Granger causality test, and that the proposed S-GEM algorithm clearly outperforms the continuous monotone registration method (CMRM), self-modeling registration (SMR), and maximum likelihood registration (MLR) in most cases. The paper considers linear correlation between two series and functional curve registration. The results provide a theoretical basis for correlation identification and time alignment in regression analysis, and a reference direction for correlation analysis and curve registration of multiple series.
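The idea of judging correlation from time-shifted series can be illustrated with a minimal Python sketch. It is not the method proposed in the paper: the function names `lagged_correlations` and `significance_bound` are hypothetical, the synthetic data are generated only for illustration, and the bound uses the simple large-sample approximation r ~ N(0, 1/n) for independent white-noise series rather than the bound derived in the paper.

```python
import numpy as np


def lagged_correlations(x, y, max_lag):
    """Pearson correlation between x[t] and y[t + k] for each lag k in [-max_lag, max_lag]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    result = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            a, b = x[:n - k], y[k:]      # pair x[t] with y[t + k]
        else:
            a, b = x[-k:], y[:n + k]     # pair x[t] with y[t - |k|]
        result[k] = float(np.corrcoef(a, b)[0, 1])
    return result


def significance_bound(n, z_crit=1.96):
    """Approximate two-sided bound on |r| under zero correlation.

    Assumption: r is roughly N(0, 1/n) for large n and independent
    white-noise series; z_crit = 1.96 corresponds to the 5% level.
    """
    return z_crit / np.sqrt(n)


# Synthetic example (assumed data): y is x delayed by 5 steps plus noise.
rng = np.random.default_rng(0)
n, true_lag = 500, 5
x = rng.standard_normal(n)
y = np.roll(x, true_lag) + 0.5 * rng.standard_normal(n)

corrs = lagged_correlations(x, y, max_lag=10)
best_lag = max(corrs, key=lambda k: abs(corrs[k]))
bound = significance_bound(n)

print("lag 0 correlation:", corrs[0])
print("best lag:", best_lag, "correlation:", corrs[best_lag])
print("approximate 5% bound on |r|:", bound)
```

In this sketch the two series are strongly related, yet the correlation at lag 0 says little about that relationship; scanning the time-shifted correlations against a significance bound is what exposes it. For autocorrelated series the white-noise bound is too narrow, which is one reason a bound derived from the asymptotic distribution of the correlation coefficient is needed.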