Data in the era of big data are diverse and complex, and data quality is critical to data mining because it directly determines the quality of the mining results; in practice, however, missing values and noisy data are unavoidable. To address this problem, we draw on the theory of spatial autocorrelation of spatial objects and on fuzzy set theory to propose a noise point detection algorithm for spatial data based on spatial autocorrelation and fuzzy sets. The algorithm first uses the spatial autocorrelation of neighbouring objects to compute the distance between a given object and the other objects in its neighbourhood. It then expresses this distance as a fuzzy membership degree. Finally, it compares the membership degree with the confidence level of the attribute to decide whether the object is noise. Both theoretical analysis and comparative experiments show that the algorithm is effective and feasible for detecting noise points in spatial data.
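The three steps named in the abstract (neighbourhood distance, fuzzy membership, comparison with a confidence level) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration and not the authors' method: the neighbourhood is taken to be the k nearest points in space, the distance is the median absolute attribute difference to those neighbours, the membership function is 1 / (1 + d / scale), and the names `detect_noise`, `k` and `alpha` are hypothetical.

```python
import math
from statistics import mean, median


def detect_noise(points, k=4, alpha=0.5):
    """Flag likely noise points in a list of (x, y, value) tuples.

    Returns (flags, memberships); flags[i] is True when point i is flagged.
    Illustrative sketch only: the neighbourhood, distance and membership
    definitions below are assumptions, not taken from the paper.
    """
    n = len(points)
    dists = []
    for i, (xi, yi, vi) in enumerate(points):
        # Step 1a (assumed): neighbourhood = k nearest points in space.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi),
        )[:k]
        # Step 1b (assumed): distance = median absolute attribute
        # difference between the point and its neighbours.
        if neighbours:
            d = median(abs(vi - points[j][2]) for j in neighbours)
        else:
            d = 0.0
        dists.append(d)

    # Normalise by the mean distance so membership does not depend on units;
    # fall back to 1.0 when every distance is zero.
    scale = (mean(dists) if dists else 0.0) or 1.0

    # Step 2 (assumed): membership in the set "consistent with neighbours",
    # equal to 1 at distance 0 and decreasing towards 0 as distance grows.
    memberships = [1.0 / (1.0 + d / scale) for d in dists]

    # Step 3: compare membership with the confidence level alpha.
    flags = [mu < alpha for mu in memberships]
    return flags, memberships


if __name__ == "__main__":
    # 5 x 5 grid of value 10.0 with one deviating value at the centre.
    grid = [(x, y, 50.0 if (x, y) == (2, 2) else 10.0)
            for x in range(5) for y in range(5)]
    flags, _ = detect_noise(grid)
    print([grid[i][:2] for i, f in enumerate(flags) if f])
```

Using the median over the neighbourhood, instead of the mean, is a design choice of this sketch: it keeps a single deviating point from also dragging its regular neighbours below the threshold.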