度量空间数据管理分析方法把数据抽象成度量空间中的点,具有高度的通用性,是应对大数据多样性挑战的有效手段之一.由于度量空间没有坐标,很多数学工具无法直接使用,一般以数据到参考点(也称作支撑点)的距离作为坐标.支撑点的好坏对于度量空间数据管理分析的性能发挥着关键性的影响.最远优先遍历(Farthest First Traversal,FFT)可以选出数据拐角的点,具有线性的时间复杂度和空间复杂度,是使用最广泛的支撑点选取算法之一.但是,实验表明最好的支撑点往往不是最拐角的点,故FFT很难选出最好的支撑点.提出近期最远遍历(Recent Farthest Traversal,RFT)算法,只以近期的几个支撑点来选择下一个支撑点,能够更快地选出性能更优的支撑点.同时,实验表明FFT还可以在数据内部均匀抽样.提出支撑点集合选择算法(Pivot Set Selection,PSS),可以一次性选出所有支撑点.以RFT选择候选集,以FFT选择评价集,选出支撑点并构建相似性索引,PSS使得索引构建代价大大降低,索引性能得到一定提升.实验表明,RFT选出好的支撑点的速度远快于FFT,准确率高于FFT,而FFT的抽样效果良好.
Metric-space data management and analysis method abstracts data to the points of metric space has a high generality,and is one of effective approaches for the variety challenge of big data. Since coordinates do not exist in metric space, many mathematical tools can't be used directly and it is common to consider the distances from a data object to pre-selected reference points(or called pivots) as its coordinates. The quality of pivots is critical to the performance of data management and analysis in metric space. Farthest First Traversal(FFT) algorithm is able to select corners of the data set with linear time and space complexity, and is one of the most widely used pivot selection methods. However,experiments show that the best pivots are usually not the most corners, so it is very hard to select best pivots by FFT. This work proposes the Recent Farthest Traversal(RFT) algorithm. It selects next pivot only based on several recent pivots and is able to select better pivots faster. At the same time, experiments also show that FFT can sample points evenly in the data set. This work also proposes the Pivot Set Selection(PSS) algorithm, which is able to get all pivots just with one selection. With exploiting RFT to select a candidate set and exploiting FFT to sample an evaluation set, PSS can select a better set of pivots to create a similarity index. As a result, the construction cost of similarity index is greatly reduced and the index performance has some improvements. Experiments show that RFT selects good pivots far more rapidly than FFT and the accuracy of RFT is higher than FFT,while FFT generates a good sample of data.