东篱科研大数据发现系统（DRDS）

位置：成果数据库 > 期刊 > 期刊详情页

基于近期最远遍历的支撑点选择

ISSN号：0469-5097
期刊名称：《南京大学学报：自然科学版》
时间：0
分类：TP311[自动化与计算机技术—计算机软件与理论;自动化与计算机技术—计算机科学与技术]
作者机构：[1]中国科学技术大学计算机科学与技术学院,合肥230026, [2]深圳大学计算机与软件学院广东省普及型高性能计算机重点实验室,深圳518060
相关基金：国家863项目（2015AA015305）,国家自然科学基金（U1301252,61471243）,广东省重点实验室项目（2012A061400024）,深圳市基础研究项目（JCYJ20140418095735561,JCYJ20150731160834611,JCYJ20150625101524056）,深港创新圈项目（SGLH20131010163759789）,广东省教育厅项目（2015KQNCX143）

关键词：度量空间, 多样性, 支撑点选择, 大数据, metric space, variety, pivot selection, big data

中文摘要：

度量空间数据管理分析方法把数据抽象成度量空间中的点,具有高度的通用性,是应对大数据多样性挑战的有效手段之一.由于度量空间没有坐标,很多数学工具无法直接使用,一般以数据到参考点（也称作支撑点）的距离作为坐标.支撑点的好坏对于度量空间数据管理分析的性能发挥着关键性的影响.最远优先遍历（Farthest First Traversal,FFT）可以选出数据拐角的点,具有线性的时间复杂度和空间复杂度,是使用最广泛的支撑点选取算法之一.但是,实验表明最好的支撑点往往不是最拐角的点,故FFT很难选出最好的支撑点.提出近期最远遍历（Recent Farthest Traversal,RFT）算法,只以近期的几个支撑点来选择下一个支撑点,能够更快地选出性能更优的支撑点.同时,实验表明FFT还可以在数据内部均匀抽样.提出支撑点集合选择算法（Pivot Set Selection,PSS）,可以一次性选出所有支撑点.以RFT选择候选集,以FFT选择评价集,选出支撑点并构建相似性索引,PSS使得索引构建代价大大降低,索引性能得到一定提升.实验表明,RFT选出好的支撑点的速度远快于FFT,准确率高于FFT,而FFT的抽样效果良好.

英文摘要：

Metric-space data management and analysis method abstracts data to the points of metric space has a high generality,and is one of effective approaches for the variety challenge of big data. Since coordinates do not exist in metric space, many mathematical tools can＇t be used directly and it is common to consider the distances from a data object to pre-selected reference points（or called pivots） as its coordinates. The quality of pivots is critical to the performance of data management and analysis in metric space. Farthest First Traversal（FFT） algorithm is able to select corners of the data set with linear time and space complexity, and is one of the most widely used pivot selection methods. However,experiments show that the best pivots are usually not the most corners, so it is very hard to select best pivots by FFT. This work proposes the Recent Farthest Traversal（RFT） algorithm. It selects next pivot only based on several recent pivots and is able to select better pivots faster. At the same time, experiments also show that FFT can sample points evenly in the data set. This work also proposes the Pivot Set Selection（PSS） algorithm, which is able to get all pivots just with one selection. With exploiting RFT to select a candidate set and exploiting FFT to sample an evaluation set, PSS can select a better set of pivots to create a similarity index. As a result, the construction cost of similarity index is greatly reduced and the index performance has some improvements. Experiments show that RFT selects good pivots far more rapidly than FFT and the accuracy of RFT is higher than FFT,while FFT generates a good sample of data.

同期刊论文项目