提出一种在网格环境下的k近邻查询方法——GkNN.到目前为止,尚未有文献提出数据网格环境下的k近邻查询算法.当用户在查询节点提交一个查询向量和k,首先以一个较小的查询半径。在数据节点进行基于双重距离尺度的向量缩减,然后将缩减后的向量按照向量“打包”传输的方式发送到执行节点,在执行节点并行地对这些候选向量进行距离(求精)运算.最终将结果向量返回到查询节点.当返回的向量个数小于k时,扩大半径值,继续循环直到得到k个最近邻向量为止.理论分析和实验证明该方法在减少网络通信开销、增加I/O和CPU并行、降低-向应时间方面具有较好的性能,非常适合海量高维数据的查询.
Proposed in this paper is a novel k-nearest neighbor query algorithm based on data grid, called the GkNN. Three steps are made in the GkNN. First, when user submits a query vector and k, the vector reduction is performed using DDM index. Then the candidate vectors are transferred to the execution nodes by using vector package technique. Furthermore, the refinement process is conducted in parallelism to get the answer set of the candidate vectors. Finally, the answer set is transferred to the query node. The proposed algorithm uses vector reduction algorithm, vector package technique and pipelined parallelism to solve the problem of heterogeneity of network bandwidth between nodes on the data grid. The analysis and experimental results show that the performance of the algorithm is good in minimizing the response time by decreasing network transmission cost and increasing parallelism of I/O and CPU.