To address the performance problems caused by accessing data through non-primary-key columns in a distributed storage system, this paper examines the key techniques for implementing indexes on such systems. Based on a thorough analysis of the characteristics of distributed storage, the key points in the design and implementation of a distributed index are identified. Drawing on these characteristics and on related indexing techniques, the organization of the index, index maintenance, and data consistency are discussed. Building on this analysis, a distributed indexing mechanism was designed and implemented on the open-source version of the distributed database system OceanBase, and its performance was tested with the benchmarking tool YCSB. The experimental results show that, although a secondary (auxiliary) index does affect system performance, the performance impact is kept within 5% across different data scales because the design takes the system and storage characteristics into account. In addition, storing redundant columns in the index can further improve index performance by 100%.
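The redundant-column idea mentioned above can be illustrated with a minimal, single-process sketch. It is not the OceanBase implementation: the `Table` class, its `put` and `query` methods, and the `base_reads` counter are hypothetical names introduced only for illustration, and the sketch ignores distribution, concurrency, and consistency entirely. It assumes a secondary index that maps an indexed column value to primary keys and, optionally, stores copies of selected columns so that a query touching only those columns never reads the base table.

```python
from collections import defaultdict


class Table:
    """Toy in-memory table with one secondary index (hypothetical, for illustration)."""

    def __init__(self, index_col, redundant_cols=()):
        self.rows = {}                       # primary key -> full row
        self.index_col = index_col           # non-primary-key column being indexed
        self.redundant_cols = tuple(redundant_cols)
        # indexed value -> {primary key: copies of the redundant columns}
        self.index = defaultdict(dict)
        self.base_reads = 0                  # lookups that had to go back to the base table

    def put(self, pk, row):
        """Insert or update a row and keep the secondary index in step with it."""
        old = self.rows.get(pk)
        if old is not None:
            # Remove the stale index entry left by the previous version of the row.
            entries = self.index[old[self.index_col]]
            entries.pop(pk, None)
            if not entries:
                del self.index[old[self.index_col]]
        self.rows[pk] = dict(row)
        self.index[row[self.index_col]][pk] = {c: row[c] for c in self.redundant_cols}

    def query(self, value, cols):
        """Return the requested columns of every row whose indexed column equals value."""
        results = []
        for pk, stored in self.index.get(value, {}).items():
            if all(c in stored for c in cols):
                # The index entry already holds every requested column.
                results.append({c: stored[c] for c in cols})
            else:
                # Fall back to a second lookup in the base table.
                self.base_reads += 1
                base_row = self.rows[pk]
                results.append({c: base_row[c] for c in cols})
        return results
```

In this sketch, a table created as `Table("city", ["name"])` answers `query("Y", ["name"])` from the index alone, whereas `Table("city")` performs one base-table read per matching row. In a distributed system that second read may be a remote access, which is presumably where the benefit of redundant columns comes from; the cost is extra storage and extra work on every write.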