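As databases grow in scale, the complexity of their schemas keeps increasing, and complex schemas combined with a lack of documentation make databases harder to understand and operate. Most existing schema abstraction methods use the primary and foreign key information in relational tables to find the most important tables in the schema, and then use those tables to form a single-level schema summary. In real applications, the subjects of such schema summaries are not clear. This paper states the shortcomings of existing methods and then presents a method for generating multi-level schema abstractions for large-scale databases. The method first uses different types of community detection algorithms to partition the database schema into "communities", then uses a meta-clustering method to integrate these communities into the subject groups of the database, where each subject group represents one subject of the database. Finally, the subject groups are clustered further into subject group classes, and a label is selected for each subject group class to produce the multi-level schema abstraction. The effectiveness of the algorithm is verified on Freebase, an open-source large-scale database. The experiments show that the algorithm not only identifies the subjects of a large-scale database accurately, but also generates, according to those subjects, a multi-level schema abstraction that is easy to understand and helps users browse and query the database.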
The complexity of database schemas and the lack of documentation usually make databases difficult to use. Some existing solutions attempt to identify the most important tables based on foreign key relationships and use these tables as a summary of the database schema. However, in real-world scenarios, the schema summaries generated by these approaches may fail to capture the subjects of the databases. In this paper, we describe the limitations of the previous approaches and propose a principled method to summarize large-scale database schemas. First, we partition a database schema into communities using a number of community detection algorithms. Then, we integrate these results into a set of groups, each representing a subject. Finally, we cluster the subject groups into abstract domains to form a multi-level navigation structure. Our approach is evaluated on Freebase, a real-world large-scale database. The results show that our approach can identify subject groups precisely and that the generated abstract schema layers help users explore a database.
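The integration step, in which the outputs of several community detection algorithms are combined into subject groups, can be illustrated with a minimal sketch. The abstract does not specify the meta-clustering procedure, so the code below swaps in a simple co-association vote: two tables end up in the same group when they share a community in more than a given fraction of the input partitions. The function name `consensus_groups`, the `threshold` parameter, and the table names are all hypothetical and are not taken from the paper or from Freebase.

```python
from collections import defaultdict
from itertools import combinations


def consensus_groups(partitions, threshold=0.5):
    """Group tables that co-occur in more than `threshold` of the partitions.

    `partitions` is a list of partitions; each partition is a list of sets of
    table names, as a community detection algorithm might produce.
    This is an illustrative stand-in, not the paper's meta-clustering method.
    """
    tables = sorted({t for p in partitions for community in p for t in community})

    # Count in how many partitions each pair of tables shares a community.
    co_counts = defaultdict(int)
    for partition in partitions:
        for community in partition:
            for a, b in combinations(sorted(community), 2):
                co_counts[(a, b)] += 1

    # Union-find over tables; merge pairs whose vote share exceeds the threshold.
    parent = {t: t for t in tables}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in co_counts.items():
        if count / len(partitions) > threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for t in tables:
        groups[find(t)].add(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical outputs of three different community detection runs.
partitions = [
    [{"film", "actor", "director"}, {"album", "artist", "track"}],
    [{"film", "actor"}, {"director", "album"}, {"artist", "track"}],
    [{"film", "actor", "director"}, {"album", "artist"}, {"track"}],
]

for group in consensus_groups(partitions):
    print(group)
```

In this toy input the three runs disagree on where `director` and `track` belong, and the vote settles each of them into the group that most runs support. The final stage described above, clustering subject groups into higher-level domains and labeling them, is not covered by this sketch.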