Automatic optimization of query plans is a current focus of research on self-managing, self-tuning databases. To keep the optimizer's estimates accurate, this paper proposes COCA (entropy-COrrelated-Coefficient-based Auto-detection of association), a new statistical algorithm that automatically detects associations between columns using the entropy correlation coefficient. Compared with CORDS, which relies on the chi-square test, COCA has two advantages: (1) Fewer restrictions. It is free of the chi-square test's frequency requirement, under which the test is reliable only when at least 80% of the cells in the contingency table have frequencies greater than 5. (2) More results. CORDS only determines whether columns are associated, whereas COCA also computes the degree of association in both directions. Experiments show that COCA is more robust and produces more statistical information, which supports building histograms more efficiently and accurately in later steps.
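The abstract does not state the formula behind COCA's two-directional association degree. As a hedged illustration only, the sketch below uses one common entropy-based directional measure, the uncertainty coefficient U(Y|X) = I(X;Y) / H(Y), which may differ from the paper's exact definition. The function names `entropy` and `directional_association`, and the choice to return 0.0 for a constant column, are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a frequency table given as an iterable of counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def directional_association(xs, ys):
    """Return (U(Y|X), U(X|Y)) for two equally long columns of categorical values.

    U(Y|X) is the fraction of the uncertainty in Y that is removed by knowing X.
    Assumption for this sketch: a constant column (zero entropy) scores 0.0.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same number of rows")
    if not xs:
        raise ValueError("columns must not be empty")

    h_x = entropy(Counter(xs).values())
    h_y = entropy(Counter(ys).values())
    h_xy = entropy(Counter(zip(xs, ys)).values())  # joint entropy H(X, Y)

    mutual_info = h_x + h_y - h_xy  # I(X; Y)
    u_y_given_x = mutual_info / h_y if h_y > 0 else 0.0
    u_x_given_y = mutual_info / h_x if h_x > 0 else 0.0
    return u_y_given_x, u_x_given_y


# Hypothetical sample: city determines country, but not the other way round.
cities = ["Paris", "Lyon", "Berlin", "Munich"]
countries = ["FR", "FR", "DE", "DE"]
print(directional_association(cities, countries))
```

On this toy sample the city column fully determines the country column (score 1.0), while the country column only halves the uncertainty about the city (score 0.5). A chi-square test would report only that the two columns are associated, and with one row per cell it would not meet the frequency condition described above.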