分布式系统技术为采用低成本购建高性能系统提供了有效的途径,但是由于资源的分配与需求可能产生冲突,造成系统中发生死锁,导致系统运行陷入停滞.在不可靠的分布式系统中,故障会干扰正常的死锁检测,但现有的死锁检测算法不具有容错功能.对失效形式进行了归类,提出一个容错的死锁检测解除算法.算法建立在通用的AND—OR模型基础上,采用扩散计算和集中规约方式,不仅能够检测到死锁,而且能给出死锁环的全部成员.若死锁拓扑处于静态且为环状,算法的消息复杂度的上限为e+n-1,时间复杂度为d,其中e为死锁等待图中边的个数,n和d为构成死锁环的节点的个数,分析表明算法性能等于或优于同类算法.
Distributed system provides an effective method to build high performance computing system with relative low cost. During the running of distributed computing, deadlock would happen if the resource allocation and requirement confliction occur. All the processes are waiting each other for the holding resources releasing, and then the system running stops. In an unreliable distributed system, failures may prevent deadlock detection algorithms from properly detecting deadlocks. Few of the algorithms proposed in the literatures address the issue of handing these failures. In this paper, three types of failures are identified and a fault tolerance deadlock detection and resolution algorithm is proposed. Failures are treated as different detection termination conditions in the algorithm. The algorithm is based on a generalized AND-OR model and employs diffusion computation and centralized reduction methods to detect deadlocks. The algorithm distinguishes cycles and knots and gives all members of a cycle. The upper limit of the time and space complexity is d and e + n - 1 respectively if the deadlock topology is a static cycle, where e is the number of the edge and n and d are the number of the nodes in the wait-for graph. The performance of the proposed algorithm is equal to or better than that Of the current algorithms.