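To meet the reliability requirements of grid computing, a fault-tolerance framework for grid computing is proposed. The framework covers two aspects: grid fault detection and grid fault handling. It ensures the reliability of grid computing by providing hierarchical fault detection and general, policy-based fault handling. The fault detection services are organized hierarchically: at the bottom layer, local fault detectors collect information about the monitored objects and send it to the middle-layer data collectors, which forward that information in list form to the top-layer data collectors. When a fault detector detects a runtime fault, a decision-analysis method is used to provide flexible fault handling. A performance evaluation of the system shows that the proposed general grid fault-tolerance framework offers good scalability, high efficiency, and low additional overhead.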
A general fault-tolerance framework for grid computing is proposed to meet the requirements of reliable grid computing. It combines hierarchically structured fault detection services with a policy-based fault-handling method. At the bottom of the fault detection service is the local fault detector, which monitors the objects in its local area and sends heartbeat messages to the middle data collector; the middle data collector sends the status list of the monitored objects to the top data collectors at a specific interval; the top data collector is managed by an index server. When a fault is detected, the system chooses an appropriate fault-handling method, such as checkpointing, retrying, or replication. The results of the performance evaluation show that the framework is scalable and efficient and incurs low overhead.
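
The flow described above (heartbeats from a local detector, a status list forwarded upward, then a policy lookup to pick a handler) can be illustrated with a minimal Python sketch. It is not the framework's implementation: the class names, the timeout value, the "alive"/"suspected" labels, and the mapping from task kinds to handlers are all assumptions made for illustration, and the index server is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MiddleCollector:
    """Middle layer: records heartbeats and builds a status list."""
    timeout: float  # assumed: seconds of silence before an object is suspected
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    def receive_heartbeat(self, obj_id: str, now: float) -> None:
        self.last_heartbeat[obj_id] = now

    def status_list(self, now: float) -> List[Tuple[str, str]]:
        # One (object, status) entry per monitored object, sorted by id.
        return [
            (obj_id, "alive" if now - seen <= self.timeout else "suspected")
            for obj_id, seen in sorted(self.last_heartbeat.items())
        ]


@dataclass
class LocalDetector:
    """Bottom layer: polls local objects and sends heartbeats upward."""
    collector: MiddleCollector
    objects: Dict[str, bool] = field(default_factory=dict)  # id -> responding?

    def poll(self, now: float) -> None:
        for obj_id, responding in self.objects.items():
            if responding:
                self.collector.receive_heartbeat(obj_id, now)


@dataclass
class TopCollector:
    """Top layer: merges status lists received from middle collectors."""
    statuses: Dict[str, str] = field(default_factory=dict)

    def receive_status_list(self, status_list: List[Tuple[str, str]]) -> None:
        self.statuses.update(status_list)

    def suspected(self) -> List[str]:
        return sorted(o for o, s in self.statuses.items() if s == "suspected")


# Hypothetical policy table: the task kinds on the left are invented here;
# only the handler names on the right come from the abstract.
POLICIES = {
    "long_running": "checkpointing",
    "idempotent": "retrying",
    "critical": "replication",
}


def choose_handler(task_kind: str) -> str:
    """Pick a fault-handling method; fall back to retrying (assumed default)."""
    return POLICIES.get(task_kind, "retrying")


def demo() -> Tuple[List[str], str]:
    middle = MiddleCollector(timeout=3.0)
    local = LocalDetector(middle, {"node-a": True, "node-b": True})
    top = TopCollector()

    local.poll(now=0.0)               # both objects send a heartbeat
    local.objects["node-b"] = False   # node-b stops responding
    local.poll(now=2.0)
    local.poll(now=4.0)

    top.receive_status_list(middle.status_list(now=4.0))
    return top.suspected(), choose_handler("long_running")
```

Timestamps are passed in explicitly rather than read from a clock so that the example is deterministic; in the scenario in `demo`, `node-b` has been silent longer than the assumed timeout when the status list is built, so it is the only object reported as suspected.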