To ensure that the supercomputing environment built under the Chinese Academy of Sciences "11th Five-Year Plan" major informatization project provides stable and reliable service, this paper proposes a fault-tolerance framework for the three-layer supercomputing environment. The work covers two parts, the reliability of the computing environment and the reliability of the compute nodes, and studies three main aspects: job reliability, service reliability, and grid node reliability. On this basis, a reliability solution for the three-layer supercomputing environment is proposed and implemented. The framework focuses on the impact of single points of failure on the environment, ensuring that the system continues to provide highly available high-performance computing service after a single point of failure occurs.