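As parallel computer systems grow in scale, system availability faces major challenges, and quantitative evaluation of the availability of massively parallel computer systems provides strong support for system analysis and design. Based on the tasks and the fault-tolerance strategies adopted, user-oriented availability models of a parallel computer system are built with stochastic activity networks for two different examples. Built on node and network modules, the models describe task execution in detail and use the ratio of useful work during execution as the availability measure. Finally, the models are solved and analyzed with actual data. Different applications on the same system may present users with markedly different availability characteristics, and the user-oriented availability models can quantify this difference fairly precisely.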
Parallel computer systems continue to grow in scale, which poses great challenges to the dependability of both the system and its tasks. Availability encompasses reliability and serviceability, and is therefore the core measure of a massively parallel computer system's ability to deliver correct service. Quantitative evaluation of the availability of massively parallel computer systems is thus significant for system analysis and design. In this paper, user-oriented availability models of a parallel computer system, which take task characteristics and fault-tolerance strategy into account, are established with stochastic activity networks for two different examples: a capability computing application with frequent communication among nodes, and a capacity computing application without communication. Built on a node module and a network module, the models describe the running states of tasks and use the useful work ratio as the availability measure. The models cover the main factors that influence the availability of a parallel computer system, including failures, hierarchical fault tolerance, fault detection, application characteristics, repair strategy, and fault coverage ratio. The models are then solved and analyzed with actual data. They can quantitatively evaluate user-oriented availability, especially the differences that arise when different tasks run on the same parallel computer system.
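To make the useful work ratio concrete, the following minimal sketch computes it as the share of elapsed time spent on work that contributes to the final result. The decomposition of elapsed time into four parts, the function name `useful_work_ratio`, and all numbers are assumptions made for illustration; they are neither the paper's stochastic activity network model nor its measured data.

```python
def useful_work_ratio(useful_time, overhead_time, lost_time, down_time):
    """Return the fraction of elapsed time spent on useful work.

    Assumed (hypothetical) decomposition of elapsed time:
      useful_time   - computation that contributes to the final result
      overhead_time - fault-tolerance overhead (e.g. saving state)
      lost_time     - work that had to be redone after a failure
      down_time     - time waiting for detection, repair, or restart
    """
    total = useful_time + overhead_time + lost_time + down_time
    if total <= 0:
        raise ValueError("total elapsed time must be positive")
    return useful_time / total


# Illustrative numbers only (hours), not taken from the paper.
# A tightly coupled job loses more work per failure than independent jobs.
capability = useful_work_ratio(useful_time=900.0, overhead_time=40.0,
                               lost_time=50.0, down_time=10.0)
capacity = useful_work_ratio(useful_time=960.0, overhead_time=20.0,
                             lost_time=15.0, down_time=5.0)

print(f"capability application: {capability:.3f}")
print(f"capacity application:   {capacity:.3f}")
```

Under these assumed inputs, the same machine yields different ratios for the two application types, which is the kind of user-visible difference the models are intended to quantify.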