高能物理数据由物理事例组成,事例之间没有相关性。可以通过大量作业同时处理大量不同的数据文件,从而实现高能物理计算任务的并行化,因此高能物理计算是典型的高吞吐量计算场景。高能所计算集群使用开源的TORQUE/Maui进行资源管理及作业调度,并通过将集群资源划分成不同队列以及限制用户最大运行作业数来保证公平性,然而这也导致了集群整体资源利用率非常低下。SLURM和HTCondor都是近年来流行的开源资源管理系统,前者拥有丰富的作业调度策略,后者非常适合高吞吐量计算,二者都能够替代老旧、缺乏维护的TORQUE/Maui,都是管理计算集群资源的可行方案。在SLURM和HTCondor测试集群上模拟大亚湾实验用户的作业提交行为,对SLURM和HTCondor的资源分配行为和效率进行了测试,并与相同作业在高能物理研究所TORQUE/Maui集群上的实际调度结果进行了对比,分析了SLURM及HTCondor的优势和不足,探讨了使用SLURM或HTCondor管理高能物理研究所计算集群的可行性。
High energy physics data consist of physics events that are not correlated with one another. A high energy physics computing task can therefore be parallelized by running many jobs that process different data files simultaneously, which makes high energy physics computing a typical high-throughput computing scenario. The computing cluster at the Institute of High Energy Physics (IHEP) uses the open-source TORQUE/Maui for resource management and job scheduling. IHEP enforces fairness by dividing the cluster's resources into separate queues and limiting the maximum number of running jobs per user. However, this also leads to very low overall resource utilization of the cluster. SLURM and HTCondor are both open-source resource management systems that have become popular in recent years: SLURM offers a rich set of job scheduling policies, while HTCondor is well suited to high-throughput computing. Both can replace the aging and poorly maintained TORQUE/Maui, and both are feasible solutions for managing computing cluster resources. In this paper, the job submission behavior of Daya Bay experiment users was simulated on SLURM and HTCondor test clusters to test the resource allocation behavior and efficiency of the two systems. The results were then compared with the actual scheduling results of the same jobs on the IHEP TORQUE/Maui cluster. Finally, the strengths and weaknesses of SLURM and HTCondor were analyzed, and the feasibility of using SLURM or HTCondor to manage the IHEP computing cluster was discussed.