CARMEN:
A General-Purpose Computing Facility for
Scientific Research Using OSCAR

B. Becker (a), M. J. N. Horner (b)

(a) UCT-CERN Research Centre, Department of Physics, University of Cape Town, Rondebosch 7701, Cape Town, South Africa
(b) Currently based at Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA, USA
Abstract

Many fields of scientific research today require large amounts of computing power
in order to simulate or analyse large data sets. The UCT-CERN Research Centre
has used OSCAR to build a general-purpose computing facility in order to fulfill its
commitment to the ALICE experiment at CERN. The system, dubbed CARMEN 1,
has been found to be general enough to become an established tool for a diverse
group of scientists, ranging from ultra-relativistic heavy-ion physics to trauma-unit
medical imaging.
1 Introduction
The UCT-CERN Research Centre 2 at the University of Cape Town, South Africa,
is involved in aspects of ultra-relativistic heavy-ion nuclear physics which range
from theoretical physics to software engineering and data analysis for current and
future heavy-ion collider experiments. The purpose of this field of research is
to explore Quantum Chromodynamics (QCD), the fundamental theory of strong
interactions which governs nuclear behaviour, in a novel regime of extremely high
temperature and pressure similar to the conditions thought to have prevailed at
the very earliest stages of the universe. It is predicted that in such a regime the
fundamental constituents of nuclear matter, which are normally bound inside stable
nuclei, become free and form a deconfined state, the Quark-Gluon Plasma (QGP).
The combined knowledge gained from current experiments and studies of the interactions
of all particles with various materials has allowed the community to build
detailed simulations of the expected physics environments in future experiments as
well as simulate and test the design of future detectors. Particle physics and related
1 Cluster for African Research in Massive Energetic Nuclei
2 UCT-CERN Research Centre home page at http://hep.phy.uct.ac.za
Preprint submitted to OSCAR symposium at HPC04, Winnipeg 2 November 2004
fields involve many random processes and are thus well suited for Monte-Carlo simulations.
A number of standard packages exist which are used to predict the number
and spectra of particles created in heavy-ion collisions and to model the response
of various detectors to these particles. One of the most popular, HIJING 3, predicts
10^4 - 10^5 particles produced per simulated collision of PbPb nuclei (hereafter
referred to as “events”) at 5.5 TeV 4.
Usually, in order for a Monte-Carlo study not to be limited by statistical noise, one
needs of the order of 10^4 such events. Typical simulations of expected physics and
detector response produce data files of about 1 GB. Required data samples thus
reach sizes in the TB range, and for systematic studies of the model dependence
of predictions this increases by 2-3 orders of magnitude. In addition, the ALICE
experiment 5 at the LHC is designed to record an average of about 1-2 PB of data per
ALICE year 6 . To ensure that efficiencies are calculated properly and to eliminate
false signals, the real data set must be complemented by a simulated sample of
similar size. Of course, the data then has to be processed and analysed, which
requires enormous computing power. It is for this reason that the ALICE Data
Grid is being implemented, to provide distributed storage and computing
for the ALICE experiment.
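A back-of-the-envelope calculation makes these volumes concrete. The sketch below uses only the figures quoted above (10^4 events, ~1 GB per event, a factor of 10^2-10^3 for systematics); the arithmetic is illustrative, not taken from the original study:

```python
# Rough estimate of the simulated data volumes quoted in the text.

events_needed = 1e4        # events needed so statistics do not limit the study
gb_per_event_set = 1.0     # ~1 GB of simulation output per event

base_sample_tb = events_needed * gb_per_event_set / 1024
print(f"base MC sample: ~{base_sample_tb:.0f} TB")              # ~10 TB

# Systematic studies of model dependence scale this by 2-3 orders of
# magnitude, approaching the 1-2 PB/year that ALICE itself will record.
systematics_pb = base_sample_tb * 100 / 1024
print(f"with systematics (x100): ~{systematics_pb:.1f} PB")     # ~1.0 PB
```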
It was decided in 2002 that a computing centre would have to be built in order to
facilitate the work at the Centre. This facility had two goals. The primary, long-term
goal was to become a Tier2 7 centre of the ALICE Data Grid. A secondary goal was
to provide the users at the UCT-CERN Research Centre with a general computing
facility in order to perform simulations of heavy-ion collisions and of theoretical
models of such events [17,4,5].
After a brief investigation into the field of high-performance computing, a Linux-cluster
setup was decided upon. The reasons for this choice were primarily:
• the high performance-to-cost ratio, using COTS 8 components and open-source
software, given the fairly restricted funding sources;
• the existence of a set of tools already developed in the market of cluster computing;
• the potential for customization of the system, given the open-source nature of
most tools; and
• the potential for using the final system as an instructive teaching environment for
students.
3 HIJING: Heavy Ion Jet INteraction Generator
4 nominal values quoted for Large Hadron Collider energies
5 ALICE: A Large Ion Collider Experiment - the dedicated heavy-ion experiment
at the Large Hadron Collider, CERN
6 The LHC will only collide heavy ions for about 1 month of the year, with the rest of
the time dedicated to other species
7 The ALICE Data Grid is composed of hierarchical Tiers. Tier0 is CERN itself, where
the data is stored initially. Tier1 sites are major computing centres where copies of all the
data are stored, while Tier2 sites are universities and smaller labs around the world, where
processing power is concentrated.
8 Commodity Off The Shelf
Due to our close collaboration with CERN9 , which officially uses Red Hat Linux,
we decided on this distribution of the Linux operating system.
2 Design Considerations and Requirements
Apart from financial constraints, the design of the computing centre was founded
on the basis of our primary and secondary goals, which translated into a need for a
specific solution as well as a more general one.
1 The need for a specific solution
The work done at the UCT-CERN Research Centre for the ALICE experiment
had to be co-ordinated by the ALICE Data Grid and its computing environment,
AliEn. This required a specific solution, which translated roughly into the following
requirements:
• NFS-mounted, writable /home directories;
• large storage capacity, of the order of a TB;
• writable disks near to the CPUs 10 (i.e. local storage as well as network-attached
storage); and
• a reasonably fast network connection on the cluster subnet and a stable network
connection of the server to the outside world.
2 The need for a general solution
In order to satisfy the diverse needs of the users at the Research Centre, the design
of the computing facility had to provide a scalable, flexible and cheap solution with
a simple interface to interactive use, as well as batch-queue systems. The possibility
of collaborative use with other research groups in the university also had to be left
open, which put a constraint on the level of optimization possible.
The above design considerations led us to settle on the OSCAR toolkit, since it
provided ready-made solutions to the problem of cluster creation and maintenance,
with tools for semi-automated server configuration, node imaging, networking,
administration[10] and monitoring, and interfaces to an array of distributed-memory[12]
and batch-queue applications.
9 Centre Européen pour la Recherche Nucléaire - the European Centre for Nuclear Physics
10 Typical disk I/O rates of MC simulations are about 100 MB/min per event, which
translates to a required bandwidth of 25 MB/s when all nodes are accessing NFS-mounted
volumes. This easily saturates a 10/100 Mbps subnet, hence the requirement to write
data to a local disk instead of an NFS-mounted one.
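The bandwidth argument in the footnote above can be checked in two lines. The sketch below compares the aggregate write rate quoted in the text with the best case a single fast-Ethernet server link can carry (the 8:1 bit-to-byte conversion is the only assumption added):

```python
# Why MC output is written to node-local disks rather than NFS volumes.

link_mbps = 100                  # 10/100 Ethernet uplink, in megabits per second
link_mb_per_s = link_mbps / 8    # best-case capacity in megabytes per second

aggregate_write_mb_per_s = 25    # aggregate MC write rate quoted in the text

print(f"NFS link capacity : {link_mb_per_s:.1f} MB/s")    # 12.5 MB/s
print("saturated" if aggregate_write_mb_per_s > link_mb_per_s else "ok")
```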
3 System Setup

1 Physical and Network Setup
The 20 nodes 11 are stacked on a 5-level rack, capable of holding 6 boxes per level plus 3
switches. Ethernet cables routed through the frame of the rack connect the nodes
to the server via a fast Ethernet switch capable of handling a total of 26
connections: 24 10/100 Mbps ports and 2 gigabit ports
to the backplane, which are currently unused.
The /home directories on the server are all NFS-mounted on the nodes, as is
a 500 GB partition on the RAID array. The latter acts as redundant mass storage
to which data generated on the nodes is copied and stored. This
external hardware RAID system is permanently mounted and in semi-permanent
use, which causes frequent disk failures 12. There are plans in the near future to
extend the mass storage to include a distributed, network-attached, fault-tolerant
storage system, “clusterRAID”.
2 Data Storage Capacity
In order to satisfy our requirement of large storage space, a SCSI RAID controller
with a capacity of 8 IDE disks was used as mass storage. The controller currently houses
100 GB disks, giving a usable storage space of 650 GB. This could be upgraded
easily by using higher-capacity disks. Provision was made for local data storage on
a /data partition on each node, which is used for staged-in files, temporary writing during long
jobs (auto-saving) and distributed data storage. The distributed disk space is about
1.4 TB, which along with the RAID brings our total storage space to about 2 TB 13.
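Totalling the figures quoted above confirms the ~2 TB number (a trivial check, using only the capacities stated in the text):

```python
# Sum of the storage capacities quoted in the text.
raid_gb = 650            # usable space on the hardware RAID array
distributed_tb = 1.4     # aggregate node-local /data space

total_tb = raid_gb / 1024 + distributed_tb
print(f"total storage: ~{total_tb:.1f} TB")   # ~2.0 TB
```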
4 Load-Balancing and Queuing
Simply stringing 20 computers together and calling it a cluster does not solve any
particular problem; the challenge lies in using the machines efficiently. For this purpose, the
batch-queue system OpenPBS comes shipped with OSCAR, as well as an open-source
scheduler, MAUI. We chose, however, to take advantage of PBSPro’s educational
license and fee waiver. This version of PBS is stable, scales well to very many
nodes, and allows many scheduler optimisations. Integration of PBSPro with other
OSCAR software was simple and painless.
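To illustrate how users typically submit work to such a batch-queue system, the sketch below generates a minimal PBS job script. The job name, executable, and resource requests are hypothetical, not taken from the cluster's actual configuration; only the `#PBS` directive syntax is standard PBS:

```python
# Hypothetical generator for a simple PBS job script.
# All names and paths below are illustrative.

def make_pbs_script(job_name, executable, n_events, walltime="12:00:00"):
    """Return the text of a minimal single-node PBS job script."""
    return "\n".join([
        "#!/bin/sh",
        f"#PBS -N {job_name}",            # job name shown in the queue
        f"#PBS -l walltime={walltime}",   # requested wall-clock limit
        "#PBS -l nodes=1",                # one node per MC job
        "# write output to the node-local /data partition, not NFS",
        "cd /data/$USER",
        f"{executable} --events {n_events}",
    ])

print(make_pbs_script("hijing_pbpb", "./simulate", 500))
```

In practice each of a large batch of independent Monte-Carlo jobs would get its own script, submitted with `qsub` and farmed out across the nodes by the scheduler.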
11 A full description of the nodes is available elsewhere.
12 It is not clear yet whether the disk failures are due to a batch of faulty disks, an incompatible
SCSI driver, or a faulty controller. The problem is under investigation.
13 This is the current situation; future upgrades to clusterRAID are not discussed yet.
5 Discussion - Applications of CARMEN
1 ROOT-based Code
Much of our work is high-statistics Monte-Carlo heavy-ion event generation, detector
simulation and analysis. The basis of most of the toolkits used for this work is an
object-oriented analysis framework developed at CERN, ROOT. This framework
is fast becoming the favorite tool amongst a diverse group of physicists, and many
experiment-specific toolkits have been built around it 14, in particular [21,16]. ROOT
provides a parallel analysis facility which integrates very well into a Linux-cluster
environment.
ROOT also provides common interfaces to different Monte-Carlo generators, for both
physics collisions and particle interactions in materials. Once the particle spectra
have been determined or predicted, these can be fed into fully detailed simulations of
the detectors, provided sufficient parametrisations of particle interactions in materials exist.
This allows complete design studies to be done at vastly reduced cost compared
to a full-blown prototype, and is not restricted to heavy-ion physics.
2 ALICE dimuon High-Level Trigger Prototyping
As part of our commitment to the ALICE experiment, the UCT-CERN Research
Centre is responsible for implementing the High-Level Trigger for the
dimuon arm (dHLT). This is a software and hardware system which interfaces with
the front-end electronics of the ALICE dimuon arm detector in order to calculate
on the fly whether or not an interesting signal has been detected in a particular event,
so as to reduce the data-to-tape cost of the experiment. It implements a
data transport framework developed at the University of Heidelberg and integrates
with a trigger algorithm which is implemented partly on FPGAs and partly
on the CPUs of cluster nodes. OSCAR has allowed us to quickly and easily build the
properly networked cluster needed to build the dHLT prototype.
3 LODOX Monte-Carlo Simulation
The first large-scale, non-heavy-ion Monte-Carlo study done on CARMEN was for the
Low Dosage X-Ray system (LODOX) 15. Using GEANT, a model of the imaging process of
this high-resolution, low-dosage scanning X-ray machine for use in medical trauma
units is being constructed. The purpose of an initial study in this project was to
predict the image quality for a given detector and source configuration. The study
was completed in about 2 months, with the user having only minimal knowledge
of Linux and scheduling systems. A similar study with a prototype machine would
have taken at least 4 months, and the costs would have been prohibitive.
14 See http://root.cern.ch/root/ExApplications.html
15 A joint project between LODOX and the Faculty of Biomedical Engineering at UCT
4 Other Applications
A system like CARMEN, built completely on open source software, lends itself
naturally to the widest possible use by researchers. Apart from the large projects
described above, some smaller projects have made use of our system. These are
mainly in the form of locally compiled custom code, usually Monte-Carlo based.
Most recently, a full study of core-softened fluid dynamics was completed in less
than 1000 hours of CPU time, including compilation and setup 16. A similar-sized
study would have taken of the order of 2 months on a single CPU. This one case
demonstrates the flexibility and usefulness of a system like CARMEN.
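The turnaround gain behind this comparison is simple to quantify. The sketch below assumes, for illustration only, that the jobs parallelise cleanly across all 20 nodes; only the 1000 CPU-hour figure comes from the text:

```python
# Turnaround estimate for the ~1000 CPU-hour fluid-dynamics study.

cpu_hours = 1000
n_nodes = 20      # assumption: all CARMEN nodes used, perfect parallelism

wall_hours_cluster = cpu_hours / n_nodes
days_single_cpu = cpu_hours / 24

print(f"cluster wall time : ~{wall_hours_cluster:.0f} hours")   # ~50 hours
print(f"single CPU        : ~{days_single_cpu:.0f} days")       # ~42 days
```

The ~42 days on a single CPU is consistent with the "order of 2 months" quoted above, once setup and contention are included.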
6 Conclusion

With the availability of OSCAR and a detailed simulation and analysis framework,
we have been able to quickly produce a computing solution to meet the needs of
our Centre. In addition to providing our own computing solution, it has drawn
together a diverse community of scientists and has become the proof-of-concept for a
much larger initiative to build a 1000-node computing farm in the near future, the
Centre for High Performance Computing in South Africa.
Acknowledgements

During the time that this work was done, the authors were supported by a
National Research Foundation grantholder-linked bursary. The cluster was built
with hardware bought with a UCT University Research Council Grant and a
grant from the National Research Foundation.
16 Again, the user started from the bare minimum of knowledge of PBS and Linux.

References

[1] ALICE Experiment Technical Proposal: CERN/LHCC-95-71, LHCC/P3.
[2] T. Matsui and H. Satz, Phys. Lett. B 178, 416 (1986).
[3] European Centre for Nuclear Research: http://cern.ch
[4] M. Gyulassy and X.-N. Wang, Comput. Phys. Commun. 83, 307 (1994); X.-N. Wang and M. Gyulassy, Phys. Rev. D 44, 3501 (1991).
[5] T. Sjöstrand et al., Comput. Phys. Commun. 135, 238 (2001).
[6] T. Naughton et al., Proceedings of “Commodity, High Performance Cluster Computing Technologies and Applications”, Orlando (FL), USA (2002); oscar.sourceforge.net
[7] AliEn (Alice Environment) system home page: http://alien.cern.ch
[8] System Installation Suite: http://sisuite.org/
[9] M. Brim, R. Flanery, A. Geist and S. Scott, “Parallel and Distributed Computing Practices”, DAPSYS Special Edition (2002).
[10] Ganglia home page: http://ganglia.sourceforge.net
[11] http://www.lam-mpi.org/, http://www-unix.mcs.anl.gov/mpi/mpich/
[12] http://www.openpbs.org/, http://www.supercluster.org
[13] A. Wiebalck, P. T. Breuer, V. Lindenstruth and T. M. Steinbeck, “Fault-tolerant Distributed Mass Storage for LHC Computing”, in Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), 12-15 May 2003, Tokyo, Japan, pp. 266-273.
[14] T. M. Steinbeck, “A Modular and Fault-Tolerant Data Transport Framework”.
[15] THERMUS: thesis of Spencer Wheaton (firstname.lastname@example.org).
[16] Red Hat Linux distribution home page: http://www.redhat.com
[17] R. Brun and F. Rademakers, Nucl. Instrum. Meth. A 389, 81 (1997).
[18] M. Ballantijn, G. Roland, R. Brun and F. Rademakers, “The PROOF Distributed Parallel Analysis Framework based on ROOT”, presented at “Computing in High Energy and Nuclear Physics” (CHEP03), La Jolla, CA, USA, March 2003.
[19] PHOBOS Experiment computing home page: http://www.phobos.bnl.gov/Phat/
[20] GEANT Detector Simulation and Description Tool: CERN Program Library Long Writeup W5013.
[21] P. Mausbach and H.-O. May, Fluid Phase Equilibria 213 (2003).