2.3 LIGO
The Einstein@Home data analysis application, which searches for gravitational radiation from spinning neutron stars using data from the Laser Interferometer Gravitational Wave Observatory (LIGO) detectors, was identified as an excellent candidate for production runs on OSG. Because the scientific merit of this particular search is virtually unbounded by the amount of computing applied to it, any available resource provides additional scientific benefit. The Einstein@Home code base evolved significantly during the first half of 2009 to support the use of Condor-G for job submission and management, removing the code's internal dependencies on the particular job manager in use at each grid site around the world. Several further modifications addressed stability, reliability, and performance. By May 2009, the code was running reliably in production on close to 20 OSG sites that support job submission from the LIGO Virtual Organization.
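As a rough illustration of the Condor-G submission path described above, the sketch below generates and submits a grid-universe job description from Python. The gatekeeper host, executable, and file names are placeholders invented for this example, not the actual Einstein@Home configuration.

    import subprocess

    # Hypothetical gatekeeper and file names; the real Einstein@Home runs use
    # the LIGO VO's own site list and payload.
    gatekeeper = "gatekeeper.example-osg-site.edu/jobmanager-condor"

    # A minimal Condor-G (grid universe) submit description.
    submit_description = "\n".join([
        "universe      = grid",
        f"grid_resource = gt2 {gatekeeper}",   # pre-WS Globus GRAM endpoint
        "executable    = einstein_search",
        "transfer_input_files = workunit_0001.dat",
        "output        = workunit_0001.out",
        "error         = workunit_0001.err",
        "log           = workunit_0001.log",
        "queue",
        "",
    ])

    with open("einstein_workunit.sub", "w") as handle:
        handle.write(submit_description)

    # condor_submit hands the job to Condor-G, which manages it on the remote site.
    subprocess.run(["condor_submit", "einstein_workunit.sub"], check=True)

Because Condor-G presents a single submit interface regardless of whether the remote site runs Condor, PBS, or another batch system behind its gatekeeper, the application code no longer needs to know which job manager a given site uses.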
Since late summer of 2009, the number of wall clock hours utilized by Einstein@Home on the OSG has nearly doubled from what was typically available in the first half of the year. Since January 1, 2009, the Open Science Grid has contributed more than 5 million wall clock hours to this important LIGO data analysis application (see the figure below). This growth is primarily due to the effort made to deploy and run the application on a larger number of OSG sites, and to work with local site administrators to understand local policies and how best to accommodate them in the running application.
Figure: OSG usage by LIGO's Einstein@Home application over the past year of running at production levels. The new code using the Condor-G job submission interface entered production in March 2009. The increases seen since early September are due to the growing number of sites compatible with the code.
In the past year, LIGO has also begun to investigate ways to migrate the data analysis workflows that search for gravitational radiation from binary black holes and neutron stars onto the Open Science Grid for production-scale use. The binary inspiral analyses typically involve working with tens of terabytes of data in a single workflow. Collaborating with the Pegasus Workflow Planner developers at USC-ISI, LIGO has identified changes to both Pegasus and the binary inspiral workflow codes to use the OSG more efficiently in cases where data must be moved from LIGO archives to storage resources near the worker nodes on OSG sites. One area of particular focus has been understanding and integrating the Storage Resource Management (SRM) technologies used at OSG Storage Element (SE) sites to house the vast amounts of data used by the binary inspiral workflows, so that worker nodes running the binary inspiral codes can access the data effectively. An SRM Storage Element has been established on the LIGO Caltech OSG integration testbed site, which has 120 CPU cores and approximately 30 terabytes of storage currently configured under SRM. The SE uses BeStMan with Hadoop providing the distributed file system shared among the worker nodes. Using Pegasus for workflow planning, binary inspiral workflows needing tens of terabytes of LIGO data have run successfully on this system. Effort is now underway to translate this success to OSG production sites that support SRM Storage Elements.
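As a minimal sketch of the data-staging step involved, the snippet below copies a single input file onto a site's SRM Storage Element before the workflow runs. The endpoints, paths, and the choice of the BeStMan srm-copy client are illustrative assumptions, not LIGO's actual configuration; in production the transfers are planned and inserted by Pegasus rather than scripted by hand.

    import subprocess

    # Hypothetical source archive and SRM endpoint; real LIGO archives and OSG
    # Storage Elements have their own URLs and namespace layouts.
    source_url = "gsiftp://archive.example.edu/frames/H-H1_RDS-9000.gwf"
    srm_url = ("srm://se.example-osg-site.edu:8443/srm/v2/server"
               "?SFN=/hadoop/ligo/frames/H-H1_RDS-9000.gwf")

    def stage_file(source: str, destination: str) -> None:
        """Copy one input file onto the site's SRM Storage Element.

        srm-copy is the BeStMan command-line client; a site could equally use
        srmcp or another SRM client depending on its storage stack.
        """
        subprocess.run(["srm-copy", source, destination], check=True)

    if __name__ == "__main__":
        stage_file(source_url, srm_url)
        # Once the data is on the SE, jobs on the worker nodes read it through
        # the shared Hadoop file system rather than over the wide-area network.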
LIGO has also been working closely with the OSG, DOE Grids, and ESnet to evaluate its authentication and authorization requirements within the LIGO Data Grid user community and how those requirements map onto the OSG security model.
2.4 ALICE
The ALICE experiment at the LHC relies on a mature grid framework, AliEn, to provide computing resources in a production environment for the simulation, reconstruction and analysis of physics data. Developed by the ALICE Collaboration, the framework has been fully operational for several years, deployed at ALICE and WLCG Grid resources worldwide. This past year, in conjunction with plans to deploy ALICE Grid resources in the US, the ALICE VO and OSG have begun the process of developing a model to integrate OSG resources into the ALICE Grid.
In early 2009, an ALICE-OSG joint task force was formed to support the inclusion of ALICE Grid activities in OSG. The task force developed a series of goals leading to a common understanding of the AliEn and OSG architectures. The OSG Security team reviewed and approved a proxy renewal procedure, common to ALICE Grid deployments, for use on OSG sites. A job-submission mechanism was implemented whereby an ALICE VO-box service deployed on the NERSC-PDSF OSG site submitted jobs to the PDSF cluster through the OSG interface. The submission mechanism was activated for ALICE production tasks and operated for several months. The task force validated ALICE's OSG usage through normal reporting means and verified that site operations were sufficiently stable for ALICE production tasks at low job rates and with minimal data requirements, as allowed by the available local resources.
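The proxy-renewal step mentioned above can be pictured with a short sketch. The loop below is a generic illustration of how a VO-box service might keep a grid proxy alive by refreshing it from a MyProxy server; it is not the specific procedure that was reviewed, and the server name, thresholds, and command options are assumptions made for the example.

    import subprocess
    import time

    # Illustrative values only; the ALICE VO-box uses its own renewal service.
    myproxy_server = "myproxy.example.org"
    min_lifetime = 6 * 3600     # renew when less than six hours remain
    check_interval = 30 * 60    # re-check every thirty minutes

    def proxy_time_left() -> int:
        """Return the remaining proxy lifetime in seconds (0 if none is valid)."""
        result = subprocess.run(["voms-proxy-info", "-timeleft"],
                                capture_output=True, text=True)
        try:
            return int(result.stdout.strip())
        except ValueError:
            return 0

    def renew_proxy() -> None:
        """Retrieve a fresh credential from the MyProxy server."""
        subprocess.run(["myproxy-logon", "-s", myproxy_server, "-n"], check=True)
        # (Attaching VO membership attributes would be an additional step,
        # omitted in this sketch.)

    while True:
        if proxy_time_left() < min_lifetime:
            renew_proxy()
        time.sleep(check_interval)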
The ALICE VO is now registered with OSG, has a representative in the OSG VO forum, and provides an agent to the OSG Registration Authority (OSG-RA) for issuing DOE Grid certificates to ALICE collaborators. ALICE use of OSG will grow as ALICE resources are deployed in the US. These resources will provide the data storage facilities needed to expand ALICE use of OSG and will add compute capacity on which the AliEn-OSG interface can be exercised at full ALICE production rates.
2.5 D0 at the Tevatron
The D0 experiment continues to rely heavily on OSG infrastructure and resources to meet its computing demands. D0 has successfully used OSG resources for many years and plans to continue this very successful relationship for the foreseeable future.
All D0 Monte Carlo simulation is generated at remote sites, with OSG continuing to be a major contributor. During the past year, OSG sites simulated 390 million events for D0, approximately one third of all production and an increase of 62% relative to the previous year. This increase is due to many factors: opportunistic use of storage elements has improved; the SAM infrastructure has been improved, resulting in increased OSG job efficiency; and many sites have modified their preemption policies, improving throughput for D0 jobs. Improved efficiency, increased resources (D0 used 24 sites in the past year and uses 21 regularly), automated job submission, resource selection services, and expeditious use of opportunistic computing continue to play a vital role in sustaining high event throughput. The total number of D0 Monte Carlo events produced on OSG over the past several years is 726 million (see the cumulative-production figure below).
Over the past year, the average number of Monte Carlo events produced per week on OSG has remained approximately constant. Since D0 uses these computing resources opportunistically, it is notable that an approximately constant MC event rate can be maintained on average. When additional resources become available, as in April and May of 2009 (visible in the events-per-week figure below), D0 is quickly able to take advantage of them. Over the past year production exceeded 10 million events/week on nine separate occasions, and a record of 13 million events/week was set in May 2009. With the turn-on of the LHC experiments, it will become more challenging to obtain computing resources opportunistically, so it is very important that D0 use the resources it can obtain efficiently. D0 therefore plans to keep working with OSG and Fermilab Computing to improve the efficiency of Monte Carlo production on OSG sites.
The primary processing of D0 data continues to run on OSG infrastructure. One of the experiment's important goals is for primary processing to keep pace with the rate of data collection, both so that any problems in the data are found quickly and so that no backlog of data accumulates. Typically D0 keeps up by reconstructing nearly 6 million events/day. When the accelerator collides at very high luminosities, however, it is difficult to keep pace using the standard resources. Because the processing farm and the analysis farm share the same infrastructure, D0 can increase its processing power quickly: during the past year, D0 moved some of its analysis nodes into the processing farm to increase throughput. The figure below shows the cumulative number of data events processed. The additional nodes were brought online in March, and the improvement in the number of events reconstructed is clearly visible. These nodes allowed D0 to finish processing its collected data quickly, after which the borrowed nodes were returned to the analysis cluster. The flexibility to move analysis nodes to processing nodes in times of need is a tremendous asset. Over the past year D0 has reconstructed nearly 2 billion events on OSG facilities.
OSG resources have allowed D0 to meet its computing requirements in both Monte Carlo production and data processing. This has contributed directly to D0 publishing 37 papers in 2009 (see http://www-d0.fnal.gov/d0_publications/#2009).
Figure: Cumulative number of D0 MC events generated by OSG during the past year.
Figure: Number of D0 MC events generated per week by OSG during the past year.
Figure: Cumulative number of D0 data events processed by OSG infrastructure. In March, additional nodes were added to the processing farm to increase its throughput. The flat region in August and September corresponds to the accelerator being down for maintenance, when no events needed to be processed.