We continue to work on, and make progress against, the challenges we face in meeting our longer-term goals:
- Dynamic sharing of tens of locally owned, used, and managed compute and storage resources, with minimal additional human effort (to use and administer) and limited negative impact on the communities owning them.
- Utilizing shared storage with groups other than the owner group – not only more difficult than (quantized) CPU-cycle sharing, but also less well supported by the available middleware.
- Federation of the local and community identity/authorization attributes within the OSG authorization infrastructure.
- The effort and testing required for inter-grid bridges involve significant costs, both initially and in continuous testing and upgrading. Ensuring correct, robust end-to-end reporting of information across such bridges remains fragile and human-effort intensive.
- Validation and analysis of availability and reliability testing, accounting, and monitoring information. Validation of this information is incomplete, needs continuous attention, and can be human-effort intensive.
- The scalability and robustness of the infrastructure have not yet reached the scales needed by the LHC for analysis operations in the out years. The US LHC software and computing leaders have indicated that OSG needs to provide roughly a factor of two improvement in interface performance over the next year or two, and that robustness to upgrades and configuration changes throughout the infrastructure needs to be improved.
- Full usage of available resources. A job “pull” architecture (e.g., the Pilot mechanism) provides higher throughput and better management than one based on static job queues, but we now need to move to the next level of effective usage (a schematic sketch of the pull model follows this list).
- Automated site selection capabilities are inadequately deployed and remain embryonic relative to the capabilities needed – especially when faced with the plethora of errors and faults that naturally result from a heterogeneous mix of resources and applications with greatly varying I/O, CPU, and data provision and requirements.
- User and customer frameworks are important for engaging non-physics communities in active use of grid computing technologies; for example, the structural biology community has ramped up its use of OSG, enabled via portals and community outreach and support.
- A common operations infrastructure across heterogeneous communities can be brittle. Efforts to improve the early detection of faults and problems before they impact the users help everyone.
- Analysis, assessment, and recommendations based on usage, performance, accounting, and monitoring information is a key need that requires dedicated and experienced effort.
- Transitioning students from the classroom to becoming users is possible but continues to be a challenge, limited in part by the effort OSG can dedicate to this activity.
- Use of new virtualization, multi-core, and job-parallelism techniques, and of scientific and commercial cloud computing. We have two new satellite projects funded in these areas: the first on High Throughput Parallel Computing (HTPC) on OSG resources, for an emerging class of applications that run large ensembles (hundreds to thousands) of modestly parallel (4- to ~64-way) jobs; the second a research project to perform application testing over the ESnet 100-Gigabit network prototype, using the storage and compute end-points supplied by the Magellan cloud computing resources at ANL and NERSC.
- Collaboration and partnership with TeraGrid, under the umbrella of a Joint Statement of Agreed Upon Principles signed in August 2009. New activities have been undertaken, including representation at one another’s management meetings, tests of submitting jobs to one another’s infrastructure, and exploration of how to accommodate the different resource access mechanisms of the two organizations.
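To make the contrast between the pull model and static job queues concrete, the sketch below shows the general pattern in schematic Python: a pilot lands on a worker node, repeatedly asks a central task queue for work, runs whatever it receives, and reports back. This is a minimal illustration only, not PanDA or any production pilot code; the queue URL, endpoint names, and payload format are hypothetical placeholders.

```python
"""Schematic illustration of a pilot-based "pull" workload model.

Not production code; the task-queue URL and payload format below are
hypothetical placeholders standing in for a real workload manager.
"""
import json
import subprocess
import time
import urllib.request

TASK_QUEUE_URL = "https://example.org/taskqueue"  # hypothetical endpoint


def fetch_task():
    """Ask the central queue for the next unit of work (None if idle)."""
    with urllib.request.urlopen(f"{TASK_QUEUE_URL}/next") as resp:
        body = resp.read()
    return json.loads(body) if body else None


def run_pilot(max_idle_polls=10, poll_interval=60):
    """Pull and execute tasks until the queue stays empty for a while."""
    idle = 0
    while idle < max_idle_polls:
        task = fetch_task()
        if task is None:
            idle += 1
            time.sleep(poll_interval)
            continue
        idle = 0
        # Execute the payload command shipped with the task description.
        result = subprocess.run(task["command"], shell=True)
        # Report the outcome so the queue can track completion.
        report = json.dumps({"task_id": task["id"],
                             "exit_code": result.returncode}).encode()
        urllib.request.urlopen(f"{TASK_QUEUE_URL}/report", data=report)


if __name__ == "__main__":
    run_pilot()
```

The key design point is that work is matched to a slot only after the pilot has already validated its local environment, which is what gives the pull approach its higher effective throughput compared with pushing jobs into static queues.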
These challenges are not unique to OSG. Other communities are facing similar challenges in educating new entrants to advance their science through large-scale distributed computing resources.
2. Contributions to Science
2.1 ATLAS
In preparation for the startup of the LHC collider at CERN in December, the ATLAS collaboration performed computing challenges of increasing complexity and scale. Monte Carlo production is ongoing with some 50,000 concurrent jobs worldwide, about 10,000 of them running on resources provided by the distributed ATLAS computing facility operated in the U.S., comprising the Tier-1 center at BNL and five Tier-2 centers located at nine different institutions. As the facility completed its readiness for distributed analysis, a steep ramp-up of analysis jobs, in particular at the Tier-2 sites, was observed.
Based upon the results achieved during the computing challenges and at the start of ATLAS data taking, we can confirm that the tiered, grid-based computing model is the most flexible structure currently conceivable to process, reprocess, distill, disseminate, and analyze ATLAS data. We found, however, that the Tier-2 centers may not be sufficient to reliably serve as the primary analysis engine for more than 400 U.S. physicists. As a consequence, a third tier with computing and storage resources located geographically close to the researchers was defined as an important component of the analysis chain, buffering the U.S. ATLAS analysis system from unforeseen future problems. Further, the enhancement of U.S. ATLAS institutions’ Tier-3 capabilities is essential and is planned to be built around the short- and long-term analysis strategies of each U.S. group.
An essential component of this strategy is the creation of a centralized support structure, because the principal obstacle to creating and sustaining campus-based computing clusters is the continuing support they require. Anticipating that institutions would not have access to IT professionals to install and maintain these clusters, U.S. ATLAS has spent a considerable amount of effort developing an approach for a low-maintenance implementation of Tier-3 computing. While computing expertise within U.S. ATLAS was sufficient to develop ideas at a fairly high level, only the combined expert knowledge and effort of U.S. ATLAS and OSG facilities personnel eventually resulted in a package that is easily installable and maintainable by scientists.
Dan Fraser of the Open Science Grid (OSG) has organized regular Tier-3 liaison meetings between several members of the OSG facilities, U.S. ATLAS, and U.S. CMS. Topics discussed during these meetings include cluster management, site configuration, site security, storage technology, site design, and experiment-specific Tier-3 requirements. Based on information exchanged at these meetings, several aspects of the U.S. ATLAS Tier-3 design were refined, and both the OSG and U.S. ATLAS Tier-3 documentation were improved and enhanced.
OSG has hosted two multi-day working meetings at the University of Wisconsin in Madison, attended by members of the OSG Virtual Data Toolkit (VDT) group, the Condor development team, the OSG Tier-3 group, and U.S. ATLAS. As a result, U.S. ATLAS changed the way user accounts are to be handled, based on discussions with the system manager of the University of Wisconsin GLOW cluster, leading to a simpler design for smaller Tier-3 sites. A method for installing the Condor batch software using yum repositories and RPM files was also developed; in response, the Condor and VDT teams created an official yum repository for software installation, similar to how the Scientific Linux operating system is installed. Detailed instructions were written for the installation of U.S. ATLAS Tier-3 sites, including instructions for components as complex as storage management systems based on xrootd.
Following the Madison meetings, U.S. ATLAS installed an entire Tier-3 cluster using virtual machines on a single multi-core desktop machine. This virtual cluster is used for documentation development and U.S. ATLAS Tier-3 administrator training. Marco Mambelli of OSG has provided much assistance with the configuration and installation of the software for the virtual cluster. OSG also supports a crucial user software package (wlcg-client, with support for xrootd added by OSG) used by all U.S. ATLAS Tier-3 users; this package enhances and simplifies the user environment at the Tier-3 sites.
Today U.S. ATLAS (contributing to ATLAS as a whole) relies quite extensively on services and software provided by OSG, as well as on processes and support systems that OSG has produced or evolved. This stems in part from the fact that U.S. ATLAS has fully committed itself to relying on OSG for several years now. Over roughly the last three years, U.S. ATLAS has itself invested heavily in OSG in many respects – human and computing resources, operational coherence, and more. In addition, and essential to the operation of the worldwide distributed ATLAS computing facility, the OSG efforts have aided integration with WLCG partners in Europe and Asia. The derived components and procedures have become the basis for support and operation covering the interoperation between OSG, EGEE, and other grid sites relevant to ATLAS data analysis. OSG provides software components that allow interoperability with European ATLAS sites, including selected components from the gLite middleware stack such as the LCG client utilities (for file movement, supporting space tokens as required by ATLAS) and file catalogs (server and client).
It is vital to U.S. ATLAS that the present level of service continue uninterrupted for the foreseeable future, and that all of the services and support structures upon which U.S. ATLAS relies today have a clear transition or continuation strategy.
Based on experience and observations, U.S. ATLAS suggested starting work on a coherent middleware architecture, rather than having OSG continue to provide a distribution that is a heterogeneous software system consisting of components contributed by a wide range of projects. One of the difficulties we ran into several times was due to inter-component functional dependencies, which can only be avoided if there is good communication and coordination between component development teams. A technology working group, chaired by a member of the U.S. ATLAS facilities group (John Hover, BNL), has been asked to help the OSG Technical Director by investigating, researching, and clarifying design issues, resolving questions directly, and summarizing technical design trade-offs such that the component project teams can make informed decisions. To achieve the goals as they were set forth, OSG needs an explicit, documented system design, or architecture, so that component developers can make compatible design decisions and virtual organizations (VOs) such as U.S. ATLAS can develop their own applications based on the OSG middleware stack as a platform.
As a short-term goal the creation of a design roadmap is in progress.
Regarding support services, the OSG Grid Operations Center (GOC) infrastructure at Indiana University is at the heart of the operations and user support procedures. It is also integrated with the GGUS infrastructure in Europe, making the GOC a globally connected system for worldwide ATLAS computing operations.
Middleware deployment support is an essential and complex function upon which the U.S. ATLAS facilities fully depend. The need for continued support in testing, certifying, and building middleware as the solid foundation on which our production and distributed analysis activities run has been served very well so far and will continue to exist, as will the need for coordination of the roll-out, deployment, debugging, and support of the middleware services. In addition, some level of pre-production deployment testing has been shown to be indispensable and must be maintained. This is currently supported through the OSG Integration Test Bed (ITB), which provides the underlying grid infrastructure at several sites, with a dedicated test instance of PanDA, the ATLAS Production and Distributed Analysis system, running on top of it. This implements the essential validation processes that accompany the incorporation of new grid middleware services, and new versions of existing ones, into the VDT, the coherent OSG software component repository. In fact, U.S. ATLAS relies on the VDT and OSG packaging, installation, and configuration processes that lead to a well-documented and easily deployable OSG software stack.
U.S. ATLAS greatly benefits from OSG's Gratia accounting services, as well as the information services and probes that provide statistical data about facility resource usage and site information passed to the application layer and to WLCG for review of compliance with MOU agreements.
One of the essential parts of grid operations is operational security coordination. The coordinator is provided by OSG today and relies on good contacts with security representatives at the U.S. ATLAS Tier-1 center and Tier-2 sites. Thanks to activities initiated and coordinated by OSG, a strong operational security community has grown up in the U.S. over the past few years, driven by the need to ensure that security problems are well coordinated across the distributed infrastructure.
Turning to ATLAS activities performed on the OSG facility infrastructure in 2009, the significant feature, besides the dissemination and processing of collision data late in the year, was the preparation for and execution of the STEP’09 (Scale Testing for the Experimental Program 2009) exercise, which provided a real opportunity to perform a significant scale test with all experiments scheduling exercises in synchrony with each other. The main goals were to exercise specific aspects of the computing model that had not previously been fully tested. The two most important tests were reprocessing at the Tier-1 centers, with the experiments recalling data from tape at the full anticipated rates, and the testing of analysis workflows at the Tier-2 centers. In the U.S., and in several other regions, the results were very encouraging, with the Tier-2 centers exceeding the metrics set by ATLAS. As an example, with more than 150,000 successfully completed analysis jobs at 94% efficiency, U.S. ATLAS took the lead within ATLAS. This was achieved while the Tier-2s were also continuing to run Monte Carlo production jobs, scaled to keep the Tier-2 resources fully utilized. Not a single problem with the OSG-provided infrastructure was encountered during the execution of the exercise. However, there is an area of concern that may impact the facilities’ performance in the future: as we continuously increase the number of job slots at sites, the performance of pilot submission through Condor-G and the underlying Globus Toolkit 2 (GT2) based gatekeeper has to keep up without slowing down job throughput, in particular when running short jobs. When we addressed this point with the OSG facilities team, we found that they were open to evaluating and incorporating recently developed components, such as the CREAM CE provided by EGEE developers in Italy. Intensive tests are already being conducted by the Condor team in Madison, and integration issues were discussed in December between the PanDA team and the Condor developers.
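For readers less familiar with this submission path, the minimal sketch below shows how a batch of pilots might be launched through Condor-G toward a GT2 gatekeeper using a standard Condor grid-universe submit description. The gatekeeper host, jobmanager name, and pilot wrapper script are hypothetical placeholders; production pilot factories add proxy handling, submission throttling, and monitoring that are omitted here.

```python
"""Minimal sketch of submitting pilot jobs via Condor-G to a GT2 gatekeeper.

The gatekeeper host, jobmanager, and pilot wrapper script below are
hypothetical, standing in for site-specific values.
"""
import subprocess
import tempfile

SUBMIT_TEMPLATE = """\
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
executable    = pilot_wrapper.sh
output        = pilot_$(Cluster).$(Process).out
error         = pilot_$(Cluster).$(Process).err
log           = pilot.log
queue {count}
"""


def submit_pilots(count=50):
    """Write a Condor-G submit description and hand it to condor_submit."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_TEMPLATE.format(count=count))
        submit_file = f.name
    # condor_submit queues `count` pilot jobs locally; Condor-G then
    # forwards each one to the remote GT2 gatekeeper for execution.
    subprocess.run(["condor_submit", submit_file], check=True)


if __name__ == "__main__":
    submit_pilots()
```

The scalability concern discussed above arises precisely at this boundary: every queued pilot becomes a separate interaction between Condor-G and the GT2 gatekeeper, so short payload jobs multiply the per-job submission overhead.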
Other important achievements are connected with the start of LHC operations on November 23rd. As all ATLAS detector components were working according to their specifications, including the data acquisition, trigger chain, and prompt reconstruction at the Tier-0 center at CERN, timely data dissemination to, and stable operation of, the facility services in the U.S. were the most critical aspects we focused on. We are pleased to report that throughout the (short) period of data taking (which ended on December 16th) data replication was observed at short latency, with data transfers from the Tier-0 to the Tier-1 center at BNL typically completed within 2.6 hours and transfers from the Tier-1 to 5 Tier-2 and 1 Tier-3 centers completed in 6 hours on average. The full RAW dataset (350 TB) and the derived datasets were sent to BNL, with the derived datasets (140 TB) further distributed to the Tier-2s. About 100 users worldwide immediately started to analyze the collision data, with 38% of the analysis activities observed to run on the resources provided in the U.S. Due to improved alignment and conditions data that became available following first-pass reconstruction of the RAW data at CERN, a worldwide reprocessing campaign was conducted by ATLAS over the Christmas holidays. The goal of the campaign was to process all good runs at the Tier-1 centers and have the results ready by January 1st, 2010. All processing steps (RAW -> ESD, ESD -> AOD, HIST, NTUP and TAG) and the replication of the newly produced datasets to the Tier-1 and Tier-2 centers worldwide were completed well before the end of the year.
Figure: OSG CPU hours used by ATLAS (57M hours total), color coded by facility.
In the area of middleware extensions, U.S. ATLAS continued to benefit from OSG's support for, and involvement in, the U.S. ATLAS-developed distributed processing and analysis system (PanDA), layered over OSG's job management, storage management, security, and information system middleware and services. PanDA provides a uniform interface and utilization model for the experiment's exploitation of the grid, extending not only across OSG but across EGEE and NorduGrid as well. PanDA is the basis for distributed analysis and production ATLAS-wide, and is also used by OSG itself as a WMS available to OSG VOs, with the addition this year of a PanDA-based service for OSG Integration Test Bed (ITB) test job submission, monitoring, and automation.
This year the OSG's WMS extensions program contributed the effort and expertise essential to a program of PanDA security improvements that culminated in the acceptance of PanDA's security model by the WLCG, and the implementation of that model on the foundation of the glexec system, the new OSG and EGEE standard service supporting the WLCG-approved security model for pilot-job based systems. The OSG WMS extensions program supported Maxim Potekhin and Jose Caballero at BNL, who, working closely with the ATLAS PanDA core development team at BNL and CERN, provided the design, implementation, testing, and technical liaison (with ATLAS computing management, the WLCG security team, and the developer teams for glexec and ancillary software) for PanDA's glexec-based pilot security system. This system is now ready for deployment once glexec services are offered at ATLAS production and analysis sites. The work was well received by ATLAS and the WLCG, both of which have agreed that in a new round of testing to proceed in early 2010 (validating ARGUS, a new component of the glexec software stack), other experiments will take the lead role in testing and validation in view of the large amount of work contributed by Jose and Maxim in 2009.
This will allow Jose and Maxim more time in 2010 for a new area we are seeking to build as a collaborative effort between ATLAS and OSG: WMS monitoring software and information systems. ATLAS and U.S. ATLAS are in the process of merging what have been distinct monitoring efforts, a PanDA/US effort and a WLCG/CERN/ARDA effort, together with a newly opening development effort on an overarching ATLAS Grid Information System (AGIS) that will integrate ATLAS-specific information with the grid information systems. Discussions are beginning with the objective of integrating this merged development effort with the OSG program in information systems and monitoring.
U.S. ATLAS and the OSG WMS effort are also continuing to seek ways in which the larger OSG community can leverage PanDA and other ATLAS developments. In response to feedback from prospective PanDA users outside ATLAS, a simple web-based data handling system was developed that integrates well with PanDA data handling, allowing VOs using PanDA to easily incorporate basic dataflow as part of PanDA-managed processing automation. We also look forward to the full deployment of PanDA for ITB testbed processing automation. In other joint OSG work, Condor-G constitutes the critical backbone of PanDA's pilot job submission system and, as mentioned earlier in this report, we have benefited greatly from several rounds of consultation and consequent Condor improvements to increase the scalability of the submission system for pilot jobs.
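To illustrate the pilot security model described above in schematic form, the sketch below shows the general glexec pattern: the pilot receives a user payload together with that user's delegated proxy, then invokes glexec so the payload runs under the user's mapped identity rather than the pilot's. This is not the PanDA pilot implementation; the glexec install path and the environment variable name are assumptions about a typical deployment and vary by glexec version and site configuration, and proxy validation, logging, and cleanup are omitted.

```python
"""Schematic sketch of glexec-based identity switching in a pilot job.

Not the PanDA pilot code.  The glexec binary location and the
GLEXEC_CLIENT_CERT variable name are assumptions about a typical
deployment and may differ at a given site.
"""
import os
import subprocess

GLEXEC = "/usr/sbin/glexec"  # assumed install path


def run_user_payload(payload_cmd, user_proxy_path):
    """Run the user's payload under the user's identity via glexec."""
    env = dict(os.environ)
    # Point glexec at the delegated user credential to be mapped to a
    # local account (variable name assumed; site configurations differ).
    env["GLEXEC_CLIENT_CERT"] = user_proxy_path
    # glexec authorizes the mapping and executes the payload as that user,
    # so the pilot's own credentials never run user code directly.
    result = subprocess.run([GLEXEC] + payload_cmd, env=env)
    return result.returncode


if __name__ == "__main__":
    rc = run_user_payload(["/bin/sh", "-c", "./run_analysis.sh"],
                          "/tmp/x509up_user")
    print("payload exit code:", rc)
```

The design benefit is traceability: even though the pilot is submitted under a single production identity, each user payload executes, and is accounted for, under the identity of the physicist who submitted it, which is what the WLCG-approved security model requires.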
We expect that in 2010 the OSG and its WMS effort will continue to be the principal source for PanDA security enhancements, and an important contributor to further integration with middleware (particularly Condor), and scalability/stress testing.