



3.9 Metrics and Measurements


OSG Metrics and Measurements strives to give OSG management, VOs, and external entities quantitative details of the project’s growth throughout its lifetime. The focus for FY09 was the “internal metrics” report, for which OSG Metrics collaborated with all of the internal OSG areas to define quantitative metrics for each area, with the goal of helping each area track its progress throughout the year. This work culminated in OSG document #887, completed in September 2009.

OSG Metrics works with OSG Gratia operations to validate the data coming from the accounting system. This accounting data is reported to the WLCG on behalf of the USLHC VOs and is used in many reports and presentations to the US funding agencies. Based on the Gratia databases, the OSG Metrics team maintains a website that integrates with the accounting system and provides an expert-level view of all OSG accounting data, including RSV test results, job records, transfer records, storage statistics, cluster size statistics, and historical batch system status for most OSG sites. During FY10 we will work to migrate the operation of this website to OSG Operations and to integrate the data further with MyOSG, in order to make it more accessible and user-friendly.
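As an illustration of the kind of summary the accounting website derives from the Gratia databases, the following minimal sketch totals wall-clock hours per VO from a relational accounting database. The connection parameters, table name, and column names are placeholders for illustration only, not the actual Gratia schema; it assumes a MySQL database reachable with the MySQLdb driver.

    # Sketch: summarize wall-clock hours per VO from a Gratia-style accounting
    # database.  The host, credentials, and the table/column names below are
    # placeholders, not the real Gratia schema.
    import MySQLdb

    def hours_per_vo(host, user, passwd, db):
        conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
        cur = conn.cursor()
        # Hypothetical job-usage table: one row per job record, with the VO
        # name and the wall-clock duration in seconds.
        cur.execute(
            "SELECT vo_name, SUM(wall_duration) / 3600.0 "
            "FROM job_usage_record "
            "GROUP BY vo_name ORDER BY 2 DESC")
        return cur.fetchall()

    if __name__ == "__main__":
        for vo, hours in hours_per_vo("gratia.example.org", "reader", "secret", "gratia"):
            print(f"{vo:20s} {hours:12.1f} wall hours")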

OSG Metrics collaborates with the Storage area to expand the number of sites reporting transfers to the accounting system. We have begun to roll out new storage reporting based on new software capabilities developed for Gratia. The storage reporting is designed to provide a central, general overview of site usage while also giving local system administrators a wide range of detailed statistics. The Metrics and Measurements area will also begin investigating how to incorporate network performance monitoring data into its existing reporting activities. Several VOs are actively deploying perfSONAR-based servers that provide site-to-site monitoring of the network infrastructure, and we would like to further encourage network measurements on the OSG.

The Metrics and Measurements area continues to be involved in coordinating WLCG-related reporting efforts. It sends monthly reports to the JOT that highlight MOU sites’ monthly availability and activity, and it produces monthly data for the metric thumbnails and other graphs available on the OSG homepage. OSG Metrics participated in the WLCG Installed Capacity discussions during FY09. The Installed Capacity project was tasked with providing the WLCG with accurate and up-to-date information on the amount of CPU and storage capacity available throughout the worldwide infrastructure; the OSG worked on the USLHC experiments’ behalf to communicate their needs and to put the reporting infrastructure in place using MyOSG/OIM. The accounting extensions effort requires close coordination with the Gratia project at FNAL and includes effort contributed by ATLAS.

Based on a request from the OSG Council, we prepared a report on the various fields of science utilizing the grid, detailing the number of users and the number of wall-clock hours per field. The majority of the effort was spent working with the community VOs (as opposed to the experiment VOs) on a means of classifying their users. Also in response to stakeholder input, the OSG assessed its various areas of contribution and documented the value delivered by OSG. The goal of this activity was to develop an estimate of the benefit and cost effectiveness, and thus provide a basis for discussion of the value of the Open Science Grid (OSG).
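The field-of-science report reduces to a simple aggregation over accounting records: count distinct users and sum wall-clock hours per field. The minimal sketch below shows that aggregation, assuming a CSV export of classified accounting records; the file name and the column names (user, field_of_science, wall_hours) are illustrative, not the actual export format.

    # Sketch: distinct users and total wall-clock hours per field of science,
    # from a hypothetical CSV export of classified accounting records.
    import csv
    from collections import defaultdict

    def summarize(path):
        users = defaultdict(set)      # field -> set of distinct users
        hours = defaultdict(float)    # field -> total wall-clock hours
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                field = row["field_of_science"]
                users[field].add(row["user"])
                hours[field] += float(row["wall_hours"])
        return {f: (len(users[f]), hours[f]) for f in users}

    if __name__ == "__main__":
        for field, (n_users, wall) in sorted(summarize("usage.csv").items()):
            print(f"{field:30s} {n_users:6d} users {wall:14.1f} wall hours")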

3.10 Extending Science Applications


In addition to operating a facility, the OSG includes a program of work that extends the support of science applications in terms of both the complexity and the scale of the applications that can be effectively run on the infrastructure. We solicit input from the scientific user community both on operational experience with the deployed infrastructure and on extensions to its functionality. We identify limitations and address them with our stakeholders in the science community. In the last year the high-level focus has been threefold: (1) improve scalability, reliability, and usability, as well as our understanding thereof; (2) establish and operate a workload management system for OSG-operated VOs; and (3) establish the capability to use storage opportunistically at OSG sites.

We continued with our previously established processes designed to understand and address the needs of our primary stakeholders: ATLAS, CMS, and LIGO. The OSG “senior account managers” responsible for interfacing with each of these three stakeholders meet, at least quarterly, with their senior management to review their issues and needs. Additionally, we document each stakeholder’s desired work list from OSG and cross-map these requirements to the OSG WBS; these lists are updated quarterly and serve as a communication method for tracking and reporting on progress.


3.11 Scalability, Reliability, and Usability


As the scale of the hardware accessible via the OSG increases, we need to continuously ensure that the performance of the middleware is adequate to meet the demands. There were four major goals in this area for the last year, and they were met through close collaboration between developers, user communities, and OSG.

  1. At the job submission client level, the goal is 20,000 jobs running simultaneously and 200,000 jobs run per day from a single client installation, with a success rate in excess of 95%. These goals were met in collaboration with CMS, CDF, Condor, and DISUN, using glideinWMS, through a mix of controlled-environment tests and large-scale challenge operations across the entirety of the WLCG. For the controlled-environment tests, we developed an “overlay grid” for large-scale testing on top of the production infrastructure; this test infrastructure provides in excess of 20,000 batch slots across a handful of OSG sites. An initial large-scale challenge operation was done in the context of CCRC08, the 2008 LHC computing challenge. Condor scalability limitations across high-latency networks were discovered, which led to a substantial redesign and reimplementation of core Condor components and to subsequent successful scalability testing with a CDF client installation in Italy submitting to the CMS server test infrastructure on OSG. Testing with this testbed exceeded the scalability goal of 20,000 jobs running simultaneously and 200,000 jobs per day across the Atlantic, which paved the way for production operations to start in CMS across the 7 Tier-1 centers. Data analysis operations across the roughly 50 Tier-2 and Tier-3 centers available worldwide today are more challenging, as expected, due to the much more heterogeneous level of support at those centers. However, during STEP09, the 2009 LHC data analysis computing challenge, a single glideinWMS instance exceeded 10,000 jobs running simultaneously over about 50 sites worldwide with a success rate in excess of 95% (excluding storage and application failures). For the coming years, CMS expects to have access to up to 40,000 batch slots across the WLCG, so Condor is making further modifications to reach that goal. CMS has also requested that glideinWMS become more firewall friendly, which also affects Condor functionality. We are committed to helping the Condor team by performing functionality and scalability tests.

  2. At the storage scheduling level, the present goal is a file handling rate of 50 Hz, significantly up from the 1 Hz goal of last year. An SRM scalability of up to 100 Hz was achieved using BeStMan and HadoopFS at UNL, Caltech, and UCSD; dCache-based SRM, however, is still limited to below 10 Hz. The increase in the goal was needed to cope with the increasing scale of operations by the large LHC VOs ATLAS and CMS. The driver here is stage-out of files produced during data analysis: the large VOs find that the dominant source of failure in data analysis is the stage-out of results, with read-access problems the second most likely failure. In addition, storage reliability is receiving much more attention now, given its measured impact on job success rate. The impact of jobs on storage has become an increasingly visible issue in part because of the large improvements in submission tools, monitoring, and error accounting over the last two years: the improvements in submission tools, coupled with the increased scale of available resources, are driving up the load on storage, while the improvements in monitoring and error accounting allow us to fully identify the sources of errors. OSG is very actively engaged in understanding the issues involved, working with both the major stakeholders and partner grids.

  3. At the functionality level, this year’s goal was to improve the capability for opportunistic use of storage. Opportunistic storage has been deployed at several dCache sites on OSG and successfully used by D0 for production operations on OSG; CDF is presently testing the adoption of opportunistic storage for its production operations on OSG. However, a lot of work remains before opportunistic storage is an easy-to-use and widely available capability of OSG.

  4. OSG has successfully transitioned to a “bridge model” with regard to WLCG for its information, accounting, and availability assessment systems. This means that there are aggregation points at the OSG GOC via which all of these systems propagate information about the entirety of OSG to WLCG. For the information system this implies a single point of failure, the BDII at the OSG GOC: if this service fails, all resources on OSG disappear from view. ATLAS and CMS have chosen different ways of dealing with this. While ATLAS maintains its own “cached” copy of the information inside Panda, CMS depends on the WLCG information system. To understand the impact of the CMS choice, OSG has done scalability testing of the BDII and found the service to be reliable up to a query rate of 10 Hz. In response to this finding, the OSG GOC has deployed monitoring of the query rate of the production BDII in order to understand the operational risk implied by this single point of failure. The monitoring data for the current year showed a stable query rate of about 2 Hz, so while we will continue to monitor the BDII, we do not expect any additional effort to be needed.
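To make the BDII query-rate and latency measurements concrete, the minimal sketch below times repeated anonymous LDAP queries against a BDII endpoint using the standard ldapsearch client. The endpoint host is a placeholder; the port (2170), the base DN, and the Glue-schema filter shown are the values conventionally used by BDII deployments, and the specific attribute queried is illustrative. Note that this probe runs sequentially and so measures single-client latency; a load test at 10 Hz or beyond would run many such clients in parallel.

    # Sketch: probe a BDII endpoint with repeated anonymous LDAP queries and
    # report latency, failure count, and the sustained single-client rate.
    import subprocess
    import time

    ENDPOINT = "ldap://bdii.example.org:2170"    # placeholder host
    BASE_DN = "mds-vo-name=local,o=grid"         # conventional BDII base DN
    QUERIES = 50

    def one_query():
        """Run a single anonymous query; return (elapsed_seconds, succeeded)."""
        start = time.time()
        rc = subprocess.call(
            ["ldapsearch", "-x", "-LLL", "-H", ENDPOINT, "-b", BASE_DN,
             "(objectClass=GlueCE)", "GlueCEUniqueID"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return time.time() - start, rc == 0

    if __name__ == "__main__":
        times, failures = [], 0
        for _ in range(QUERIES):
            elapsed, ok = one_query()
            times.append(elapsed)
            failures += 0 if ok else 1
        total = sum(times)
        print(f"{QUERIES} queries, {failures} failures, "
              f"mean latency {total / QUERIES:.2f}s, "
              f"sustained rate {QUERIES / total:.1f} Hz")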

In addition, we have continued to work on a number of lower priority objectives:

  1. Globus has started working on a new release of GRAM, called GRAM5. OSG is committed to working with them on integration and large-scale testing. We have been involved in the validation of alpha and beta releases, providing feedback to Globus ahead of the official release in order to improve the quality of the released software. This also involved working with Condor on the client side, since all OSG VOs rely on Condor for submission to GRAM-based CEs. GRAM5 is now supported in Condor version 7.3.2 and above. We plan to continue integration and scalability testing after the release of GRAM5 in order to make it a fully supported CE in the OSG software stack.

  2. WLCG has expressed interest in deploying CREAM, a CE developed by INFN within EGEE, on WLCG sites in Europe and Asia. Since both ATLAS and CMS use Condor as the client to Grid resources and run jobs on WLCG resources worldwide, OSG has agreed to work with Condor to integrate CREAM support. The OSG activity involved testing Condor development releases against a test CREAM instance provided by INFN; as a result, CREAM is now supported in Condor version 7.3.2 and above. We are now running scalability tests against CREAM by deploying a CREAM instance at UCSD and interfacing it to the test cluster of 2,000 batch slots. This will allow us both to validate the Condor client and to evaluate CREAM as a possible CE candidate for OSG. We expect this work to be completed in early 2010. We are also waiting for CREAM deployment on the EGEE production infrastructure to allow for realistic large-scale testing.

  3. WLCG also uses the ARC CE; ARC is the middleware developed at and deployed on NorduGrid. We have been involved in testing the Condor client against ARC resources, which resulted in several improvements to the Condor code. For the next year, we plan to deploy an ARC CE at an OSG site and perform large-scale scalability tests, with the aim of both validating the Condor client and evaluating ARC as a possible CE candidate for OSG.

  4. CMS has approached OSG for help in understanding the I/O characteristics of CMS analysis jobs. We provided advice and expertise that resulted in much improved CPU efficiencies of CMS software on OSG sites. CMS is continuing this work by testing all of its sites worldwide. OSG will continue to be involved in these tests at the level of consulting, both to learn from them and to document the results for a wider audience.

  5. In the area of usability, an “operations toolkit” for dCache was improved. The intent of the toolkit is to provide a “clearing house” for operations tools developed at experienced dCache installations and to derive from that experience a set of tools for all dCache installations supported by OSG. This is significantly decreasing the cost of operations and has lowered the threshold of entry. Site administrators from both the US and Europe continue to contribute tools, resulting in several releases. These releases have been downloaded by a number of sites and are in regular use across the US, as well as at some European sites.

  6. Another usability milestone was the creation of a package integrating BeStMan and Hadoop. By providing integration and support for this package, we expect to greatly reduce the effort needed by OSG sites to deploy it; CMS Tier-2s and Tier-3s in particular have expressed interest in using it. In the next year we plan to further improve it based on feedback received from the early adopters.

  7. A package containing a set of procedures that allow us to automate scalability and robustness tests of a Compute Element has been created. The intent is to be able to quickly “certify” the performance characteristics of new middleware, a new site, or deployment on new hardware. The package has already been used to test GRAM5 and CREAM. Moreover, we want to offer this as a service to our resource providers so that they can assess the performance of their deployed or soon to be deployed infrastructure.

  8. Work has started on putting together a framework for using Grid resources to perform consistent scalability tests against centralized services such as CEs and SRMs. The intent is similar to that of the previous package (quickly “certifying” the performance characteristics of new middleware, a new site, or a deployment on new hardware), but using hundreds of clients instead of one. Using Grid resources allows us to achieve this scale, but requires additional synchronization mechanisms for the tests to be performed in a reliable and repeatable manner.
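In the spirit of the test packages described in items 7 and 8, the minimal sketch below runs a concurrent probe against a GRAM-based CE and reports success rate and throughput. It assumes the pre-WS GRAM client (globus-job-run) is installed and a valid grid proxy exists; the CE contact string and the client/job counts are placeholders. The actual OSG packages coordinate many more clients, distributed across Grid resources rather than threads on a single host.

    # Sketch: a concurrent CE load probe.  Each worker thread submits a series
    # of trivial jobs with globus-job-run and records successes and elapsed time.
    import subprocess
    import time
    from concurrent.futures import ThreadPoolExecutor

    CE_CONTACT = "ce.example.org/jobmanager-fork"   # placeholder contact string
    CONCURRENT_CLIENTS = 20
    JOBS_PER_CLIENT = 10

    def probe(_):
        """Submit trivial jobs back-to-back; return (successes, elapsed_seconds)."""
        ok, start = 0, time.time()
        for _ in range(JOBS_PER_CLIENT):
            rc = subprocess.call(
                ["globus-job-run", CE_CONTACT, "/bin/true"],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            ok += 1 if rc == 0 else 0
        return ok, time.time() - start

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=CONCURRENT_CLIENTS) as pool:
            results = list(pool.map(probe, range(CONCURRENT_CLIENTS)))
        ok = sum(r[0] for r in results)
        total_jobs = CONCURRENT_CLIENTS * JOBS_PER_CLIENT
        wall = max(r[1] for r in results)
        print(f"{ok}/{total_jobs} jobs succeeded, "
              f"throughput {total_jobs / wall:.2f} jobs/s over {wall:.0f}s")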

