3.3 Operations
The OSG Grid Operations Center (GOC) is the central point of operational support for the Open Science Grid and coordinates various distributed OSG services. The GOC performs real-time monitoring of OSG resources, supports users, developers, and system administrators, maintains critical information services, provides incident response, and acts as a communication hub. The primary goals of the OSG Operations group are: supporting and strengthening the autonomous OSG resources, building operational relationships with peering grids, providing reliable grid infrastructure services, ensuring timely action on and tracking of operational issues, and assuring quick response to security incidents. In the last year, the GOC continued to provide the OSG with a reliable facility infrastructure while improving its services to offer more robust tools to OSG stakeholders.
During the last year, the GOC continued to provide and improve tools and services for the OSG:
- The OSG Information Management (OIM) database, the definitive source of topology and administrative information about OSG users, resources, support agencies, and virtual organizations, was updated to provide new data to the WLCG based on installed-capacity requirements for Tier 1 and Tier 2 resources. Additional WLCG issue-tracking requirements were also added to ensure preparedness for LHC data-taking.
- The Resource and Service Validation (RSV) monitoring tool received a new command-line configuration tool plus several new tests, including security, accounting, and central service probes.
- The BDII (Berkeley Database Information Index), which is critical to CMS production, now operates under a Service Level Agreement (SLA) that was reviewed and approved by the affected VOs and the OSG Executive Board.
- The MyOSG information consolidation tool reached production deployment and is now heavily used within the OSG for status information; it has also been ported to peering grids such as EGEE and the WLCG. MyOSG allows administrative, monitoring, information, validation, and accounting services to be displayed within a single user-defined interface.
- The GOC-provided public trouble-ticket viewing interface received updates to improve usability. This interface allows issues to be tracked and updated by users, and it also allows GOC personnel to use OIM metadata to route tickets much more quickly, reducing the time needed to look up contact information for resources and support agencies.
We also continued our efforts to improve service availability through the completion of several hardware and service upgrades:
- The GOC services located at IU were moved to a new, more robust physical environment, providing much more reliable power and network stability.
- The BDII was thoroughly tested to ensure that outage conditions on a single BDII server are quickly identified and corrected at the WLCG top-level BDII.
- The migration of many services to a virtual machine environment is now complete, allowing greater flexibility in providing high-availability services.
- Virtual machines for training were provided to GOC staff to help troubleshoot installation and job submission issues.
OSG Operations is ready to support the LHC re-start and continues to refine and improve its capabilities for these stakeholders. We have prepared for the stress of the LHC start-up on services by testing, by putting proper failover and load-balancing mechanisms in place, and by implementing administrative automation for several user services. Service reliability for GOC services has always been high, and we now gather metrics showing that the reliability of these services exceeds the requirements of their Service Level Agreements (SLAs). SLAs have been finalized for the BDII and MyOSG, and SLAs for the OSG software cache and RSV are being reviewed by stakeholders. Regular release schedules have been implemented for all GOC services to enhance user testing and make the software release cycle for GOC-provided services more predictable. One additional staff member will be added to support the increase in ticket load expected with the LHC re-start. During 2009, the GOC and Operations team completed the transition from the building phase, for both service infrastructure and user support, into a period of stable operational service for the OSG community. With these preparations complete, we are now focusing on providing a reliable, sustainable operational infrastructure and a well-trained, proactive support staff for OSG production activities.
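To illustrate the kind of availability measurement that underlies such SLA metrics, the sketch below periodically polls a service endpoint and reports the fraction of successful checks. It is a minimal illustration only; the endpoint URL, probe interval, and check count are assumptions for the example, not the GOC's actual monitoring implementation.

```python
# Minimal availability probe: poll a service endpoint at a fixed interval and
# report the fraction of successful checks, the raw number behind an SLA metric.
# The URL, interval, and check count are illustrative assumptions only.
import time
import urllib.request

SERVICE_URL = "https://myosg.example.org/status"   # hypothetical endpoint
PROBE_INTERVAL = 300                               # seconds between checks
CHECKS = 12                                        # one hour of probing in this example

successes = 0
for i in range(CHECKS):
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=30) as response:
            if response.status == 200:
                successes += 1
    except Exception:
        pass  # any failure counts as downtime for this check
    if i < CHECKS - 1:
        time.sleep(PROBE_INTERVAL)

availability = 100.0 * successes / CHECKS
print(f"availability over window: {availability:.1f}%")
```

In practice such checks would run continuously and feed a metrics store rather than a single printed summary, but the availability calculation is the same.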
3.4 Integration and Site Coordination
The OSG Integration & Sites Coordination activity continues to play a central role in improving the quality of grid software releases prior to deployment on the OSG and in helping sites deploy and operate OSG services, thereby achieving greater success in production.
The major focus was preparation for OSG 1.2, the golden release for the LHC restart. Beginning in April 2009, the three-site Validation Test Bed (VTB) was used repeatedly to perform installation, configuration, and functional testing of pre-releases of the VDT. These tests shake out basic usability problems prior to release to the many-site, many-platform, many-VO integration test bed (ITB). In July, the deliverable of a well-tested OSG 1.2 suitable for LHC running was met through the combined efforts of OSG core staff and the wider community of OSG site administrators and VO testers participating in the validation. The face-to-face site administrators' workshop hosted by the OSG GOC in Indianapolis in July featured deployment of OSG 1.2 as its main theme.
The VTB and ITB test beds also serve as a platform for testing the main configuration tools of the OSG software stack, which are developed as part of this activity. The configuration system has grown to cover the full set of services offered in the software stack, including configuration of storage element software.
A new tool suite for RSV configuration and operational testing, developed in the Software Tools Group, was vetted repeatedly on the VTB, with feedback given to the development team from the site administrators' viewpoint.
The persistent ITB continues to provide an ongoing operational environment in which VOs can test at will, as well as serving as the main platform during ITB testing prior to a major OSG release. Overall, the ITB comprises 12 sites providing compute elements and four sites providing storage elements (dCache and BeStMan packages implementing the SRM v1.1 and v2.2 protocols); 36 validation processes were defined across these compute and storage resources in readiness for the production release. Pre-deployment validation of applications from 12 VOs was coordinated with the OSG Users group. Other accomplishments include dCache and SRM-BeStMan storage element testing on the ITB, delivery of new releases of the configuration tool discussed above, and testing of an Xrootd distributed storage system as delivered by the OSG Storage group.
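One piece of that storage validation can be pictured as a simple round-trip copy against an SRM endpoint: write a small file to the storage element, read it back, and compare. The sketch below is only an illustration under assumptions; the client command name, endpoint, and path are hypothetical placeholders, not the actual ITB validation scripts.

```python
# Illustrative round-trip test against an SRM storage element: copy a small file
# in, copy it back, and compare. The client command name and endpoint below are
# assumptions for the sketch, not the ITB's actual validation tooling.
import filecmp
import os
import subprocess
import tempfile

SRM_ENDPOINT = "srm://se.example.edu:8443/osg/test/itb-probe"  # hypothetical
COPY_CMD = "srm-copy"  # placeholder for whichever SRM client the site uses

def run(*args):
    """Run a command and raise if it fails, so a broken SE surfaces immediately."""
    subprocess.run(args, check=True)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "probe.txt")
    back = os.path.join(tmp, "probe_back.txt")
    with open(src, "w") as f:
        f.write("ITB storage validation probe\n")

    run(COPY_CMD, "file://" + src, SRM_ENDPOINT)      # write to the storage element
    run(COPY_CMD, SRM_ENDPOINT, "file://" + back)     # read it back
    ok = filecmp.cmp(src, back, shallow=False)
    print("storage element round-trip:", "OK" if ok else "MISMATCH")
```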
The OSG Release Documentation continues to receive significant edits from the community of OSG participants. This collection of wiki-based documents captures the installation, configuration, and validation procedures used throughout the integration and deployment processes. The documents were updated and received review input from all corners of the OSG community.
In the past year we have taken a significant step toward increasing direct contact with the site administration community through the use of a persistent chat room (“Campfire”). We offer three-hour sessions at least three days a week during which OSG core Integration or Sites support staff are available to discuss issues, troubleshoot problems, or simply “chat” about OSG-specific issues. The sessions are archived and searchable.
Finally, a major new initiative to improve the effectiveness of OSG release validation has begun, whereby a suite of test jobs can be executed through the (pilot-based) Panda workflow system. The test jobs can be of any type and flavor; the current set includes simple ‘hello-world’ jobs, CPU-intensive jobs, and jobs that exercise access to and from the storage element associated with the CE. Importantly, ITB site administrators are provided a command line tool they can use to inject jobs aimed at their site into the system, and then monitor the results using the full monitoring framework (pilot and Condor-G logs, job metadata, etc.) for debugging and validation at the job level. As time goes on, additional workloads can be created and executed by the system, simulating components of VO workloads.
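The sketch below illustrates the general idea of defining a few job flavors and handing them to a site-targeted submission command. It is not the actual injection tool; the command name, its options, and the site label are hypothetical, and the payloads simply mirror the flavors described above.

```python
# Illustrative sketch of injecting validation jobs aimed at a specific ITB site.
# The job flavors mirror those described above (hello-world, CPU-intensive, storage I/O);
# the submission command "itb-inject" and its options are hypothetical placeholders,
# not the actual Panda-based tool provided to site administrators.
import subprocess

SITE = "ITB_EXAMPLE_CE"   # hypothetical ITB site name

# Each flavor maps to the shell payload the pilot would eventually run.
JOB_FLAVORS = {
    "hello-world": "echo hello from $(hostname)",
    "cpu-intensive": "python -c 'print(sum(i*i for i in range(10**7)))'",
    "storage-io": "dd if=/dev/zero of=testfile bs=1M count=100 && rm testfile",
}

for flavor, payload in JOB_FLAVORS.items():
    # Hypothetical command-line tool that queues the job for the chosen site;
    # results would then be followed through the pilot and Condor-G logs.
    subprocess.run(
        ["itb-inject", "--site", SITE, "--label", flavor, "--payload", payload],
        check=True,
    )
    print(f"submitted {flavor} job to {SITE}")
```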