7.5 Installation Plan (TR-2)
Offeror will provide site installation instructions to LLNS delineating all site preparation work necessary to install and operate the systems, as configured in the subcontract. These instructions may delineate the type of electrical equipment required for installation (power couplings and placement, floor loading, etc.). This information will be delivered to LLNS within 30 days of receipt of subcontract for Dawn, and within six months of receipt of contract for the technology refresh (if applicable) and the Sequoia systems.
End of Section 7
8.0 Project Management
Achieving petascale performance for Dawn and Sequoia is an extremely daunting task. In order to be successful, the selected Offeror / LLNS partnership will need to focus efforts in three major areas: 1) scalability; 2) scalability; and 3) scalability. Hardware scalability is key to minimizing power consumption and to achieving RAS and application performance. Software scalability is key to system management and RAS and to the ability of applications to efficiently utilize O(1.5-3.0M) cores/threads. Integrated system scalability is the key to system usability, power consumption, system physical size and RAS. The challenges the Tri-Laboratory community faces in providing a platform on the scale of Dawn and Sequoia to meet the DOE Stockpile Stewardship programmatic requirements are no less daunting. Moreover, these challenges are not only technical, but also manifest themselves in the management and administration of the project. All have a substantial impact on risk and therefore on the probability of project success. LLNS recognizes that, ultimately, the selected Offeror is responsible for the successful integration of all the elements, including those acquired from third parties, academia, and other ASC-related efforts, to provide the petascale computing environment needed to meet the national goals of the Stockpile Stewardship Program. LLNS, NNSA and the selected Offeror must recognize this acquisition as a primary institutional commitment. LLNS expects its partners to successfully meet this commitment.
The experience gained by Lawrence Livermore, Los Alamos, and Sandia in the installation of the first six generations of ASC platforms, namely Red, Blue, White, Q, Purple (and BlueGene/L), and RedStorm, has demonstrated that such an activity taxes the resources and management capabilities of even the largest and best-managed organizations.
Some of the lessons collectively learned in fielding and integrating those systems into a useful scientific simulation environment include the following:
The most important lesson learned is that this effort, if it is to succeed, must truly be a “partnership” among all involved. While careful mutual planning on the part of LLNS and the selected Offeror is essential to meeting requirements, unforeseen events and changes are likely. These events can only be successfully dealt with by a partnership that goes beyond an ordinary vendor-customer relationship. It must be one in which teaming, mutual respect, and an honest desire to achieve success are present on the part of everyone involved.
Changes in a company’s technology roadmap can have significant consequences for the success of the project. Whether from development delays or fundamental changes in a company’s technology decisions, change is almost inevitable. It is therefore important that such changes be quickly evaluated for their impact on the project. In addition, strategies must be developed and discussed to mitigate technical and scheduling problems.
Component availability for system manufacture can affect delivery schedules. This is particularly true for new equipment, for which only a limited quantity of components is available. LLNS has also found this to be the case for some older components owing to the large volumes needed for a system of this size, as well as to the commitments lower-tier suppliers may have to other customers.
Manufacturing, assembly, and QA for a system of this size can tax even the largest companies. It is therefore important to ensure that sufficient capacity, without compromising quality assurance, is available at the times necessary to meet delivery schedules. In addition, the development and system stress testing of software releases and patches is an ongoing problem for ASC-sized systems, because these systems are usually the largest systems fielded by a vendor by a wide margin. Software testing and releases must be carefully planned in order to test software cost-effectively.
Significant resources are needed at the factory for pre-delivery staging and testing. LLNS has found it best to perform pre-delivery staging and testing of portions of the system prior to shipment to LLNL to minimize installation problems later. Doing so reduces the number of “DOA” and infant-mortality component failures, helps to ensure correct hardware, software and firmware operation, and allows for execution of company and LLNS test programs.
LLNS has found that such activity requires the selected Offeror to provide resources at the factory in the form of floor space, ancillary equipment (e.g., disks, interconnects), and personnel.
Installation needs to proceed in a logical, coordinated manner. Systems that are shipped without disks, for example, are generally not usable and take up valuable installation resources and floor space. This problem also speaks to the need for outstanding coordination among all elements within a company to ensure that hardware and software availability is coordinated (i.e., when new hardware is available, the software to drive it is also available).
Shipment logistics have an impact. LLNS expects to have only approximately 1,500 square feet adjacent to the computer building loading dock to stage deliveries prior to installation in the computer room. This limitation should be taken into account when formulating delivery plans.
It is therefore important to quickly install each shipment as it arrives. Arrangements may be necessary to ensure that sufficient personnel with the appropriate training are available for installation, as well as factory-based resources to assist as needed if problem escalation is warranted. Again, the sheer size and complexity of this system may require that extraordinary measures be taken by the selected Offeror. Successful installation test completion will be required prior to the initiation of any acceptance test.
Acceptance testing is an extension of earlier testing. Although the pre-delivery and installation tests will identify many problems prior to acceptance testing, it has been LLNS’ experience that new problems may surface. The availability of on-site and factory-based resources to correct such problems is important.
Stabilizing the system as quickly as possible is programmatically important. It is imperative that the selected Offeror and those supplying third-party products work closely to that end. This arrangement will require that applications engineers as well as hardware and system software engineers be on-site to resolve problems. LLNS access to the system and third-party source code, although important in earlier and later stages, is critical at this point; its importance cannot be overemphasized. Again, the unprecedented size and complexity of this system dictate that resources over and above the norm will undoubtedly need to be brought to bear.
Post-stabilization resource requirements will also be significant. It is easy to underestimate the number of hardware engineers and software and applications analysts with the appropriate experience and training required to maintain high reliability and availability, to make the best use of the system, and to resolve problems quickly.
It is also easy to underestimate the extent of the necessary spare parts inventory. Because of the classified environment, use of remote diagnostic procedures will not be allowed.
Because of the complexity of this activity, a very strong project plan is of great importance. The Offeror’s understanding of LLNS’ requirements, approach to meeting those requirements, commitment of resources, and attention to cost are critical to the success of the project. In the same vein, the approach to managing this activity is critical. The need to have the support of corporate senior management and a major commitment to a quality assurance plan are also examples of areas critical to the success of the project.
The specific detailed planning, effort tracking, and documentation requirements for the development and manufacturing efforts that will be delivered as part of the subcontract(s) are delineated in Section 8.2, Detailed Sequoia Plan Of Record.
The specific target delivery milestones for the project are delineated in Section 8.3, Project Milestones.