This section describes the overall TLCC2 Scalable Unit (SU) strategy and architecture.
TLCC2 Strategy
As described above, the Tri-Laboratory ASC Program community requires a large amount of capacity computing resources over the next Government fiscal year. In order to affordably and efficiently provide this Production-quality computing capacity, the TLCC2 technical committee has embarked on an approach that extends existing practices within the Tri-Laboratory community while significantly improving total cost of ownership (TCO). During market survey discussions with industry, the Tri-Laboratory identified the approach of highly replicated “scalable units” that can be easily built, shipped, sited at the receiving laboratory, and accepted quickly. In addition, a balance must be struck between the number of SUs deployed and the effort required to maintain a large number of separate clusters at each site. Thus, the Tri-Laboratory needs to aggregate SUs into clusters built from 1, 2, 4, 8, and 16 SUs. The strategy is to purchase (under the subcontract resulting from this RFP) all of the components needed to build these cluster SU aggregations, together with the second-stage InfiniBand Architecture (IBA) 4x QDR switches.
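For illustration only, the sketch below tallies node counts and second-stage switch ports for each planned SU aggregation; the per-SU node count and the number of second-stage uplinks per SU are hypothetical placeholders, not values specified by this RFP.

    # Hypothetical tally of nodes and second-stage IBA 4x QDR switch ports per SU aggregation.
    # NODES_PER_SU and UPLINKS_PER_SU are illustrative assumptions only.
    NODES_PER_SU = 160     # assumed compute nodes per scalable unit
    UPLINKS_PER_SU = 18    # assumed uplinks from each SU's first-stage switches to the spine

    for sus in (1, 2, 4, 8, 16):
        nodes = sus * NODES_PER_SU
        spine_ports = sus * UPLINKS_PER_SU if sus > 1 else 0  # a single-SU cluster needs no spine
        print(f"{sus:2d} SU(s): {nodes:5d} nodes, {spine_ports:3d} second-stage switch ports")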
The delivered SUs, in some set of aggregations called clusters, will be integrated into the existing classified simulation environments at the receiving laboratory. As such, the SUs need to integrate into the receiving Laboratory’s existing multi-cluster parallel file system. For LLNL and SNL, that file system is Lustre from Oracle (http://www.oracle.com); for LANL, it is PanFS from Panasas (http://www.panasas.com/panfs.html).
By replicating the SU many times during the subcontract, the Tri-Laboratory intends to reduce the cost to produce, deliver, install and accept each SU. In addition, this approach will produce a common Linux cluster hardware environment for the Tri-Laboratory user and system administration communities and thus reduce the cost of supporting Linux clusters and programmatically required applications on those clusters. However, the Tri-Laboratory prefers the SU design to be flexible enough to accommodate the following technology improvements:
- Processor frequency improvements within the same cost and power envelopes
- New processor socket and/or chipset improvements
- New processor cores
- Disks with higher capacity
- New memory speed and capacity improvements
- Interconnect bandwidth and latency improvements
Due to the extremely attractive cost/performance of x86-based Linux clusters, the large number of x86-based clusters already fielded at all Tri-Laboratory sites, and the need for at least 2 GB of memory per processor core, this procurement focuses on solutions that are binary compatible with AMD x86-64 and Intel EM64T and have an InfiniBand 4x QDR interconnect or better. Dual-socket nodes with multi-core processors are widely available. The resulting B:F ratio, which measures the interconnect bandwidth of the node (B) against the peak 64-bit floating-point arithmetic performance of the node (F), provides acceptable balance for our applications with this technology choice.
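As a worked illustration of the B:F calculation, the short sketch below computes the ratio for a hypothetical dual-socket node attached by a single InfiniBand 4x QDR link; the socket count, core count, clock frequency, and flops-per-cycle figures are assumptions chosen for illustration only and are not node characteristics specified by this RFP.

    # Hypothetical B:F calculation for a dual-socket node with one 4x QDR InfiniBand link.
    # All node parameters below are illustrative assumptions, not TLCC2 requirements.

    # 4x QDR InfiniBand: 40 Gbit/s signaling with 8b/10b encoding yields 32 Gbit/s of data,
    # i.e. 4 GB/s per direction per link.
    link_bandwidth_GB_s = 4.0

    # Assumed node: 2 sockets x 8 cores x 2.5 GHz x 4 double-precision flops per cycle.
    sockets = 2
    cores_per_socket = 8
    clock_GHz = 2.5
    flops_per_cycle = 4

    peak_GF = sockets * cores_per_socket * clock_GHz * flops_per_cycle  # 160 GFLOP/s

    bf_ratio = link_bandwidth_GB_s / peak_GF
    print(f"Peak node performance: {peak_GF:.0f} GFLOP/s")
    print(f"B:F ratio: {bf_ratio:.4f} bytes per flop")  # about 0.025 B/F under these assumptions

Under these assumed figures the node delivers roughly 0.025 bytes of interconnect bandwidth per peak flop; the acceptable balance noted above depends on the actual node and link characteristics proposed.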
In order to minimize TLCC2 cluster support costs and the time to migrate a TLCC2 cluster into classified Production status, the Tri-Laboratory community will supply the Linux cluster software for building, burning in, and accepting the SUs. A Digital Versatile Disk (DVD) will be provided containing the TLCC2 CCE software stack, configured for this purpose. This variant of the CCE software stack consists of a Red Hat Enterprise Linux distribution that has been enhanced to support the vendor-supplied hardware, the cluster system management tools required to install, manage, and monitor the SU, and a Tri-Lab workload test suite. It is the intent of the Tri-Laboratory community to use the InfiniBand software stack provided within Red Hat Enterprise Linux (RHEL) on Production computing clusters. Additional InfiniBand functionality may be added from the OpenFabrics Alliance or the greater open source community as needs arise. More specific details of CCE are provided in the section below. The Tri-Lab workload test suite will be used as the SU burn-in and pre-ship test and then run again as a post-ship test after the SU is delivered and assembled at the receiving Laboratory. Once the SUs are delivered, the Offeror and the receiving Laboratory will be responsible for combining multiple SUs, as directed by the ASC Program user community, into clusters using the Offeror-supplied spine switches and cables. Final acceptance of these clusters will be accomplished with a scaled-up version of the pre/post-ship test.
The support model for TLCC2 is extremely simple. The Tri-Laboratory community will supply Level 1 and Level 2 support, with the Offeror providing Level 3 support functions. On the software side, the Offeror is required to provide and support (Level 3) device drivers and other low-level specialized software for the provided hardware (including IBA). This support should be provided during normal working hours, Monday through Friday. On the hardware side, the Offeror is required to provide an on-site parts cache of Field Replaceable Units (FRUs) sufficient to cover at least one week’s worth of failures without refresh, and a Return Merchandise Authorization (RMA) mechanism for the return of failed FRUs and the refreshing of the on-site parts cache. The Tri-Laboratory personnel at each site will provide Level 1 and Level 2 software and hardware support functions that include responding to problem reports, performing root cause analysis, reading diagnostics, and swapping FRUs. The Offeror shall provide training, on an as-needed basis, for hardware FRU replacement over the lifetime of the support period of the subcontract. The Offeror will supply software maintenance for each Offeror-supplied software component, starting with the first SU acceptance and ending three years after cluster acceptance.
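As a rough illustration of how a one-week parts cache might be sized, the sketch below converts assumed annualized failure rates (AFRs) into expected weekly failure counts per FRU type; the component counts and AFR values are purely hypothetical and are not drawn from this RFP or any Offeror data.

    import math

    # Hypothetical sizing of a one-week FRU parts cache from assumed annualized failure rates.
    # Component counts and AFR values are illustrative assumptions only.
    fru_inventory = {
        # FRU type: (units fielded, annualized failure rate)
        "disk":         (1600, 0.03),
        "DIMM":         (12800, 0.005),
        "power supply": (1700, 0.02),
        "IBA HCA":      (1600, 0.01),
    }

    WEEKS_PER_YEAR = 52
    for fru, (count, afr) in fru_inventory.items():
        expected_weekly_failures = count * afr / WEEKS_PER_YEAR
        # Round up so the cache covers at least one week of failures without refresh.
        cache_size = math.ceil(expected_weekly_failures)
        print(f"{fru:13s}: stock at least {cache_size} (expected {expected_weekly_failures:.2f} failures/week)")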