The Sequoia baseline system performance shall be at least 20.0 petaFLOP/s (20.0 x 10^15 floating point operations retired per second).
2.1.1 Sequoia System Performance (TR-1)
The Sequoia system performance may be at least , where P is the peak of the system as defined in Section 2.1 and S is the weighted figure of merit for five applications as defined in Section 9.4.2.
2.2 Sequoia Major System Components (TR-1)
Offeror’s proposed Sequoia system will include the following major system components (see Figure 1-5): the Compute Nodes (CN) and I/O Nodes (ION), the Login Nodes (LN), the Service Nodes (SN), and the management Ethernet. Not shown in the figure is the interconnect network(s) that provides high-speed, low-latency RDMA and MPI communications between the nodes in the system. The remaining components in Figure 1-5, including the Storage Area Network (SAN) and the Lustre MDS and OSS resources, will be supplied by LLNS and integrated with the proposed Sequoia system by the selected Offeror and LLNS in partnership.
Offeror’s technical proposal will include a concise description of the Sequoia system architecture that includes the following:
System architectural diagram showing all nodes, networks, external network connections and their proposed functions.
Detailed architectural diagram of each node type showing all major components (e.g., processor cores and their functional units, caches, memory, system interconnect interfaces, DMA engines, etc.) and data pathways along with latency and bandwidth to move data between them.
Detailed architectural diagram showing all management networking components, connections to Sequoia system, and connections to the front-end nodes.
Number of nodes required or recommended by the Offeror for system functions (e.g., cluster-wide file system operation, switch operation and management, RAS and other system management systems, user login), clearly denoted as NOT part of the compute nodes.
The known, anticipated, and suspected I/O performance limiters and bottlenecks, clearly indicated.
2.2.1 IO Subsystem Architecture (TR-1)
The CN IO data path for file IO to the LLNS supplied Lustre file system may run from the CN over the system interconnect to the ION, where the IO operations are handled by the Lustre client, and then over the Offeror provided SAN interface and the LLNS supplied SAN infrastructure to the LLNS supplied Lustre MDS and OSS. The CN IO data path for IP based communications to the LLNS SAN based IP devices may run from the CN over the system interconnect to the ION, where the IP packets are routed over the Offeror provided SAN interface and the LLNS supplied SAN infrastructure to the LLNS supplied IP based devices. The LN IO data path, for both file IO to the Lustre file system and SAN based IP devices, is over the external networking interfaces on the LN.
The Sequoia target architecture (see Figure 1-5) provides for a static allocation of CN to ION. This provides scalable IO bandwidth proportional to job size (the number of CN and ION utilized by a job): full-system jobs running on 100% of the CN achieve at least 100% of the delivered IO bandwidth, half-system jobs running on 50% of the CN achieve at least 50% of the full-system delivered IO bandwidth, and quarter-system jobs running on 25% of the CN achieve at least 25% of the full-system delivered IO bandwidth. This target architecture also allows for a distributed and scalable system software infrastructure by utilizing the ION to perform some of the processing in parallel.
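The proportional IO scaling described above can be sketched numerically. The following Python fragment is illustrative only; the 512 GB/s full-system figure is taken from the worked example in Section 2.3 (20.0 petaFLOP/s x 25.6 GB/s:petaFLOP/s), not from this subsection.

```python
# Illustrative sketch (not part of the requirement): with a static CN-to-ION
# allocation, delivered IO bandwidth scales linearly with the fraction of
# compute nodes a job uses.

FULL_SYSTEM_IO_GBPS = 512.0  # full-system sustained SAN bandwidth (Section 2.3 example)

def delivered_io_gbps(cn_fraction: float) -> float:
    """Minimum delivered IO bandwidth (GB/s) for a job using cn_fraction of all CN."""
    if not 0.0 < cn_fraction <= 1.0:
        raise ValueError("cn_fraction must be in (0, 1]")
    return FULL_SYSTEM_IO_GBPS * cn_fraction

# Full-, half-, and quarter-system jobs:
for frac in (1.0, 0.5, 0.25):
    print(f"{frac:>4.0%} of CN -> at least {delivered_io_gbps(frac):6.1f} GB/s")
```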
As a separately priced option specified in Section 2.12.1, Offeror may propose an enhanced IO subsystem that allows smaller jobs to achieve 2x the IO file system bandwidth of the baseline system.
2.3 Sequoia Component Scaling (TR-1)
In order to provide maximum flexibility to Offerors in meeting the goals of the ASC Program, the exact configuration of the Sequoia scalable system is not specified. Rather, the Sequoia configuration is given in terms of lower bounds on component attributes relative to the peak performance of the proposed configuration. The Sequoia scalable system configuration may meet or exceed the following parameters:
Memory Size (Byte:FLOP/s) 0.08
Memory Bandwidth (Byte/s:FLOP/s) 0.2
Node Interconnect Aggregate Link Bandwidth (Byte/s:FLOP/s) 0.15
Node Interconnect Minimum Bi-Section Bandwidth (Byte/s:FLOP/s) 0.0025
System Sustained SAN Bandwidth (GB/s:petaFLOP/s) 25.6
High Speed External Network Interfaces (GB/s:petaFLOP/s) 6.4
The foregoing parameters will be computed as follows:
Peak FLoating point OPeration per second (FLOP/s) rate computation: the maximum number of floating point arithmetic operations (a chained multiply-add counts as two) that can be completed per core cycle per compute node, times the number of compute nodes in the system. The peak FP arithmetic operation rate is measured in petaFLOP/s = 10^15 FLOP/s.
Memory Size computation: the number of bytes of main memory directly addressable with a single LOAD/STORE instruction (excluding caches, ROM, and EPROM) on each compute node, times the number of compute nodes in the system. Memory is measured in pebibytes (PiB) = 2^50 bytes.
Memory Bandwidth/Peak FP Instructions (Byte/s:FLOP/s) computation: maximum number of bytes per second that some or all of the cores in a node can simultaneously move between main memory and processor registers (node memory bandwidth) in the compute nodes times the total number of compute nodes in the system divided by the peak FLOP/s of the system.
Node Interconnect Aggregate Link Bandwidth computation: intra-cluster network link bandwidth is the peak speed at which user data can be moved bi-directionally to/from a compute node over a single active network link. It is calculated by taking the MHz rating of the link times the width in bytes of that link, minus the overhead associated with link error protection and addressing. The node interconnect aggregate link bandwidth is the sum, over all active compute node links in the system, of the node interconnect link bandwidths. Passive standby network interfaces and links for failover may not be counted.
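The link-bandwidth arithmetic above can be sketched as follows; the clock rate, link width, overhead fraction, and link count used here are hypothetical illustrations, not values from this document.

```python
# Hypothetical sketch of the per-link and aggregate link bandwidth computation.
# Per-link: link clock (MHz) times link width (bytes), discounted by the
# fraction consumed by link error protection and addressing.

def link_bandwidth_gbps(clock_mhz: float, width_bytes: float,
                        overhead_fraction: float) -> float:
    """Peak user-data bandwidth of one active link, in GB/s."""
    raw_gbps = clock_mhz * 1e6 * width_bytes / 1e9  # raw bytes/s expressed in GB/s
    return raw_gbps * (1.0 - overhead_fraction)

def aggregate_link_bandwidth_gbps(active_links: int, per_link_gbps: float) -> float:
    """Sum over all active compute-node links; passive standby links are excluded."""
    return active_links * per_link_gbps

# Made-up example: a 500 MHz, 4-byte-wide link with 20% protocol overhead.
per_link = link_bandwidth_gbps(500.0, 4.0, 0.20)
```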
Node Interconnect Minimum Bi-Section Bandwidth computation: A bi-section of the system is any division of the compute nodes that evenly divides the total system into two equal partitions. A bi-section bandwidth is the peak number of user payload bytes per second that could be moved bi-directionally across the high speed interconnect network between compute nodes summed over each compute node in one partition communicating to one other compute node in the other partition. The Node Interconnect Network Minimum Bi-Section Bandwidth is the minimum over all bi-sections of the bi-section bandwidths.
System Sustained SAN Bandwidth (GByte/s:petaFLOP/s) computation: the system sustained file system bandwidth is the measured rate at which an application can read or write data to/from the LLNS supplied Lustre file system from all CN through the ION and the LLNS supplied SAN to the Lustre OSS. Note that the SAN connects to the Sequoia ION. The methodology for measuring this metric is specified in Section 2.9.1. Note that Section 2.12.1 enhances this SAN bandwidth requirement with a separately priced Technical Option that configures the Sequoia system to deliver 2x this bandwidth to applications running on 50% and 25% subdivisions of the system.
High Speed External Network Interfaces (GB/s:petaFLOP/s) computation: the high speed external network interface link bandwidth (in GB/s) is the HW rated uni-directional link bandwidth. This is the data rate, so it is 4.0 GB/s for InfiniBand 4x QDR and 1.25 GB/s for 10GbE (IEEE 802.3ae) Ethernet. The cluster high speed external network interfaces bandwidth is the sum over all the external network interface link bandwidths. Note that the External Network connects to the Sequoia LN.
Example: for a 20.0 petaFLOP/s peak system, Section 2.3 specifies that the system may have at least 1.6 PiB of memory, 4.0 PB/s of memory bandwidth, 3.0 PB/s of node interconnect aggregate link bandwidth, 50 TB/s of intra-cluster networking bi-section bandwidth, 512 GB/s of system sustained SAN bandwidth, and 128 GB/s of peak external networking bandwidth.
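The example's figures follow directly from the Section 2.3 ratios. A quick arithmetic check, assuming the 20.0 petaFLOP/s peak from Section 2.1:

```python
# Recomputing the worked example from the component-scaling ratios of Section 2.3.
PEAK_PFLOPS = 20.0

memory_size_bytes = 0.08 * PEAK_PFLOPS * 1e15  # 1.6e15 bytes of memory
memory_bw_pbps    = 0.2  * PEAK_PFLOPS         # 4.0 PB/s memory bandwidth
link_bw_pbps      = 0.15 * PEAK_PFLOPS         # 3.0 PB/s aggregate link bandwidth
san_bw_gbps       = 25.6 * PEAK_PFLOPS         # 512 GB/s sustained SAN bandwidth
external_bw_gbps  = 6.4  * PEAK_PFLOPS         # 128 GB/s external network bandwidth
```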