The following requirements apply to all Sequoia system node types except where superseded in subsequent sections.
2.4.1 Node Architecture (TR-1)
The Shared Memory Multi-Processor (SMP) nodes may be a set of processor cores sharing random access memory within the same memory address space. The cores may be connected via a high-speed, extremely low-latency mechanism to the set of hierarchical memory components. The memory hierarchy consists of at least processor registers, cache, and memory. The cache may also be hierarchical; if there are multiple caches, they may be kept coherent automatically by the hardware. The main memory may be a Uniform Memory Access (UMA) architecture. The access mechanism to every memory element may be the same from every core. More specifically, all memory operations may be accomplished with load/store instructions issued by the core to move data between registers and memory.
2.4.2 Core Characteristics (TR-1)
Each node may be an aggregate of homogeneous general-purpose computing cores consisting of high-speed instruction issue, arithmetic, logic, and memory reference execution units integrated together with the necessary control circuitry, interprocessor communications mechanism(s), and caches. All functional units and data paths may be at least 64b wide plus error detecting and correcting codes. Virtual memory data pointers may be at least 64b with at least 42b physical addresses. Each core may execute fixed-point and IEEE 754 floating-point arithmetic, logical, branching, index, and memory reference instructions. A 64-bit data word size may directly handle IEEE 754 floating-point numbers whose range is at least 10^-305 to 10^+305 and whose precision is at least 14 decimal digits. The cores and memory hierarchy may provide an appropriate mechanism for interprocessor communication, interrupt, and synchronization. The core may contain built-in error detection and fault isolation for all core components, and in particular for the floating-point units, all caches, and TLB entries. All storage elements, including but not limited to registers, caches, TLB entries, and memory, may be at a minimum SECDED protected.
2.4.3 IEEE 754 32-Bit Floating Point Numbers (TR-3)
The cores may have the ability to operate on 32-bit IEEE 754 floating-point numbers whose range is at least 10^-35 to 10^+35 and whose precision is at least 6 decimal digits, for improved memory utilization and improved execution times.
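As a concrete check, the range and precision floors quoted here and in Section 2.4.2 map directly onto C's `<float.h>` limits. A minimal sketch (the function name is illustrative, not part of the requirement) that verifies a host's native types against them:

```c
#include <float.h>

/* Returns 1 if the host's native double and float types meet the range
 * and precision floors of sections 2.4.2 (64-bit) and 2.4.3 (32-bit). */
static int fp_limits_ok(void) {
    /* 64-bit: range at least 10^-305..10^+305, at least 14 decimal digits */
    int ok64 = DBL_MAX_10_EXP >= 305 && DBL_MIN_10_EXP <= -305 && DBL_DIG >= 14;
    /* 32-bit: range at least 10^-35..10^+35, at least 6 decimal digits */
    int ok32 = FLT_MAX_10_EXP >= 35  && FLT_MIN_10_EXP <= -35  && FLT_DIG >= 6;
    return ok64 && ok32;
}
```

Any IEEE 754 binary64/binary32 implementation (DBL_MAX_10_EXP = 308, FLT_MAX_10_EXP = 38) satisfies both floors.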
2.4.4 Inter Core Communication (TR-1)
The cores may provide sufficient atomic capabilities (e.g., test-and-set or load-and-clear) along with some atomic incrementing capabilities (e.g., test-and-add or fetch-and-increment/fetch-and-decrement) so that the usual higher-level synchronizations (i.e., critical section, barrier, etc.) can be constructed. These capabilities may allow the construction of memory and execution synchronization that is extremely low latency (<70 core cycles in the contention-free case). As the number of user threads can be large in a Sequoia node, a special hardware mechanism may be provided that allows multiple threads to simultaneously register with a barrier/lock device with a total latency of less than 350 processor clocks for the maximal case of all hardware threads trying to register with the barrier on the same cycle. This corresponds to a latency of effectively less than 4 processor cycles per thread when utilizing this on-chip hardware barrier/lock mechanism. Hardware support may be provided to allow DMA to be coherent with the local node memory. Additionally, these synchronization capabilities or their higher-level equivalents will be directly accessible from user programs.
The atomic instructions API overhead, in the absence of contention, may be less than or equal to one microsecond (1.0x10^-6 seconds).
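A minimal sketch of how the atomic primitives above compose into the "usual higher level synchronizations": a test-and-set spinlock for critical sections, and a fetch-and-increment, sense-reversing barrier. This is written against portable C11 `<stdatomic.h>`; the type and function names are illustrative, not part of the requirement.

```c
#include <stdatomic.h>

/* Critical section built from test-and-set.
 * Initialize with: spinlock_t l = { .f = ATOMIC_FLAG_INIT }; */
typedef struct { atomic_flag f; } spinlock_t;

static void lock(spinlock_t *l)   { while (atomic_flag_test_and_set(&l->f)) /* spin */; }
static void unlock(spinlock_t *l) { atomic_flag_clear(&l->f); }

/* Sense-reversing barrier built from fetch-and-increment. */
typedef struct {
    atomic_int count;   /* threads arrived so far */
    atomic_int sense;   /* flips each time the barrier opens */
    int nthreads;       /* threads that must register before release */
} barrier_t;

static void barrier_wait(barrier_t *b) {
    int my_sense = atomic_load(&b->sense);
    /* fetch-and-increment registers this thread with the barrier */
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);         /* last arrival resets the count... */
        atomic_store(&b->sense, !my_sense); /* ...and releases the waiters      */
    } else {
        while (atomic_load(&b->sense) == my_sense) /* spin */;
    }
}
```

An on-chip barrier/lock device would replace the spin loops with a single registration operation, which is where the <350-clock aggregate latency target comes from.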
2.4.5 Node Interconnect Interface (TR-2)
Each node may be configured with a high-speed, low-latency interconnect (Section 2.8) interface. This interface may allow all cores in the system to simultaneously communicate synchronously or asynchronously over the high-speed interconnect. The asynchronous communications mechanisms may employ a DMA engine or equivalent that does not require the core to physically move the data. This interface may be capable of delivering the lesser of full memory bandwidth or the aggregate bandwidth of all off-node interconnect links.
2.4.6 Hardware Support for Low Overhead Threads (TR-1)
The nodes may be configured with hardware mechanisms for spawning, controlling, and terminating low overhead computational threads. This published and well-documented hardware thread interface may include a low overhead locking mechanism and a highly efficient fetch-and-increment operation for memory consistency among the threads. Offeror-supplied OpenMP and POSIX thread implementations for all provided compilers may use these hardware mechanisms to implement highly efficient programming models for node parallelism. Offeror may fully describe this hardware facility, its limitations, and the potential benefit to ASC applications of exploiting OpenMP and POSIX threads node parallelism within MPI tasks.
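The mapping this requirement anticipates can be sketched in portable code: POSIX threads updating a shared counter through an atomic fetch-and-add, which an Offeror's runtime would in turn map onto the hardware fetch-and-increment. Thread and iteration counts are arbitrary; compile with `-pthread`.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_long counter;          /* shared among all spawned threads */
static long incs_per_thread;

static void *worker(void *arg) {
    (void)arg;
    for (long i = 0; i < incs_per_thread; i++)
        atomic_fetch_add(&counter, 1);   /* maps onto hardware fetch-and-increment */
    return NULL;
}

/* Spawn n threads (n <= 64 in this sketch), each adding `incs` to the
 * shared counter; join them and return the total observed. */
static long spawn_and_count(int n, long incs) {
    pthread_t t[64];
    atomic_store(&counter, 0);
    incs_per_thread = incs;
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&counter);
}
```

The total is exact precisely because the increment is atomic; with a plain `counter++` the threads would race and drop updates.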
2.4.7 Hardware Support for Innovative Node Programming Models (TR-2)
The Sequoia nodes may be configured with hardware support for innovative node programming models such as Speculative Execution (SE) or Transactional Memory (TM) that allow the automatic extraction and execution of parallel work items where sequential execution consistency is guaranteed by the hardware, not by the programmer or compiler. These facilities may allow the correct execution of multiple work items that have infrequent load/store conflicts. These hardware facilities may allow ASC applications to utilize all SMP programming techniques (e.g., OpenMP, POSIX Threads, SE, or TM) within a single application, with the restriction that within a given subroutine only one style of node parallel programming will be active within a thread at a time. These hardware facilities and the Low Overhead Threads (Section 2.4.6) may be combinable, allowing the programming models to be nested within an application call stack. Offeror may fully describe these hardware facilities, their limitations, and the potential benefit to ASC applications of exploiting innovative programming models for node parallelism within MPI tasks.
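Since TM hardware is not portably available, the semantics can be illustrated with a software analogy: a work item reads shared state, computes speculatively, and commits with a compare-and-swap, retrying when a conflicting store is detected. Real TM/SE hardware would perform this conflict detection and rollback transparently across whole load/store sets; all names here are illustrative.

```c
#include <stdatomic.h>

static atomic_long shared_balance;   /* shared state touched by many work items */

/* Apply "balance = balance + delta" as one atomic transaction.
 * Returns the number of retries taken (illustrative only). */
static int txn_add(long delta) {
    int retries = 0;
    for (;;) {
        long seen = atomic_load(&shared_balance);   /* transactional read  */
        long next = seen + delta;                   /* speculative compute */
        /* Commit succeeds only if no conflicting store landed in between;
         * a failed commit is the "infrequent load/store conflict" case. */
        if (atomic_compare_exchange_weak(&shared_balance, &seen, next))
            return retries;
        retries++;                                  /* roll back and retry */
    }
}
```

The requirement that conflicts be infrequent matters here: the retry loop is cheap only when commits usually succeed on the first attempt.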
2.4.8 Programmable Clock (TR-2)
There may be a real-time clock per core capable of causing a hardware interrupt after a preset interval (i.e., a programmable clock). The clock frequency may be at least one megahertz, and the preset interval may be settable in increments of 10 microseconds or less. The maximum preset interval may be at least 16 seconds. This clock may have at least 24 bits.
2.4.9 Hardware Interrupt (TR-2)
The nodes may have hardware support for interrupting given subsets of cores based on conditions noted by the operating system or by other cores within the subset executing the same user application.
2.4.10 Hardware Performance Monitors (TR-1)
The cores may have hardware support for monitoring system performance. This published and well-documented hardware performance monitor (HPM) interface will be capable of separately counting hardware events generated by every thread executing on every core in the node. This HPM may count at least the following: instructions (FP/INT/BR) per cycle with or without loads/stores; cache hits, misses, and prefetches for all levels of the data and instruction cache hierarchy; TLB misses for all levels of the data and instruction cache hierarchy; branch mis-predictions; snoop requests; snoop hits; load miss penalty in cycles; and number of pipeline-flushing operations (e.g., sync). Countable events from the Floating Point Unit or Units (FPU) may include floating-point scalar and SIMD (or vector) add/subtract, multiply, fused multiply-add, divide, double load, quad load, double store, and quad store events. The available FPU events may be completely inclusive of all FPU activity, and may allow accurate calculation of the total floating-point performance (FLIN/s and FLOP/s) of a core when the counters are configured to count the floating-point events for that core. In addition, the node interconnect interfaces may have hardware support for monitoring message passing performance on all proposed networks. If hardware support for Low Overhead Threads (Section 2.4.6) or SE/TM (Section 2.4.7) is proposed, then the HPM may count relevant events to determine parallel programming (in)efficiencies. This HPM may have 64b counters and the ability to notify the node OS of counter wrapping. This data will be made available directly to applications programmers and to code development tools (Section 3.7.8).
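One practical consequence of the 64b-counter and wrap-notification requirement: a tool that samples an HPM counter can recover the true event delta across a single wrap using unsigned modular arithmetic, as this small sketch (function name illustrative) shows.

```c
#include <stdint.h>

/* Event count between two samples of a free-running 64b HPM counter.
 * Unsigned subtraction is modulo 2^64, so the result is correct even
 * when `curr` has wrapped past zero once since `prev` was read. */
static inline uint64_t hpm_delta(uint64_t prev, uint64_t curr) {
    return curr - prev;
}
```

With narrower counters (e.g., 32b) a fast event can wrap more than once between samples, which is why wrap notification plus 64b width is what makes sampling-based tools reliable.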
2.4.11 Hardware Debugging Support (TR-1)
The cores may have hardware support for debugging of user applications; in particular, hardware that enables setting regular data watchpoints (e.g., debug registers and hardware interrupts on read/write to a specific virtual memory location) and breakpoints, as well as fast versions of them via fast trap mechanisms (e.g., fast trap instructions that allow the application to trap into an exception handler without having to notify the debugging process). If hardware support for TM/SE (Section 2.4.7) is proposed, hardware support may also be proposed to allow tools to trace, debug, and analyze TM/SE threads in depth (e.g., hardware support for fine-grain memory conflict detection). The Offeror may fully describe the hardware debugging facility and its limitations. These hardware features may be made available directly to applications programmers in a published and supported API (Section 3.9.8) and utilized by the code development tools including the debugger (Sections 3.7.2 and 3.7.3).
2.4.12 JTAG Access (TR-2)
The Sequoia system nodes may be accessible over an out-of-band JTAG interface. This interface may be accessible over the service Ethernet network. This interface may provide system RAS and system administration functions an out-of-band (off the system interconnect) path.
2.4.13 No Local Hard Disk (TR-1)
The Sequoia system nodes may be configured without a hard disk drive.
2.4.14 Remote Manageability (TR-1)
All nodes may be 100% remotely manageable, and all routine administration tasks automatable in a manner that scales up to the full system size. In particular, all service operations under any circumstance on any node must be accomplished without the attachment of a keyboard, video monitor, and mouse (KVM). Areas of particular concern include remote console, remote power control, system monitoring, and node BIOS or firmware.
Offeror will fully describe all remote manageability features, protocols, APIs, utilities and management of all node types bid. Any available manuals (or URLs pointing to those manuals) describing node management procedures for each node type will be provided with the proposal.
All remote management protocols, including power control, console redirection, system monitoring, and management network functions, must be simultaneously available via the system management Ethernet. Access to all hardware system functions within the nodes must be made available at the OS level so as to enable complete system health verification.
2.4.14.1 Remote Console and Power Management (TR-3)
Offeror may provide a hardware interface to the console port of every node in the system. The console interfaces will be aggregated on the Management Ethernet. Offeror may provide a scalable and reliable hardware interface for changing the power state (up/down/reset) and querying the power state of every node or aggregate of nodes, racks, etc., that draws power. The provided power interfaces may be aggregated in a power control device on the Management Ethernet. The power control infrastructure may be able to reliably power up/down all nodes or groups of nodes in the system simultaneously. Reliable here means that 1,000,000 power state change commands may complete with at most one failure to actually change the power state of the target nodes.