In Place Node Service (TR-1)
The SU nodes will be serviceable from within the rack. The node will be mechanically designed to require minimal tools for disassembly. The nodes and other rack components will be mechanically designed so that complete disassembly and reassembly of a node or other rack component can be accomplished in less than 20 minutes by a trained technician without moving the rack. Blade solutions will have hot-swappable blades: the blade chassis will not require being powered down during a blade replacement repair action.
Component Labeling (TR-1)
Every rack, Ethernet switch, Ethernet cable, IBA switch, IBA cable, node, and disk enclosure will be clearly labeled with a unique identifier visible from the front of the rack and/or the rear of the rack, as appropriate, when the rack door is open. These labels will be of high quality so that they do not fall off, fade, disintegrate, or otherwise become unusable or unreadable during the lifetime of the cluster. The labels will use a non-serif font such as Arial or Courier with a font size of at least 9 pt. Nodes will be labeled from the rear with a unique serial number for inventory tracking. It is desirable that motherboards also have a unique serial number for inventory tracking. This serial number needs to be visible without disassembling the node, or else it will be queryable from Linux or the BIOS via a Linux command-line tool.
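For illustration only, the following minimal sketch shows one way a motherboard or node serial number could be queried from Linux without disassembling the node. It assumes the standard dmidecode utility is installed and that the vendor BIOS populates the SMBIOS system and baseboard serial-number strings; neither assumption is mandated by this requirement.

```python
#!/usr/bin/env python3
"""Sketch: query chassis and motherboard serial numbers without opening the node.

Assumes the dmidecode utility is installed (typically requires root) and that
the BIOS populates the SMBIOS system and baseboard serial-number fields; the
exact strings reported are vendor-specific.
"""
import subprocess

def dmi_string(keyword: str) -> str:
    # dmidecode -s <keyword> prints a single SMBIOS string.
    out = subprocess.run(["dmidecode", "-s", keyword],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    print("node serial:       ", dmi_string("system-serial-number"))
    print("motherboard serial:", dmi_string("baseboard-serial-number"))
```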
Field Replaceable Unit (FRU) Diagnostics (TR-2)
Diagnostics will be provided that isolate a failure of a TLCC2 SU component to a single FRU for the supplied hardware. These diagnostics will run from the Linux command line. The diagnostic information will be accessible to operators through networked workstations.
Node Diagnostics Suite (TR-1)
The Offeror will provide a set of hardware diagnostic programs (a diagnostic suite or diagnostics) for each type of node provided. These diagnostics will run from the Linux command line, produce output to STDERR or STDOUT, and exit with an appropriate error code when errors are detected. These diagnostics will be capable of stressing the node motherboard components such as processors, chip set, memory hierarchy, on-board networking (e.g., management network), peripheral buses, and local disk drives. The diagnostics will stress the memory hierarchy to generate single- and double-bit memory errors. The diagnostics will read the hardware single- and double-bit memory error counters and reset the counts to zero. The diagnostics will stress the local disk in a nondestructive test to generate correctable and uncorrectable read and/or write errors. The diagnostics will read the hardware and/or Linux recoverable I/O error counters and reset the counts to zero. The diagnostics will stress the integer and floating-point units in specific core(s) in serial (i.e., one processor and/or HyperThread, as appropriate, at a time) or in parallel (i.e., multiple processors and/or multiple HyperThreads, as appropriate). The CPU stress tests will bind to a specific core and/or HyperThread, as appropriate, by command-line option, if possible.
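The following sketch illustrates the command-line conventions described above: output to STDOUT/STDERR, a nonzero exit code on error, and binding to a specific core or HyperThread by command-line option. The cpu_stress routine is a placeholder rather than a vendor diagnostic, and the option names are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Sketch of the diagnostic command-line conventions; not a vendor diagnostic."""
import argparse
import os
import sys

def cpu_stress(seconds: int) -> bool:
    # Placeholder floating-point loop; a real diagnostic would exercise the
    # integer and floating-point units, caches, and memory hierarchy here.
    acc = 0.0
    for i in range(seconds * 5_000_000):
        acc += (i & 7) * 0.125
    return acc >= 0.0  # a real test would return False on a detected fault

def main() -> int:
    p = argparse.ArgumentParser(description="per-core CPU stress sketch")
    p.add_argument("--core", type=int, help="bind to this core/HyperThread")
    p.add_argument("--seconds", type=int, default=1)
    args = p.parse_args()

    if args.core is not None:
        # Linux-only: pin this process to the requested logical CPU.
        os.sched_setaffinity(0, {args.core})

    if cpu_stress(args.seconds):
        print(f"PASS core={args.core}")
        return 0
    print(f"FAIL core={args.core}", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```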
Memory Diagnostics (TR-1)
The Linux OS will interface to the SU node hardware memory error facility specified in Section to log all correctable and uncorrectable memory errors on each memory FRU. When the node experiences an uncorrectable memory error, the Linux kernel will report the failing memory FRU and panic the node. The Offeror will provide a Linux utility that can scan the nodes, report correctable and uncorrectable memory errors at the FRU level, indicating the exact motherboard location of the failing or failed DIMM, and reset the counters.
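As one illustration of FRU-level memory error reporting, the sketch below reads per-DIMM correctable and uncorrectable counts from the Linux EDAC sysfs interface. It assumes an EDAC driver is loaded for the proposed chipset and that the platform populates dimm_label with the motherboard silkscreen location; the actual Offeror utility may use a different mechanism.

```python
#!/usr/bin/env python3
"""Sketch: report per-DIMM correctable/uncorrectable counts via Linux EDAC sysfs.

Assumes an EDAC driver is loaded and exposes dimm* entries under
/sys/devices/system/edac/mc/; DIMM labels depend on the platform's DIMM map.
"""
from pathlib import Path

EDAC = Path("/sys/devices/system/edac/mc")

def read(p: Path) -> str:
    return p.read_text().strip() if p.exists() else "?"

if EDAC.is_dir():
    for mc in sorted(EDAC.glob("mc*")):
        for dimm in sorted(mc.glob("dimm*")):
            label = read(dimm / "dimm_label")      # e.g., motherboard silkscreen name
            ce = read(dimm / "dimm_ce_count")      # correctable errors
            ue = read(dimm / "dimm_ue_count")      # uncorrectable errors
            print(f"{mc.name}/{dimm.name}: label={label!r} CE={ce} UE={ue}")
        # Counters for a controller can be cleared by writing to
        # mc*/reset_counters (requires root).
else:
    print("no EDAC memory controllers found")
```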
IBA Diagnostics (TR-1)
The Offeror will provide a set of IBA hardware diagnostic programs (a diagnostic suite or diagnostics) for the IBA components provided. These diagnostics will run from the Linux command line, produce output to STDERR or STDOUT, and exit with an appropriate error code when errors are detected. These diagnostics will be able to diagnose failures in IBA HCA, cable, and switch hardware down to the individual FRU. These diagnostics will run to completion within four hours and correctly identify all failed or intermittently failing components. In addition, these IBA diagnostics will be able to detect portions of the interconnect infrastructure that are slow or that exhibit high bad-packet rates. The IBA diagnostics mentioned above shall be accessible via open source tools such as those provided by the “InfiniBand-diags” open source package.
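For illustration, the sketch below wraps two tools from the open source infiniband-diags package (ibqueryerrors and iblinkinfo) to flag ports with elevated error counters and links that trained below full speed or width. Output formats vary by package version, so the string matching shown is an assumption, not a specification.

```python
#!/usr/bin/env python3
"""Sketch: flag fabric problems using open source infiniband-diags tools.

Assumes iblinkinfo and ibqueryerrors are installed and the node can query the
subnet; output formats vary by version, so the string matching is illustrative.
"""
import subprocess
import sys

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True)

# Ports whose error counters exceed thresholds.
errs = run(["ibqueryerrors"])
# Links that trained below full speed/width.
links = run(["iblinkinfo"])

bad = [l for l in errs.stdout.splitlines() if "Errors" in l]
slow = [l for l in links.stdout.splitlines() if "Could be" in l]

for line in bad + slow:
    print(line, file=sys.stderr)
sys.exit(1 if (bad or slow) else 0)
```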
IPMI-Based Diagnostics (TR-1)
If the Offeror proposes IPMI to perform node diagnostics, such as through IPMI sensors, IPMI field replaceable unit (FRU) records, and IPMI system event log (SEL) entries, the node diagnostics shall work with FreeIPMI's ipmi-sensors, ipmi-fru, and ipmi-sel, respectively. The Offeror will publicly release documentation on any OEM motherboard sensors, OEM FRU records, and OEM SEL event data so that they can be interpreted correctly and added for public release in FreeIPMI.
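A minimal sketch of the intended interoperability follows: it pulls the SEL from a node's BMC with FreeIPMI's ipmi-sel and flags non-nominal entries. The hostname handling and the Warning/Critical event-state strings reflect FreeIPMI defaults and are assumptions for illustration; decoding OEM records depends on the documentation release called out above.

```python
#!/usr/bin/env python3
"""Sketch: scan a node's SEL with FreeIPMI's ipmi-sel and flag bad entries.

Assumes FreeIPMI is installed; remote queries also assume valid BMC credentials
configured for the management network.
"""
import subprocess
import sys

host = sys.argv[1] if len(sys.argv) > 1 else None
cmd = ["ipmi-sel", "--output-event-state"]
if host:
    cmd += ["-h", host]   # query a remote BMC over the management network

out = subprocess.run(cmd, capture_output=True, text=True)
bad = [l for l in out.stdout.splitlines()
       if "Critical" in l or "Warning" in l]

print("\n".join(bad) if bad else "SEL clean")
sys.exit(1 if bad else 0)
```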
Peripheral Component Diagnostics (TR-2)
The Offeror will provide a set of hardware diagnostic programs (a diagnostic suite or diagnostics) for each type of peripheral component provided. These diagnostics will run from the Linux command line, produce output to STDERR or STDOUT, and exit with an appropriate error code when errors are detected. At a minimum, the diagnostics will test the 1000BaseT and other provided networking adapters (e.g., Fibre Channel), RAID devices, and disks.
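As a minimal illustration of such a peripheral check, the sketch below parses ethtool output for link state and negotiated speed on a 1000BaseT management interface. The interface name and expected speed are assumptions for illustration, not part of the requirement.

```python
#!/usr/bin/env python3
"""Sketch: check link state and speed of a 1000BaseT interface via ethtool."""
import subprocess
import sys

IFACE = "eth0"                   # hypothetical management interface name
EXPECT = "Speed: 1000Mb/s"       # expected negotiated speed for 1000BaseT

out = subprocess.run(["ethtool", IFACE], capture_output=True, text=True).stdout
link_up = "Link detected: yes" in out
full_speed = EXPECT in out

if link_up and full_speed:
    print(f"PASS {IFACE} link up at 1000Mb/s")
    sys.exit(0)
print(f"FAIL {IFACE}: link={'up' if link_up else 'down'}", file=sys.stderr)
sys.exit(1)
```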
Scalable System Monitoring (TR-2)
There will be a scalable system monitoring capability for the TLCC2 SU, supplied by the Tri-Laboratory Community, that has a command line interface (CLI) for scriptable control and monitoring. This facility will directly interface to the Offeror-provided node-monitoring facility through the “Motherboard Sensors” Section facilities. The Tri-Laboratory Community will centrally collect all SU monitoring information, at intervals set by the system administrator, on one of the service nodes. All SU monitoring and diagnostics information provided by the Offeror will be formatted so that “expect scripts” can detect failures. In addition, this facility will be used by the Tri-Laboratory Community to launch diagnostics in parallel over the management network across all or part of the cluster, as directed by a system administrator from the Linux command line on one of the service nodes.
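The sketch below illustrates the kind of parallel launch and scriptable failure detection this requirement anticipates: it runs a diagnostic across a set of nodes with pdsh over the management network and scans the per-node output for failures (the pattern an expect script would match). The host range and diagnostic path are placeholders, and pdsh itself is an assumption rather than a mandated tool.

```python
#!/usr/bin/env python3
"""Sketch: parallel diagnostic launch over the management network via pdsh."""
import subprocess
import sys

HOSTS = "node[1-162]"                   # hypothetical SU host range
DIAG = "/usr/sbin/node-diag --quick"    # hypothetical diagnostic command

out = subprocess.run(["pdsh", "-w", HOSTS, DIAG],
                     capture_output=True, text=True)

# pdsh prefixes each output line with "<host>: "; flag any node reporting FAIL.
failures = sorted({l.split(":", 1)[0] for l in out.stdout.splitlines()
                   if "FAIL" in l})

print("failing nodes:", ", ".join(failures) if failures else "none")
sys.exit(1 if failures else 0)
```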