Draft statement of work


Distributed Computing Middleware



Date: 28.01.2017

3.3 Distributed Computing Middleware

3.3.1 Kerberos (TR-1)


Offeror may provide the Massachusetts Institute of Technology (MIT) Kerberos V5 reference implementation, Release 1.6 or later, client software on the proposed system. This may include a fully supported integrated login mechanism, including a Kerberos V5 PAM module, that may support the use of password authentication and Kerberos V5 ticket authentication against an MIT Kerberos authentication server. Support for Public Key Cryptography for Initial Authentication in Kerberos (PKINIT) may also be provided as a PAM module. This mechanism may comply with the authorization policies established for MIT Kerberos principals (e.g., password lifetime, account lockout) and may be capable of acquiring and storing the user's Kerberos credentials for a login session. This mechanism, supplied by the Offeror, along with the following login utilities and daemons (rsh, rcp, rlogin, ssh, scp, sftp and ftp), may be fully interoperable with the login utilities and daemons in Kerberos V5 Release 1.6 or later, as distributed by MIT, or in OpenSSH V2 distributions for ssh, scp and sftp.

3.3.2 LDAP Client (TR-1)


Offeror may provide LDAP version 3 (or then-current) client software on each ION, SN and LN in the proposed system. This may include the use of SASL/GSSAPI for authentication and SSL for integrity and privacy with support for, but may not be limited to, Kerberos V5 as a security mechanism. The supplied command-line utilities (ldapsearch, ldapmodify, ldapdelete, ldapadd) and client libraries may be fully interoperable with an OpenLDAP Release 2.4 or later LDAP server. The Offeror may provide directory service integration software to enable UNIX C library calls that perform user and group queries (e.g., getpwnam(), getpwuid(), getpwent(), getgrnam(), getgrgid(), getgrent()) to obtain this information from an LDAP directory via a caching daemon and client service library. This may include support for IETF RFC 2307 (http://www.ietf.org/rfc/rfc2307.txt) and may support a mechanism to map object classes and attributes as well as perform rules-based data transformations and filtering on attribute values or sets of attribute values.
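
The user and group queries listed above are the standard C library name-service calls. Python's standard pwd and grp modules wrap those same calls, so a quick check of the integration could look like the sketch below. It assumes a POSIX host; whether the answers come from LDAP, local files or a cache depends on the host's name-service configuration, which the sketch does not inspect. The helper name describe_user is hypothetical.

```python
import grp
import os
import pwd


def describe_user(uid):
    """Resolve a UID through the C library and return
    (user name, primary group name), or None if the UID is unknown."""
    try:
        entry = pwd.getpwuid(uid)              # wraps getpwuid()
    except KeyError:
        return None                            # no configured source knows this UID
    try:
        group = grp.getgrgid(entry.pw_gid).gr_name   # wraps getgrgid()
    except KeyError:
        group = str(entry.pw_gid)              # group unresolved: fall back to the GID
    return entry.pw_name, group


if __name__ == "__main__":
    print(describe_user(os.getuid()))
```

Running the same lookup with the directory service enabled and disabled is one simple way to confirm that entries are really being served from LDAP.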

3.3.3 NFSv4.1 Client (TR-1)


Offeror may provide NFS version 4.1 (http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-21.txt) client software with the BOS. This may include the use of RPCSEC_GSS for authentication, integrity and privacy with support for, but should not be limited to, Kerberos V5 as a security mechanism. This may include a fully supported NFSv4.1 ACL mechanism with ACL editing utilities. The ACL mechanism may provide support for users and groups in a multi-domain environment (i.e., recognize identity@domain identifiers). Offeror may support mapping NFSv4 name identifiers to UNIX UID and GID values. This may include providing directory service integration software for obtaining this information from an LDAP directory via a caching daemon and a client service library.
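
As an illustration of the identity mapping described above, the following sketch splits an identity@domain identifier and maps it to a UNIX UID. Everything in it is an assumption made for illustration: the function names, the dictionary that stands in for the LDAP lookup, and the policy of mapping foreign-domain or unknown names to a "nobody" UID of 65534.

```python
def split_nfs4_identity(identity):
    """Split 'name@domain' into (name, domain); raise ValueError if malformed."""
    name, sep, domain = identity.rpartition("@")
    if not sep or not name or not domain:
        raise ValueError(f"not a name@domain identifier: {identity!r}")
    return name, domain


def map_to_uid(identity, local_domain, uid_table, nobody_uid=65534):
    """Map an NFSv4 identifier to a local UID.

    uid_table is a hypothetical {name: uid} dict standing in for the
    directory lookup. Identities from other domains, and unknown names,
    map to nobody_uid (an assumed policy).
    """
    name, domain = split_nfs4_identity(identity)
    if domain.lower() != local_domain.lower():
        return nobody_uid
    return uid_table.get(name, nobody_uid)


if __name__ == "__main__":
    table = {"alice": 1001}
    print(map_to_uid("alice@example.org", "example.org", table))
    print(map_to_uid("alice@other.org", "example.org", table))
```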

3.3.4 Cluster Wide Service Security (TR-1)


All system services, including debugging, performance monitoring, event tracing, resource management and control, may be performed using a secure authentication and authorization protocol that interfaces to the PAM (Section 3.1.6). This protocol may be efficiently and scalably implemented so that the authentication and authorization step for a job launch of any size takes less than 50% of the total job launch time.
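
The 50% bound above can be checked from two timing measurements. The sketch below is only an arithmetic illustration; the function name and the way the two durations are obtained are assumptions, not part of the requirement.

```python
def auth_within_budget(auth_seconds, total_launch_seconds, limit=0.5):
    """Return True if authentication and authorization took less than
    `limit` (50% by default) of the total job launch time."""
    if total_launch_seconds <= 0:
        raise ValueError("total launch time must be positive")
    return auth_seconds / total_launch_seconds < limit


if __name__ == "__main__":
    # 1.2 s of a 3.0 s launch is 40% of the launch time.
    print(auth_within_budget(1.2, 3.0))
```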

3.4 System Resource Management (SRM) (TR-1)


The overall System Resource Management (SRM) requirement is to integrate Sequoia into the existing LLNS SCF enterprise-wide job management system based upon Moab [6] and SLURM [7]. Moab is a highly configurable policy-based intelligence engine that integrates scheduling, managing, monitoring and reporting of cluster workloads across multiple computers and sites. Moab relies upon resource managers on the individual clusters to manage each cluster's resources and jobs. SLURM is a highly scalable open source resource manager in use on hundreds of the largest computers in the world. It provides three key functions within an individual cluster. First, it allocates resources to users for some duration of time so they can perform work. Second, it provides a framework for initiating and managing work, typically a parallel job, on the set of allocated resources. Finally, it arbitrates contention for resources by managing a queue of pending work.

For SLURM to provide resource management on Sequoia, the interface requirements between SLURM and Offeror’s proposed software are detailed below.


3.4.1 SRM Security (TR-1)


SRM components and communications between components must be secure: users can only see and manipulate their own applications and data, and SRM components may not run as the “root” user account. User identities may be maintained throughout the chain of SRM components without giving users login capability directly on ION or SN BOS.

3.4.2 SRM API Requirements (TR-1)


Offeror’s proposed APIs may not write to STDOUT/STDERR, but may provide documented status codes. Offeror’s APIs may be reliable so that they complete successfully or return correct error codes with no more than one in ten thousand (1 in 1x10^4) attempts failing. The APIs may be thread-safe. API speed is important and all APIs may return within 10 milliseconds (polling or event triggers can be used if more time is required to complete any API function). All APIs may be usable from an I/O or service node. No SLURM daemon may execute on the compute nodes. Documentation may be provided for the APIs.
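
The reliability and latency figures above lend themselves to a simple acceptance harness. The following is a minimal sketch, assuming the API under test can be wrapped in a zero-argument callable that returns a status code with 0 meaning success; that convention, the function name check_api and the result keys are all hypothetical.

```python
import time


def check_api(call, attempts=10_000, max_failures=1, deadline_s=0.010):
    """Invoke `call` repeatedly; count failed attempts and deadline misses.

    `call` is any zero-argument callable returning a status code,
    where 0 means success (an assumed convention).
    """
    failures = 0
    slow = 0
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            status = call()
        except Exception:
            status = -1                 # an exception counts as a failed attempt
        elapsed = time.perf_counter() - start
        if status != 0:
            failures += 1
        if elapsed > deadline_s:
            slow += 1
    return {
        "failures": failures,
        "slow": slow,
        "reliable": failures <= max_failures,   # at most 1 failure in 10,000 by default
        "fast": slow == 0,                      # every call returned within the deadline
    }


if __name__ == "__main__":
    print(check_api(lambda: 0, attempts=1000))
```

If `attempts` is changed, `max_failures` should be scaled with it so the ratio stays at one in ten thousand.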

3.4.3 Node Reboot API (TR-1)


Offeror may provide APIs to reboot CN and ION. The API may provide the ability to reboot individual ION or groups of ION together with the corresponding CN. The API may also provide the ability to reboot individual ION or groups of ION without requiring the reboot of the corresponding CN. The APIs may control the LWK image to be loaded on the CN. Use of this API may be restricted so that normal users cannot write programs that use the services provided by the API.

3.4.3.1 Node “RAM Disk” API (TR-2)


Offeror may provide an API that allows configuration of the LWK “RAM Disk” as specified in Section 3.2.11, if bid. This API may allow turning the “RAM Disk” feature on and off and, if it is turned on, specifying the “RAM Disk” size and mount point. This API may also allow clearing of the “RAM Disk.”

3.4.4 Network Topology API (TR-1)


Offeror may provide APIs to determine the network topology connecting CN. This information will be used by SLURM in order to determine optimal resource allocations for pending jobs.
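
To illustrate how topology information could feed an allocation decision, the sketch below scores a candidate set of nodes by its largest pairwise hop count. It assumes the CN interconnect is a torus with wraparound links and that nodes are identified by integer coordinates; neither assumption is stated in this section, and the function names are hypothetical.

```python
def torus_hops(a, b, dims):
    """Minimum hop count between coordinates a and b on a torus whose
    size along each axis is given by dims (wraparound links assumed)."""
    hops = 0
    for x, y, size in zip(a, b, dims):
        delta = abs(x - y) % size
        hops += min(delta, size - delta)    # go whichever way round is shorter
    return hops


def allocation_diameter(nodes, dims):
    """Largest pairwise hop count within a candidate allocation."""
    return max((torus_hops(a, b, dims) for a in nodes for b in nodes), default=0)


if __name__ == "__main__":
    dims = (4, 4)
    compact = [(0, 0), (1, 0), (0, 1)]
    spread = [(0, 0), (2, 2), (0, 2)]
    print(allocation_diameter(compact, dims), allocation_diameter(spread, dims))
```

A scheduler could prefer the candidate with the smaller diameter, although real placement policies weigh many other factors.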

3.4.5 Job Manipulation Commands and API (TR-1)


Linux command line utilities and APIs may be available to reliably manipulate a job as a single entity, including kill, modify, query characteristics, and query state. Offeror’s commands and API may be reliable so that they either complete successfully, with all tasks of the job having been correctly manipulated by the command or API, or return correct error codes, with no more than one in ten thousand (1 in 1x10^4) calls to the API failing or failing to correctly manipulate all tasks in the job.

3.4.6 Job Signaling API (TR-1)


Offeror may provide APIs to send an arbitrary signal to SLURM-specified individual user tasks or groups of user tasks. Signal delivery may be reliable so that every task in the SLURM-specified group of user tasks receives the signal and executes the correct signal handler with the nominal results, with failures in fewer than 1 in 1.0x10^4 calls to the API. In particular, SIGKILL may reliably terminate any task. Use of the API may be restricted to prevent a user from signaling another user's job.
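
The behaviour required above can be mimicked on a single POSIX host with ordinary child processes standing in for user tasks. The sketch below is only an analogy for the vendor API: the helper names are hypothetical, and it relies on the standard subprocess module, where a negative return code means the child was terminated by that signal.

```python
import signal
import subprocess
import sys


def launch_sleepers(count):
    """Start `count` idle child processes standing in for user tasks."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def signal_tasks(tasks, signum):
    """Send signum to every task that is still running.
    Returns the PIDs that were signalled."""
    signalled = []
    for task in tasks:
        if task.poll() is None:         # still running
            task.send_signal(signum)
            signalled.append(task.pid)
    return signalled


if __name__ == "__main__":
    tasks = launch_sleepers(3)
    signal_tasks(tasks, signal.SIGKILL)
    # A negative return code means the task was terminated by that signal.
    print([task.wait() for task in tasks])
```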

3.4.7 User Task Launch API (TR-1)


Offeror may provide APIs to launch user tasks on CN. The APIs may provide the capability of executing different applications with different arguments for each task. The APIs may provide the capability of launching specific tasks on SLURM-specified core(s) within a node and binding tasks to SLURM-specified core(s). Job launch time may vary by no more than the logarithm of the task count. Job sizes up to one task per CN core will be supported. Launching tasks into a stopped state may be supported for debugging. Use of the API must be restricted so that users may not have the ability to launch tasks on resources that have not been allocated to them. Task launch time for jobs with small binaries may not exceed 3 seconds for 8,192 tasks.
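
One common way to obtain launch times that grow with the logarithm of the task count is to forward the launch request down a tree. That technique is an assumption here, not something this requirement prescribes. Under that assumption, the sketch below computes the tree depth needed to reach a given number of tasks; the function name and the fan-out values are hypothetical.

```python
def launch_rounds(task_count, fanout=2):
    """Smallest depth d such that a tree with the given fan-out has at
    least task_count leaves, i.e. fanout**d >= task_count."""
    if task_count < 1 or fanout < 2:
        raise ValueError("task_count must be >= 1 and fanout >= 2")
    rounds, reached = 0, 1
    while reached < task_count:
        reached *= fanout               # each round multiplies the reach by fanout
        rounds += 1
    return rounds


if __name__ == "__main__":
    for fanout in (2, 16):
        print(fanout, launch_rounds(8192, fanout))
```

With a fan-out of 2 the 8,192-task case needs 13 forwarding rounds, and with a fan-out of 16 it needs 4, which gives a feel for the per-round time budget implied by a 3-second limit.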

3.4.8 User Task Connectivity API (TR-1)


Offeror may provide APIs to establish connectivity for application programs to make use of the CN interconnect for their communications. Use of the API may be restricted so that users must not have the ability to gain access to other jobs' communications.

3.4.9 SRM STDIO (TR-1)


Offeror may provide SRM APIs that allow SLURM, during job launch, to identify and distinguish between STDOUT and STDERR for each user MPI task in a job. This API may allow SLURM to send data to STDIN of each user MPI task in a job.
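
The per-task stream handling described above can be pictured with ordinary child processes. The sketch below is a single-host analogy with hypothetical names: each stand-in task gets its own STDIN data and separate STDOUT and STDERR pipes, and the launcher tags every captured stream with the task's rank.

```python
import subprocess
import sys

# Stand-in task: echoes its STDIN to both output streams with a prefix.
TASK_CODE = (
    "import sys\n"
    "data = sys.stdin.read()\n"
    "print('out:' + data)\n"
    "print('err:' + data, file=sys.stderr)\n"
)


def run_task(rank, stdin_text):
    """Run one stand-in task and return its output as
    (rank, stream name, text) tuples, one per stream."""
    proc = subprocess.run(
        [sys.executable, "-c", TASK_CODE],
        input=stdin_text,               # data delivered to the task's STDIN
        capture_output=True,            # separate pipes for STDOUT and STDERR
        text=True,
        check=True,
    )
    return [
        (rank, "stdout", proc.stdout.strip()),
        (rank, "stderr", proc.stderr.strip()),
    ]


if __name__ == "__main__":
    for rank in range(2):
        for record in run_task(rank, f"input for task {rank}"):
            print(record)
```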

3.4.10 System Initiated Checkpoint API (TR-3)


Offeror may provide APIs to checkpoint a parallel job. The API may provide support for creating a checkpoint and then either continuing execution or terminating. Use of the API may be restricted to prevent a user from checkpointing another user's job. Offeror may provide APIs or another mechanism for restarting a previously checkpointed job. Use of the API and/or checkpoint file permissions may be restricted to prevent a user from restarting another user's job.

3.4.11 Predicting Failed Nodes (TR-2)


Offeror may provide an API that returns a list of CN and ION that are predicted to fail within the SLURM-specified period of time. This facility will be used by SLURM to drain CN from the available pool and prevent queued jobs from running on them until the CN or ION has been repaired.
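
The draining step described above amounts to removing predicted failures from the schedulable pool. The sketch below shows one way this could look. It assumes node names are plain strings and that a predicted ION failure also drains the CN served by that ION; that policy, the ion_to_cn map and the function name are all hypothetical.

```python
def drain(available_cn, predicted, ion_to_cn):
    """Return the CN that remain schedulable, in their original order.

    predicted  : names of CN and ION predicted to fail
    ion_to_cn  : hypothetical map from each ION to the CN it serves
    """
    doomed = set(predicted)
    for ion, cns in ion_to_cn.items():
        if ion in doomed:
            doomed.update(cns)          # CN behind a failing ION are drained too
    return [cn for cn in available_cn if cn not in doomed]


if __name__ == "__main__":
    pool = ["cn0", "cn1", "cn2", "cn3"]
    served = {"ion0": ["cn0", "cn1"], "ion1": ["cn2", "cn3"]}
    print(drain(pool, ["cn1", "ion1"], served))
```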
