Joint High Performance Computing Exchange (JHPCE)

Johns Hopkins School of Public Health

Dept. of Biostatistics, Dept. of Molecular Microbiology & Immunology

Updated: August 2016
Overview

The Joint High Performance Computing Exchange (JHPCE) is a fee-for-service high-performance computing (HPC) facility. It supports 400 active users from the School of Public Health and the School of Medicine. Its resources are optimized for life-science research (e.g. genomics, biomedical imaging and statistical computing).

The primary resources of the facility are: 1) a heterogeneous 80-node computing cluster with 3024 64-bit cores and 21.3 TB of DDR-SDRAM, 2) a 4PB Lustre storage cluster, 3) 1PB of ZFS NAS storage for home directories and archives and 4) 1PB of ZFS NAS storage for disk-to-disk backup.

A parallel file system (Lustre) serves as the main project storage space and as scratch storage. The Lustre-over-ZFS-on-Linux parallel file system is a custom design, and to our knowledge it is the lowest-cost and lowest-power (on a per-TB basis) parallel file system ever constructed. We hold workshops to train staff at other universities in how to build and manage similar systems (see e.g. https://www.disruptiveStorage.org).

The compute cluster network fabric consists of 10Gbps Ethernet connected to a 40Gbps Mellanox core switch. The storage cluster network fabric consists of 40Gbps Ethernet connected through the core switch.

Shell access (ssh) is through two redundant login nodes. File transfer is via two transfer nodes with 40Gbps access to the JHU Science DMZ and the JHU internal “Research Network”.

Jobs are submitted via the Open Grid Engine (OGE) scheduler. Dedicated queues provide stakeholder research groups with priority access to sub-resources, e.g. specialized hardware or licenses purchased by stakeholders.
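For orientation, a minimal OGE batch script might look as follows; the job name, queue name and resource limits here are illustrative, not actual JHPCE queue names:

```shell
#!/bin/bash
# Minimal Open Grid Engine batch script (illustrative values).
#$ -N demo_job                  # job name
#$ -cwd                         # run in the submission directory
#$ -l mem_free=4G,h_vmem=4G     # request ~4 GB of memory
#$ -q shared.q                  # hypothetical queue name
echo "running on host: $(hostname)"
```

The script would be submitted with `qsub demo_job.sh`; the `-q` line is what routes a job to a dedicated stakeholder queue rather than the default queue.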

Major statistical and mathematical computing packages are available, including R, SAS, STATA, Matlab, and Mathematica. In addition, there exists a broad range of domain-specific application software (genomics, proteomics, epidemiology, medical imaging, machine learning, etc.) that is maintained and shared by the community. Community support and knowledge sharing is enabled via an active listserv. Community application “experts” are given access to a special partition, where they can install and maintain software (e.g. R, genomics tools, etc.) for use by the entire community.



Backup

Snapshots are taken nightly on our ZFS systems. Remote disk-to-disk backup is performed nightly. Home directories are backed up by default; all other data is backed up upon request.
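As a sketch, a nightly ZFS snapshot is typically created under a dated name; the pool and dataset names below are assumed for illustration:

```shell
# Build a dated snapshot name as a nightly cron job might (dataset name assumed).
snapname="tank/home@nightly-$(date +%Y-%m-%d)"
echo "$snapname"
# On the ZFS host, the snapshot would then be created with:
#   zfs snapshot "$snapname"
# and expired snapshots later removed with:
#   zfs destroy tank/home@nightly-<old-date>
```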



Data Center & Satellite Server Room

The bulk of the JHPCE facility is located at the Maryland Advanced Research Computing Center (MARCC) colocation facility on the JHU Bayview campus, where we occupy 9 racks. The datacenter houses our clusters as well as the MARCC "Bluecrab" cluster. There is ample room for expansion. The entire facility currently has 1.5MW of electrical capacity, with the ability to scale to 10MW. We also have a 250 sq. ft. satellite server room on the 3rd floor of the School of Public Health Wolfe St. building.



Security and HIPAA compliance

The JHPCE system can only be accessed via the secure shell (ssh) protocol. Either two-factor authentication or public/private key-pair authentication is required. The login nodes sit behind the JHU institutional firewall. ssh login is not allowed on the transfer nodes; traffic to ports on the transfer nodes is controlled via ACLs on the University's core switches. An Intrusion Detection System (IDS) monitors the network.
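Key-pair authentication works as sketched below; the host name is a placeholder, not the real JHPCE login address:

```shell
# Generate an ed25519 key pair in the current directory (no passphrase, demo only).
ssh-keygen -t ed25519 -f ./demo_key -N "" -q
ls demo_key demo_key.pub
# The public half would then be installed on the cluster, e.g.:
#   ssh-copy-id -i demo_key.pub user@login.example.jhu.edu
```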

Access logs of user logins are maintained and can be examined retrospectively. Multiple failed login attempts result in the offending IP address being blocked. We enforce a strict one-person/one-account policy: no sharing of user accounts is permitted, and violations can result in a user being banned from the system. Either ACLs or unix user and group permissions are used to enforce data-access policies (depending on the storage system). Access permissions are determined by project PIs in consultation with the JHPCE staff. Both the Director (F. Pineda) and the Technology Manager (M. Miller) have taken executive training in cybersecurity.
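For example, group-based enforcement on a project directory can be sketched as follows; the directory name and mode are chosen for illustration:

```shell
# Create a project directory readable/writable only by its owning group.
mkdir -p projA_data
chmod 2770 projA_data    # setgid + rwx for owner and group, nothing for others
stat -c '%a' projA_data  # prints 2770 on Linux
```

A finer-grained alternative on ACL-enabled file systems is `setfacl -m g:projA:rwx projA_data`, which grants a second (hypothetical) group access without changing the owning group.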

Management and System Administration

The facility is a Johns Hopkins University service center in the department of Biostatistics. It is run by a multi-departmental team led by Dr. Fernando Pineda (MMI), who provides technical, financial and strategic leadership. Mr. Mark Miller (MMI) is the Cluster Technology manager and lead system administrator. Mr. Jiong Yang (Biostat) is the systems engineer. Ms. Debra Moffit (Biostat) provides financial oversight. The Biostat Information Technology (BIT) committee provides advice and oversight and is chaired by Dr. Pineda.



Policies and Cost recovery

The facility is operated as a formal Common Pool Resource (CPR) with a software-enabled governance/business model consistent with best practices for CPRs [1]. Since 2007 we have employed a systematic resource- and cost-sharing methodology that has largely eliminated the administrative and political complexity associated with sharing complex and dynamic resources. Charges are calculated monthly but billed quarterly. Eight years of operation have demonstrated that this is a powerful approach to the fair and efficient allocation of human, computational and storage resources. In addition, the methodology provides strong financial incentives for stakeholders to join the CPR, to behave well, and to refresh their hardware.



Computing and storage resources are owned by stakeholders (e.g. departments, institutes or individual labs). Job scheduling policies guarantee that stakeholders receive priority access to the computing resources that they own. Stakeholders are required to make their excess capacity available to other users through a low-priority queue. Stakeholders receive a reduction in their charges in proportion to the capacity that they share with other users. This strategy provides a number of advantages: 1) stakeholders see a stable upper-limit on the operating cost of their own resources, 2) stakeholders can buy surge capacity from other stakeholders and 3) non-stakeholders obtain access to high-performance computing on a low-priority basis with a pay-as-you-go model.
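As a toy illustration (not the actual JHPCE billing formula), the sharing discount can be thought of as scaling a stakeholder's charge by the fraction of owned capacity they actually reserve:

```shell
# Hypothetical numbers: 100 owned core-months at $10 each, with 30%
# left available to the shared low-priority queue.
owned=100; rate=10; shared_pct=30
charge=$(( owned * rate * (100 - shared_pct) / 100 ))
echo "monthly charge: \$$charge"   # prints: monthly charge: $700
```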
[1] Elinor Ostrom, "Beyond Markets and States: Polycentric Governance of Complex Economic Systems", American Economic Review 100 (June 2010): 1–

