A Light-Weight Communication System for a High Performance System Area Network
Amelia De Vivo
Prof. Alfredo De Santis Prof. Alberto Negro
The current trend in parallel computing is to build clusters from Commodity Off The Shelf (COTS) components. Because of the limitations of standard communication adapters, System Area Networks (SANs) have been developed with the main purpose of supporting user-level communication systems. These remove the operating system from the critical communication path, achieving much higher performance than standard protocol stacks such as TCP/IP or UDP/IP.
This thesis describes a user-level communication system for a new System Area Network, QNIX (Quadrics Network Interface for LinuX), currently in development at Quadrics Supercomputers World. QNIX is a standard 64-bit PCI card, equipped with a RISC processor and up to 256 MB of local RAM. This makes it possible to move most of the communication task onto the network interface and to hold all related data structures in the QNIX local RAM, avoiding heavy swapping to host memory.
The QNIX communication system gives user processes direct and protected access to the network device. It consists of three parts: a user library, a driver and a control program running on the network interface.
The user library is the interface to the QNIX communication system. The driver provides the only two operating system services required: registration of user processes with the network device and virtual address translation for DMA transfers. The control program running on the network interface schedules the network device among the processes that request it, executes zero-copy data transfers and handles flow control.
First experimental results show that the QNIX communication system achieves about 180 MB/s payload bandwidth for message sizes of 4 KB and 3 µs one-way latency for zero-payload packets. The achieved bandwidth is about 90% of the expected peak for the QNIX network interface.
I would like to thank my advisor Alberto Negro for the interesting discussions that contributed to my research work and for the suggestions he gave me during the preparation of this thesis.
I thank Quadrics Supercomputers World for supporting my research activity. In particular I want to thank Drazen Stilinovic and Agostino Longo, who allowed me to work in the QSW offices in Rome. I also thank the QSW R&D Department, in particular Stefano Pratesi and Guglielmo Lulli, who helped me a great deal with their knowledge and are my co-authors on a recent publication.
I also want to thank Roberto Marega, who was my first advisor at QSW, and Vittorio Scarano for his moral support.
Table of Contents
Chapter 1
1.1 HPC: from Supercomputers to Clusters
1.1.1 Vector Supercomputers
1.1.2 Parallel Supercomputers: SIMD Machines
1.1.3 Parallel Supercomputers: MIMD Machines
1.1.4 Clusters of Workstations and Personal Computers
1.2 System Area Networks
1.2.5 SCI (Scalable Coherent Interface)
1.2.6 Memory Channel
1.3 Communication Systems for SANs
1.4 The QNIX Project
1.5 Thesis Contribution
1.6 Thesis Organisation
Chapter 2
User-level Communication Systems
2.1 Active Messages
2.1.1 Active Messages on Clusters
2.1.2 Active Messages II
2.2 Illinois Fast Messages
2.2.1 Fast Messages 1.x
2.2.2 Fast Messages 2.x
2.3 Virtual Memory Mapped Communication (VMMC)
2.5 Virtual Interface Architecture (VIA)
Chapter 3
The QNIX Communication System
3.2 Design Choices
3.3 Data Structures
3.3.1 NIC Process Structures (NPS)
3.3.2 Host Process Structures (HPS)
3.3.3 NIC System Structures (NSS)
3.3.4 Host System Structures (HSS)
3.4 The Device Driver
3.5 The NIC Control Program
3.5.1 System Messages
3.5.2 NIC Process Commands
3.5.3 NIC Driver Commands
3.6 The QNIX API
3.7 Work in Progress and Future Extensions
Chapter 4
First Experimental Results
4.1 Development Platform
4.2 Implementation and Evaluation
For a long time High Performance Computing (HPC) was based on expensive parallel machines built with custom components. There was no common standard for such supercomputers, so each of them had its own architecture and programming model tailored to a specific class of problems. Consequently, while they were very powerful in their application domain, they generally performed very poorly outside it. Moreover, they were hard to program and application codes were not easily portable from one platform to another. For these and other reasons parallel processing was never widely exploited. However, the dramatic improvement in processor technology, together with the ever-increasing reliability and performance of network devices and cables, gives parallel processing a new chance: using clusters of workstations and even personal computers as an efficient, low-cost parallel machine.
Since cluster performance depends heavily on the interconnection network and on the communication system software, this new trend in the HPC community has generated a great deal of research in both fields. As a result, some years ago a new class of networks specifically designed for HPC on clusters, the so-called System Area Networks (SANs), began to appear. Such networks are equipped with user-level communication systems that bypass the operating system on all critical communication paths. In this way the software communication overhead is considerably reduced and user applications can benefit from the high performance of SAN technology. In recent years several user-level communication systems have been developed. They differ in the types of primitives they offer for data transfers, in the way incoming data is detected and handled, and in the type and amount of work they move onto the NIC (Network Interface Card) when it is programmable.
This thesis describes the user-level communication system for a new SAN, QNIX (Quadrics Network Interface for LinuX), currently in development at Quadrics Supercomputers World. QNIX is a standard 64-bit PCI card, equipped with a RISC processor and up to 256 MB of local RAM. This makes it possible to move most of the communication task onto the NIC, so that the host processor is offloaded as much as possible. Moreover, the memory size makes it possible to hold all data structures associated with the communication task in the NIC local RAM, avoiding heavy swapping between host and NIC memory. The communication system described here consists of three parts: a user library, a driver and a control program running on the NIC processor. The library functions allow user processes to issue commands directly to the network device, completely bypassing the operating system. The driver is responsible for registering with the NIC the processes that will need network services, mapping the appropriate resources into the process address space, locking user memory and translating virtual addresses. The control program lets the NIC serve the requests of registered processes under a fair policy, scheduling the NIC among them.
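As a rough illustration of how the three parts could cooperate, the following C sketch models registration (driver), command posting (user library) and fair service (NIC control program). It is a minimal sketch under stated assumptions, not the QNIX implementation: the text says only that registered processes are served fairly, so the round-robin policy is an assumption, and every name here (MAX_PROCS, proc_slot, nic_sched, sched_register, sched_post, sched_serve_one) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROCS 8   /* hypothetical limit on registered processes */

/* One slot per process registered with the NIC (hypothetical layout). */
struct proc_slot {
    bool registered;   /* set at registration time (driver's role) */
    unsigned pending;  /* commands posted but not yet served */
};

struct nic_sched {
    struct proc_slot slots[MAX_PROCS];
    size_t next;       /* slot at which the next scan starts */
};

/* Driver role: register a process; returns its slot index, or -1 if full. */
static int sched_register(struct nic_sched *s)
{
    for (size_t i = 0; i < MAX_PROCS; i++) {
        if (!s->slots[i].registered) {
            s->slots[i].registered = true;
            s->slots[i].pending = 0;
            return (int)i;
        }
    }
    return -1;
}

/* Library role: post one command on behalf of a registered process. */
static void sched_post(struct nic_sched *s, int id)
{
    assert(id >= 0 && id < MAX_PROCS && s->slots[id].registered);
    s->slots[id].pending++;
}

/* Control program role: serve one command from the next process that has
 * pending work, scanning round-robin (assumed policy). Returns the index
 * of the served slot, or -1 if no process has pending commands. */
static int sched_serve_one(struct nic_sched *s)
{
    for (size_t n = 0; n < MAX_PROCS; n++) {
        size_t i = (s->next + n) % MAX_PROCS;
        struct proc_slot *p = &s->slots[i];
        if (p->registered && p->pending > 0) {
            p->pending--;
            s->next = (i + 1) % MAX_PROCS;  /* resume after the served slot */
            return (int)i;
        }
    }
    return -1;
}
```

Because each scan resumes just after the slot served last, a process that posts many commands cannot starve the others: with two registered processes and commands pending on both, successive calls to sched_serve_one alternate between them.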
This chapter is structured as follows. Section 1.1 describes the evolution of HPC from supercomputers to clusters of workstations and personal computers. Section 1.2 introduces SANs and discusses the general features of the best-known ones. Section 1.3 describes the basic principles of communication systems for this kind of network. Section 1.4 introduces the QNIX project. Section 1.5 introduces the problems addressed by this thesis. Section 1.6 describes the structure of the thesis.