The QNIX communication system implementation and experimentation described here have been realised on a Dell PowerEdge 1550 system running the Linux 2.4.2 operating system and equipped with the 64-bit PCI IQ80303K card. This is the evaluation board of the Intel 80303 I/O processor.
The Dell PowerEdge 1550 system has the following features: 1 GHz Intel Pentium III, 1 GB RAM, 32 KB first level cache (16 KB instruction cache and 16 KB two-way write-back data cache), 256 KB second level cache, 133 MHz front side memory bus and 64-bit 66 MHz PCI bus with write-combining support.
The Intel 80303 I/O processor is designed to be the main component of a high-performance, PCI-based intelligent I/O subsystem. It is based on a 100 MHz 80960JT processor core able to execute one instruction per clock cycle. The IQ80303K evaluation board has the following features: 64 MB of 64-bit SDRAM (expandable up to 512 MB), 16 KB two-way set-associative instruction cache, 4 KB direct-mapped data cache, 1 KB internal data RAM, 100 MHz memory bus, 64-bit 66 MHz PCI interface, address translation unit connecting the internal and PCI buses, DMA controller with two independent channels, direct addressing to/from the PCI bus, and hardware support for unaligned transfers. Additional features are available for development purposes, among them a serial console port based on a 16C550 UART, a JTAG header, a general-purpose I/O header and 2 MB of Flash ROM containing the MON960 monitor code.
A number of software development tools are available for the IQ80303K platform. We have used the Intel CTOOLS development toolset, which includes advanced C/C++ compilers, an assembler, a linker and utilities for execution profiling. To establish serial or PCI communication with the IQ80303K evaluation board, we have used the GDB960 debugger. The interface between this debugger and MON960 is provided by the MON960 Host Debugger Interface, while communication between them is provided by the SPI610 JTAG Emulation System. This is a Spectrum Digital product and represents the default communication link between the host development environment and the evaluation board. It is based on the JTAG interface of the 80303 I/O processor.
Implementation and Evaluation
The current QNIX communication system implementation can manage up to 64 registered processes on the network device. Every process can have 32 pending send and 32 pending receive operations. The maximum message size is 2 MB; larger messages must be sent with multiple operations. The maximum Buffer Pool size is 8 MB.
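The 2 MB message limit means that a larger transfer has to be broken into several operations. The following is a minimal sketch of such a split, assuming a simple sequential chunking; the function names are hypothetical and are not part of the QNIX interface, and the check against the 32 pending sends is only illustrative, since pending slots are presumably reused as operations complete.

```python
# Illustrative sketch only: names are hypothetical, limits are taken from the text.
MB = 1024 * 1024
MAX_MESSAGE_SIZE = 2 * MB     # maximum size of a single message
MAX_PENDING_SENDS = 32        # pending send operations per process


def split_message(total_size, max_size=MAX_MESSAGE_SIZE):
    """Return (offset, length) pairs covering total_size bytes,
    each no longer than max_size."""
    if total_size < 0:
        raise ValueError("total_size must be non-negative")
    if total_size == 0:
        # Assumption: a zero-payload message is still one operation.
        return [(0, 0)]
    chunks = []
    offset = 0
    while offset < total_size:
        length = min(max_size, total_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def fits_pending_window(total_size):
    """True if all chunks could be posted without exceeding the
    number of pending send operations allowed per process."""
    return len(split_message(total_size)) <= MAX_PENDING_SENDS
```

For example, `split_message(5 * MB)` yields two 2 MB operations followed by one 1 MB operation.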
The number of processes that can use the network device concurrently is limited to 64 because the IQ80303K has only 64 MB of local memory. Every Virtual Network Interface takes 516 KB (512 KB for the Context Regions and 4 KB for all the other components), so 64 Virtual Network Interfaces take about half of the NIC memory. The remainder, except for a few KB for static NSS and NIC control program code, is left for buffering and dynamic NSS. This ensures that a buffer for incoming data is very likely to be available even under heavy load.
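The memory budget above can be checked with a few lines of arithmetic. This sketch uses only the figures given in the text, assuming binary units (1 KB = 1024 bytes); the constant and function names are chosen here for illustration.

```python
# Illustrative arithmetic only: figures from the text, names are hypothetical.
KB = 1024
MB = 1024 * KB

NIC_MEMORY = 64 * MB          # local memory on the IQ80303K
MAX_PROCESSES = 64            # registered processes, one VNI each
CONTEXT_REGIONS = 512 * KB    # per Virtual Network Interface
OTHER_COMPONENTS = 4 * KB     # per Virtual Network Interface


def vni_size():
    """Memory taken by one Virtual Network Interface, in bytes."""
    return CONTEXT_REGIONS + OTHER_COMPONENTS


def vni_total(processes=MAX_PROCESSES):
    """Memory taken by all Virtual Network Interfaces, in bytes."""
    return processes * vni_size()


def remaining_memory(processes=MAX_PROCESSES):
    """NIC memory left for code, buffering and dynamic NSS, in bytes."""
    return NIC_MEMORY - vni_total(processes)
```

With 64 processes, `vni_total()` is 33024 KB (32.25 MB), slightly more than half of the 64 MB available, which leaves 31.75 MB for everything else.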
Among the design issues discussed in section 3.2, data transfer and address translation deserve some consideration here.
Regarding the data transfer mode from host to NIC, on our platform programmed I/O appears to be more convenient than DMA for data transfers of up to 1024 bytes, so we fix this value as the maximum size for short messages. For programmed I/O with write-combining we have found that PCI bandwidth stabilises at around 120 MB/s for packet sizes from 128 bytes onwards. DMA transfers, instead, reach a sustained PCI bandwidth of 285 MB/s for packet sizes ≥ 4 KB.
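The resulting policy reduces to a size threshold. A minimal sketch follows, assuming the 1024-byte limit is inclusive ("up to 1024 bytes"); the function name and the string labels are hypothetical.

```python
# Illustrative sketch only: threshold from the text, names are hypothetical.
SHORT_MESSAGE_MAX = 1024  # bytes; largest message sent by programmed I/O


def transfer_mode(size_bytes):
    """Choose the host-to-NIC transfer mode for a message of size_bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "pio" if size_bytes <= SHORT_MESSAGE_MAX else "dma"
```

Since the sustained DMA bandwidth is higher than the programmed I/O one, the threshold presumably reflects fixed DMA start-up costs, which the figures above do not quantify.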
Regarding address translation, using a pre-locked and memory-translated buffer pool seems to make sense only for buffer sizes < 4 KB. When a process requests buffer locking and translation on the fly, the time spent in system calls is negligible compared with the time spent writing the page table into NIC memory. We have therefore measured, on one side, the cost of transferring the page table by programmed I/O and, on the other, the cost of a memory copy. On the Dell machine we observed an average memory bandwidth of 700 MB/s, so a 4 KB memory copy costs about 5.5 µs. Locking and translating a memory page takes 1.5 µs, and writing the related Descriptor (16 bytes) into the Context Region takes 1 µs, so the whole operation costs 2.5 µs. As the buffer size increases, this performance difference becomes more significant, because for every 4 KB that would otherwise be copied only 16 bytes must cross the I/O bus. Moreover, as the number of Descriptors increases, PCI throughput approaches its sustained programmed I/O rate. System call overhead is not significant when the number of pages to be locked and translated is greater than 2. For a 2 MB buffer we found that a memory copy costs 2857 µs, versus 64 µs for locking, translating and transferring the Descriptors into the corresponding Context Region. When the buffer size is less than 4 KB, instead, locking and translating on the fly always costs the full 2.5 µs, while the memory copy is paid only for the actual buffer size: 2.8 µs for a 2 KB buffer, 2 µs for a 1.5 KB buffer and 1.4 µs for a 1 KB buffer.
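The copy costs quoted above follow from the measured memory bandwidth. The sketch below reproduces them under a simple linear model, assuming binary units (1 KB = 1024 bytes, 1 MB = 1024 KB), and compares them with the fixed single-page cost of locking and translating on the fly. The names are hypothetical, and the comparison is only meaningful for buffers that fit in one 4 KB page, since the text gives no per-page model for larger buffers.

```python
# Illustrative cost model only: figures from the text, names are hypothetical.
KB = 1024
MB = 1024 * KB

MEMORY_BANDWIDTH = 700 * MB   # bytes per second, measured average
LOCK_TRANSLATE_US = 1.5       # lock and translate one memory page
DESCRIPTOR_WRITE_US = 1.0     # write one 16-byte Descriptor to the NIC
PAGE_SIZE = 4 * KB


def copy_cost_us(size_bytes):
    """Time to copy size_bytes at the measured memory bandwidth, in µs."""
    return size_bytes / MEMORY_BANDWIDTH * 1e6


def single_page_lock_cost_us():
    """Fixed cost of locking, translating and describing one page, in µs."""
    return LOCK_TRANSLATE_US + DESCRIPTOR_WRITE_US


def cheaper_for_single_page(size_bytes):
    """Return 'copy' or 'lock' for a buffer no larger than one page."""
    if not 0 <= size_bytes <= PAGE_SIZE:
        raise ValueError("model covers buffers of at most one page")
    if copy_cost_us(size_bytes) < single_page_lock_cost_us():
        return "copy"
    return "lock"
```

Under this model a 1 KB buffer is cheaper to copy (about 1.4 µs against 2.5 µs), while for a 2 KB buffer the copy (about 2.8 µs) already costs slightly more than locking and translating.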
Our first evaluation tests of the QNIX communication system showed about 3 µs one-way latency for zero-payload packets and 180 MB/s bandwidth for message sizes ≥ 4 KB. Here, by one-way latency we mean the time from when the sender process posts the Send command in its Command Queue until the destination NIC control program sets the corresponding Receive Doorbell for the receiving process. This value was calculated by adding the cost of posting the Send command (1 µs), the cost for the source NIC control program to prepare the packet header (0.5 µs), the estimated NIC-to-NIC latency (0.2 µs) and the cost for the destination NIC control program to set, via DMA, the corresponding Receive Doorbell for the receiver process (1 µs). Asymptotic payload bandwidth, instead, is the user payload injected into the network per unit of time. We obtained 180 MB/s by measuring the bandwidth that is lost to the time the NIC control program spends in its internal operations. Considering that the expected peak bandwidth of the QNIX network interface is about 200 MB/s, our communication system is able to deliver up to 90% of the available bandwidth to user applications.
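The latency figure is a sum of four components and the efficiency figure is a ratio of two bandwidths. The following sketch simply restates that arithmetic with the values given above; the dictionary keys and function names are chosen here for illustration.

```python
# Illustrative arithmetic only: figures from the text, names are hypothetical.
LATENCY_COMPONENTS_US = {
    "post Send command": 1.0,
    "prepare packet header": 0.5,
    "NIC-to-NIC latency (estimated)": 0.2,
    "set Receive Doorbell via DMA": 1.0,
}

PAYLOAD_BANDWIDTH_MBS = 180.0   # asymptotic payload bandwidth
PEAK_BANDWIDTH_MBS = 200.0      # expected peak of the network interface


def one_way_latency_us(components=LATENCY_COMPONENTS_US):
    """Sum of the per-stage costs for a zero-payload packet, in µs."""
    return sum(components.values())


def bandwidth_efficiency(payload=PAYLOAD_BANDWIDTH_MBS, peak=PEAK_BANDWIDTH_MBS):
    """Fraction of the peak bandwidth delivered to user applications."""
    return payload / peak
```

The components add up to 2.7 µs, which the text rounds to about 3 µs, and the ratio of 180 MB/s to 200 MB/s gives the quoted 90%.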