Project summary



Date: 13.06.2017

Introduction


The trend in computer architecture for the foreseeable future is clear: microprocessor designers are using copious silicon resources to integrate more and more processor cores onto a chip. In fact, within the next ten years, general-purpose multicore processors will likely contain 1,000 cores or more. While this path towards ever-increasing parallelism will theoretically enable massive performance, application developers are unlikely to be able to harness this potential unless drastic improvements to programming are made [60]. Current approaches to multicore programming are barely manageable for multicores with two or four cores, and they certainly will not scale to massive amounts of parallelism. A new architectural mechanism, fast on-chip broadcast enabled by novel optical technology, will revolutionize the programmability of future multicore processors.

While parallel programming used to be something of a black art reserved for the handful of rocket scientists who programmed supercomputers and clusters, multicore's imminent rule will require most programmers to implement parallel applications. However, by incorporating powerful hardware and architectural mechanisms, such as a fast broadcast, and by empowering programmers with the right interfaces to the underlying architecture via APIs and language facilities, all programmers will be able to efficiently construct programs that exploit multicore's power.

The broadcast primitive, whereby one core in a parallel computer system communicates some data to every other core (or some subset of the cores), is powerful and straightforward to use. Parallel algorithms often use broadcast to achieve synchronization and to communicate sentinel values or data values. The popular SMP (symmetric multiprocessing) computational model can also use a cheap, scalable broadcast to scale beyond a small handful of cores. However, broadcast operations can be expensive. Even a state-of-the-art electrical mesh network for a future multicore with 1,000 cores would require hundreds of cycles to broadcast a single value to all cores. Such a latency is large enough that programmers must still use broadcast judiciously, or not at all. In fact, programmers often implement otherwise straightforward algorithms in complicated ways to work around performance bottlenecks such as slow broadcasts. As an example, MPI has a broadcast feature, but it is rarely used for this reason. With an essentially "free" broadcast, however, programming parallel systems would be hugely simplified, as programmers could use broadcast freely.
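To make the mesh latency figure concrete, the following is a rough back-of-the-envelope sketch, not a model taken from this proposal. It assumes a square 2-D mesh, a broadcast that propagates hop by hop along mesh links, and a hypothetical per-hop cost of 2 cycles (router plus link). The function name `mesh_broadcast_cycles` and its parameters are illustrative only.

```python
import math


def mesh_broadcast_cycles(num_cores, cycles_per_hop=2, source="corner"):
    """Estimate cycles for a broadcast to reach the farthest core of a square mesh.

    cycles_per_hop is an assumed value, not a measured one.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be positive")
    # Smallest square mesh that holds num_cores tiles.
    side = math.isqrt(num_cores - 1) + 1
    if source == "corner":
        # Farthest tile is the opposite corner (Manhattan distance).
        hops = 2 * (side - 1)
    elif source == "center":
        # Farthest tile is the corner most distant from the middle.
        hops = 2 * (side // 2)
    else:
        raise ValueError("source must be 'corner' or 'center'")
    return hops * cycles_per_hop


if __name__ == "__main__":
    for n in (4, 64, 1000):
        print(n, mesh_broadcast_cycles(n), mesh_broadcast_cycles(n, source="center"))
```

Under these assumptions a 1,000-core chip maps onto a 32 x 32 mesh, so a corner-sourced broadcast crosses 62 hops. Any per-hop cost of two cycles or more then puts the latency above 100 cycles, before contention is considered.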

Why do this research now? There are two major reasons. First, multicores have recently become mainstream and are facing a parallel programming crisis [60], so bold architectural changes are warranted. Second, recent breakthroughs in CMOS integration of nanophotonic components [25] provide the enabling technology to make the broadcast mechanism viable. Our photonic technology uses planar lightguide circuits, or PLCs, with wavelength division multiplexing (WDM). CMOS silicon offers all of the information capacity advantages of fiber and the precision planar processing of PLCs, with the additional advantage of dense integration on a platform that is compatible with electronic integrated circuits. The basis of this dense integration capability (~10^6 photonic devices per cm^2) is high index contrast. While fiber and PLCs typically utilize a core/cladding refractive index ratio of <0.01, the Si/SiO2 ratio is ~2.33. This design paradigm, named high index contrast (HIC), provides strong confinement of light in small volumes, such as in on-chip planar waveguides. Conventional optical devices utilize low index contrast, which is compatible with silicon circuits neither in size nor in terms of CMOS processes. Silicon waveguides, with a layer thickness and width of 200 x 500 nm, are similar in size to upper-level metal interconnects, but have much higher information-carrying capacity and no electromagnetic interference (EMI).

MIT's microphotonics group (led by co-PI Prof. Kimerling) has designed, implemented, and demonstrated HIC devices within a CMOS process flow. Their fabrication efforts are funded under DARPA's EPIC program.

Our collaborative effort for this cross-discipline proposal focuses on four areas: designing computer architectures and programming models for the novel broadcast interconnect enabled by the on-chip optical technology; researching the degree to which an efficient broadcast mechanism eases multicore programming; researching the partitioning of function between the optical and electrical-digital domains; and designing interfaces and models for the optical components to facilitate their incorporation in computer systems.

ATAC creates a fundamental shift in multicore computing: it uses an electronic mesh for short-range intercore communication and a broadcast optical network optimized for global communication. Our early results indicate that the ATAC approach will simplify multicore programming significantly, that it is scalable to 4000 cores/chip, and that it can also ease the off-chip memory bandwidth bottleneck by extending the optical network off-chip.

Overview of the ATAC Approach


As displayed in Figure 2, the proposed ATAC microprocessor is constructed in a 2-D tile layout of computing cores, each containing data and instruction caches, communication hardware, and computational resources. The cores are interconnected via an electrical mesh network for near-neighbor communication, as well as an optical broadcast network for global communication. The optical broadcast network can be thought of as a global bus whereby every core can communicate with every other core in a few cycles. However, unlike a standard electrical bus, the optical broadcast network is a global communication channel that is scalable to thousands of cores. Indeed, the ATAC architecture is being designed to scale to four thousand cores or more.

Figure 2: High-level view of ATAC composition consisting of a 2-D array of tiles interconnected by both an electrical mesh network and an optical broadcast network. A conceptual view of the broadcast network is shown here. The practical implementation using a set of rings is described later.

At a high level, ATAC overlays a standard multicore processor that has an on-chip 2-D mesh network (e.g., Raw, TRIPS) with an optical broadcast network. Some applications require significant near-neighbor communication due to spatial locality inherent in the algorithm. However, many algorithms, such as search, are more easily coded if they can use global communication to broadcast the current best value. SMP architectures also benefit from an efficient broadcast because they can quickly invalidate copies of data that are cached in multiple caches. The ATAC approach is to blend the best of both electrical mesh and optical broadcast networks in a way that yields good performance and programming ease.
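The search case can be illustrated with a minimal sketch. This is a sequential Python simulation, not ATAC code, and every name in it (`Core`, `broadcast`, `search_with_broadcast`) is hypothetical. It assumes each core holds a list of candidates given as (lower bound, cost) pairs with lower bound <= cost, and that a broadcast reaches every core before its next step, which is the behavior a cheap global broadcast is meant to provide.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """A simulated core with its local share of the search space."""
    core_id: int
    candidates: list              # (lower_bound, cost) pairs
    best_known: float = float("inf")
    evaluated: int = 0            # number of full-cost evaluations performed


def broadcast(cores, value):
    """Deliver a newly found best value to every core (idealized, instantaneous)."""
    for core in cores:
        core.best_known = min(core.best_known, value)


def search_with_broadcast(partitions):
    """Minimize cost across all partitions, pruning with broadcast best values."""
    cores = [Core(i, list(p)) for i, p in enumerate(partitions)]
    step = 0
    while any(step < len(c.candidates) for c in cores):
        for core in cores:
            if step >= len(core.candidates):
                continue
            lower_bound, cost = core.candidates[step]
            if lower_bound >= core.best_known:
                continue  # pruned using a value some core broadcast earlier
            core.evaluated += 1
            if cost < core.best_known:
                broadcast(cores, cost)
        step += 1
    return cores[0].best_known, sum(c.evaluated for c in cores)


if __name__ == "__main__":
    work = [[(1, 5), (6, 9)], [(2, 3), (4, 8)], [(7, 7), (1, 2)]]
    print(search_with_broadcast(work))
```

The algorithm stays simple because each core just announces any improvement. With an expensive broadcast, a programmer would typically batch or restrict these announcements, trading extra evaluations and more complicated code for fewer global operations.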

Fast global communication will become increasingly important as multicores scale to hundreds or thousands of cores. It is estimated that it will take a future multicore with 1,000 cores at least 100 cycles to perform a global broadcast operation. However, ATAC will be able to perform such an operation on a 1000-core chip in 10 cycles or less. Given that many applications rely heavily on global communication, such benefits will allow for performance improvements of over 40x for some applications, as seen in preliminary performance results.

Perhaps more importantly, fast global communication will become essential to enable programmers to manage the arduous task of programming hundreds or thousands of cores. Programmers have typically used global communication operations sparingly, as such operations often impose a significant performance penalty. Accordingly, parallel programmers often decompose otherwise straightforward algorithms into much more complicated forms to minimize the amount of global communication necessary to implement the algorithm. Furthermore, programmers have to account for communication latencies that can vary widely depending on the distance between the sending and receiving cores. Not surprisingly, getting good performance on standard multicore systems can be incredibly challenging and can take a long time. On ATAC, programmers will be able to broadcast values at will, or use the underlying electrical mesh, without worrying about the typical performance impact of imprudent use of global communication operations.

The ATAC broadcast-and-select network optimizes power efficiency by sending the data to multiple places with little extra power consumption (to first order, the source modulator is the primary source of power consumption, and its consumption is independent of the number of receivers). The use of wavelength division multiplexing minimizes interconnect contention. ATAC is scalable to 4000 cores/chip and also to multi-chip and multi-box architectures. The ATAC solution is also monolithically integrated into standard CMOS microprocessor fabrication processes, thereby improving its performance/cost benefit.


