
4.2 Memory Unbound


The DRAM industry has dramatically lowered the price per gigabyte over the decades, from $10,000,000 per gigabyte in 1980 to about $100 per gigabyte today [Hennessy and Patterson 2006]. Alas, as mentioned in CW #8 in Section 2, the number of processor cycles it takes to access main memory has grown dramatically as well, from a few cycles in 1980 to hundreds today. Moreover, the memory wall is the major obstacle to good performance for many dwarfs.
The good news is that if we look inside a DRAM chip, we see many independent, wide memory blocks [Patterson et al 1997]. For example, a 1-Gbit DRAM is composed of hundreds of banks, each thousands of bits wide. Clearly, there is potentially tremendous bandwidth inside a DRAM chip waiting to be tapped, and the memory latency inside a DRAM chip is obviously much better than that of separate chips reached across an interconnect.
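To get a feel for the magnitude (the bank count, width, and cycle time below are purely illustrative assumptions, and the result ignores power and addressing constraints), suppose a 1-Gbit part had 512 banks, each 8,192 bits wide, with a 10 ns internal access cycle:

    512 banks × 8,192 bits ≈ 4 Mbit ≈ 0.5 MB read per internal access
    0.5 MB every 10 ns ≈ 50 TB/s of aggregate internal bandwidth

That loose upper bound is orders of magnitude beyond the few GB/s available at the pins of a commodity DRAM of the same era.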
Although we cannot avoid global communication in the general case with thousands of processors, some important classes of computation perform almost all of their accesses locally and hence can benefit from innovative memory designs. Independent task parallelism is one example (see Section 5.3).
Hence, in creating a new foundation for parallel computing hardware, we shouldn’t limit innovation by assuming that main memory must be in separate DRAM chips connected by standard interfaces.
Another reason to innovate in memory is that increasingly, the cost of hardware is shifting from processing to memory. The old Amdahl rule of thumb was that a balanced computer system needs about 1 MB of main memory capacity per MIPS of processor performance [Hennessy and Patterson 2006]. Manycore designs will unleash a much higher number of MIPS in a single chip, which suggests a much larger fraction of silicon will be dedicated to memory in the future.
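As a rough illustration of what that rule of thumb implies for manycore (the core count and per-core rate are assumptions made only for the sake of the arithmetic, not projections):

    128 cores × 4,000 MIPS per core ≈ 512,000 MIPS
    512,000 MIPS × 1 MB per MIPS ≈ 512 GB of main memory

an amount of memory whose cost would far exceed that of the processor chip itself.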

4.3 Interconnection Networks



Initially, applications are likely to treat multicore and manycore chips simply as SMPs. However, multicore chips offer unique features that are fundamentally different from those of SMPs, and that should therefore open up some significant opportunities to exploit those capabilities:



  • The inter-core bandwidth of a multicore chip, measured relative to its clock rate, is significantly higher than is typical for most SMPs. On chip, cores can communicate with one another at bandwidths comparable to that of a CPU core itself (a roughly 1:1 ratio), whereas conventional SMPs must make do with far lower inter-processor bandwidths.

  • Inter-core latencies are far lower than is typical for an SMP system (by at least an order of magnitude).

  • Multicore chips are likely to offer lightweight synchronization (fence) operations that refer only to the memory consistency state on the chip. The semantics of these fences are very different from those we are used to on SMPs.

If we simply treat multicore chips as SMPs (or worse yet, as just more processors for MPI applications), then we may miss some very interesting opportunities for algorithm designs that can better exploit those features.


Currently, we rely on conventional SMP cache-coherence protocols to coordinate between cores. Such protocols may be too strict in their enforcement of the ordering of operations presented to the memory subsystem, and they may preclude alternative approaches to parallel computation that are more computationally efficient and can better exploit the unique inter-processor communication capabilities of manycore chips.
For example, mutual exclusion locks and barriers on SMPs are typically implemented using spin-waits that constantly poll a memory location until the lock token becomes available. This approach floods the SMP coherence network with redundant polling requests and wastes power, since the CPU cores are engaged in busy work. The alternative to spin-locks is to use hardware interrupts, which tends to increase response latency because of the overhead of the context switch and, in some cases, the additional overhead of having the OS mediate the interrupt. Hardware support for lightweight synchronization constructs and mutual exclusion on the manycore chip will be essential to exploit the much lower-latency links available on chip.
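To make the polling cost concrete, the following is a minimal test-and-set spin-lock written with portable C11 atomics; it is a generic sketch of the busy-wait technique described above, not the synchronization primitive of any particular chip.

    #include <stdatomic.h>

    /* A minimal test-and-set spin-lock. Each failed acquire attempt re-reads
       and re-writes the lock word, so a waiting core generates coherence
       traffic on every polling iteration while doing no useful work. */
    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void lock_acquire(void) {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;  /* spin (busy-wait) */
    }

    static void lock_release(void) {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }

Every iteration of the while loop is an atomic read-modify-write on a shared cache line, which is exactly the redundant coherence traffic that lightweight on-chip synchronization hardware could avoid.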
An even more aggressive approach is to move to a transactional model for memory consistency management. The transactional model enables non-blocking synchronization (no stalls on mutex locks or barriers) [Rajwar 2002]. The transactional model (TM) can be used together with shared-memory coherence, or, in the extreme case, a Transactional Coherence & Consistency (TCC) model [Kozyrakis 2005] can be applied globally as a substitute for conventional cache-coherence protocols. These mechanisms must capture the parallelism in applications and allow the programmer to manage locality, communication, and fault recovery independently of system scale or other hardware specifics. Rather than depending on a mutex to prevent potential conflicts in accesses to particular memory locations, the transactional model commits changes in an atomic fashion; if a read/write conflict is discovered during the commit phase, the computation is rolled back and recomputed. In this way, the computation becomes incidental to communication.
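The flavor of this optimistic style can be seen, even without transactional hardware, in a compare-and-swap retry loop; the sketch below is a lock-free update in C11 atomics, not an implementation of TM or TCC. The new value is computed speculatively and committed only if no other core has modified the location in the meantime; otherwise the work is redone instead of blocking on a lock.

    #include <stdatomic.h>

    /* Optimistic, non-blocking update of a shared counter: compute the new
       value speculatively, then try to commit it with compare-and-swap. If
       another core changed the location first, recompute and retry rather
       than blocking on a lock. */
    static _Atomic long shared_counter;

    static void add_to_counter(long delta) {
        long observed = atomic_load_explicit(&shared_counter, memory_order_relaxed);
        long desired;
        do {
            desired = observed + delta;   /* speculative computation */
        } while (!atomic_compare_exchange_weak_explicit(
                     &shared_counter, &observed, desired,
                     memory_order_acq_rel, memory_order_relaxed));
    }

Hardware TM generalizes this pattern from a single word to arbitrary regions of memory touched inside a transaction.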
From a hardware implementation standpoint, multicore chips have initially employed buses or crossbar switches to interconnect the cores, but such solutions do not scale to thousands of cores. We need to build and utilize network topologies whose costs scale linearly with system size, to prevent the complexity of the interconnect from growing without bound and from dominating the overall cost of manycore systems. Scalable on-chip communication may require interconnect concepts already familiar from inter-node communication, such as packet-switched networks [Dally 2001]. Already, chip implementations such as the STI Cell employ multiple ring networks to interconnect the nine processors on the chip, and use software-managed memory rather than conventional cache-coherence protocols to communicate between the cores. We may look to methods like the transactional model to enable more scalable hardware mechanisms for maintaining coherence and consistency, or even to messaging models similar to the messaging layers used on large-scale cluster computing systems.
While there has been research into statistical traffic models to help refine the design of networks-on-chip (NoCs) [Soteriou 2006], we believe the 7+3 dwarfs can provide even more useful insights into communication topology and resource requirements for a broad array of applications. Based on studies of the communication requirements of existing massively concurrent scientific applications that cover the full range of “dwarfs” [Vetter and McCracken 2001] [Vetter and Yoo 2002] [Vetter and Mueller 2002] [Kamil et al 2005], we make the following observations about those communication requirements, with the aim of developing a more efficient and custom-tailored interconnect solution:

  • The collective communication requirements are strongly differentiated from the point-to-point requirements. Since latency is likely to improve much more slowly than bandwidth (see CW #6 in Section 2), this separation of concerns suggests adding a separate latency-oriented network dedicated to the collectives [Hillis and Tucker 1993] [Scott 1996]. As a recent example at large scale, the IBM BlueGene/L has a “Tree” network for collectives in addition to a higher-bandwidth “Torus” interconnect for point-to-point messages. Such an approach may be beneficial for chip implementations that employ thousands of cores.

  • With the exception of the 3D FFT, the point-to-point messaging requirements tend to exhibit a low degree of connectivity, thereby utilizing only a fraction of the available communication paths through a fully-connected network such as a crossbar or fat tree. Only 5% to 25% of the available “paths” or “wires” in a fully-connected interconnect topology are used by a typical application. For on-chip interconnects, a non-blocking crossbar will likely be overdesigned for most application requirements and would otherwise be a waste of silicon, given that its resource requirements scale with the square of the number of interconnected processor cores (see the sketch after this list). For applications that do not exhibit the communication patterns of the “spectral” dwarf, a lower-degree topology for on-chip interconnects may prove more space- and power-efficient.

  • The message sizes of most point-to-point transfers are typically large enough that they remain strongly bandwidth-bound, even for on-chip interconnects. Therefore, each point-to-point message needs a dedicated pathway through the interconnect to minimize the opportunity for contention within the network fabric. So while the communication topology does not require a non-blocking crossbar, an alternative approach would still need to provision a unique pathway for each message, for example by carefully mapping the communication topology onto the on-chip interconnect topology.

  • Despite the low topological degree of connectivity, the communication patterns are not isomorphic to a low-degree, fixed-topology interconnect such as a torus, mesh, or hypercube. Therefore, assigning a dedicated path to each point-to-point message transfer is not solved trivially by any given fixed-degree interconnect topology.

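To make the crossbar’s quadratic cost growth concrete (the core counts below are arbitrary illustrative values), a full crossbar among N cores needs on the order of N² crosspoints, whereas a 2D mesh needs only about 2N links:

    N = 64:     crossbar ~ 64² = 4,096 crosspoints;        2D mesh ~ 128 links
    N = 1,024:  crossbar ~ 1,024² ≈ 1,048,576 crosspoints; 2D mesh ~ 2,048 links

The gap widens linearly with core count, which is why a full crossbar quickly becomes a poor use of silicon for applications that exercise only a small fraction of its paths.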
The communication patterns observed thus far are very closely related to the underlying communication/computation patterns. The relatively small set of dwarfs suggests that interconnect reconfiguration may need to target only a limited set of communication patterns. It also suggests that the programming model should provide higher-level abstractions for describing those patterns.


One can use less-expensive, less-complex circuit switches to provision dedicated wires that let the interconnect adapt to the communication topology of the application at runtime, and even to emulate many different interconnect topologies that would otherwise require fat-tree networks [Kamil et al 2005, Shalf et al 2005]. The topology can be incrementally adjusted to match the communication requirements of a code as it runs. There are also considerable research opportunities in instrumenting codes to infer their communication topology requirements at compile time, or in applying auto-tuners (see Section 5.1) to the task of inferring an optimal interconnect topology and communication schedule.
Therefore, a hybrid approach to on-chip interconnects that employs both active switching and passive circuit-switched elements has the potential to reduce the wiring complexity and cost of the interconnect by eliminating unused circuit paths and switching capacity through custom runtime reconfiguration of the interconnect topology.
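As a toy illustration of how such an auto-tuning pass might work (entirely hypothetical; the core count, circuit budget, and traffic volumes are made-up values, and this is not a description of any existing tool), the sketch below greedily dedicates a small pool of circuit-switched paths to the heaviest communicating core pairs in a measured traffic matrix, leaving the remaining light traffic to a packet-switched network.

    #include <stdio.h>

    #define NCORES    8     /* illustrative core count */
    #define NCIRCUITS 4     /* illustrative number of circuit-switched paths */

    /* Greedily dedicate the available circuit-switched paths to the heaviest
       communicating (src, dst) pairs in a traffic matrix; any remaining
       traffic would fall back to the packet-switched network. */
    static void assign_circuits(long traffic[NCORES][NCORES]) {
        int used[NCORES][NCORES] = {{0}};
        for (int c = 0; c < NCIRCUITS; c++) {
            int best_s = -1, best_d = -1;
            long best = 0;
            for (int s = 0; s < NCORES; s++)
                for (int d = 0; d < NCORES; d++)
                    if (s != d && !used[s][d] && traffic[s][d] > best) {
                        best = traffic[s][d];
                        best_s = s;
                        best_d = d;
                    }
            if (best_s < 0)
                break;                      /* no nonzero traffic left */
            used[best_s][best_d] = 1;
            printf("circuit %d: core %d -> core %d (%ld bytes)\n",
                   c, best_s, best_d, best);
        }
    }

    int main(void) {
        long traffic[NCORES][NCORES] = {{0}};
        traffic[0][1] = 1000000;            /* made-up per-pair traffic volumes */
        traffic[2][3] = 800000;
        traffic[6][7] = 600000;
        traffic[4][5] = 50000;
        assign_circuits(traffic);
        return 0;
    }

In practice the communication graph would come from profiling or compile-time analysis, and the assignment would also have to respect the physical layout of the on-chip wires.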

