Krste Asanović, Rastislav Bodik, Bryan Catanzaro, Joseph Gebis,
Parry Husbands, Kurt Keutzer, David Patterson,
William Plishker, John Shalf, Samuel Williams, and Katherine Yelick

EECS Technical Report <>
August 15, 2006
Abstract
The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation.
A multidisciplinary group of Berkeley researchers met for 16 months to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work for 2-way and 4-way designs, but is likely to face diminishing returns at 16-way and 32-way, just as returns on greater instruction-level parallelism hit a wall.
We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following:
The target should be 1000s of cores per chip, as this hardware is the most efficient in MIPS per watt, MIPS per area of silicon, and MIPS per development dollar.
Instead of traditional benchmarks, use 7+ “dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication; a small illustrative sketch follows this list.)
“Autotuners” should play a larger role than conventional compilers in translating parallel programs (a toy example appears at the end of this abstract).
To maximize programmer productivity, programming models should be independent of the number of processors.
To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: data-level parallelism, independent task parallelism, and instruction-level parallelism.
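To make the dwarf definition concrete, consider sparse matrix-vector multiplication, the kernel at the heart of the sparse linear algebra dwarf discussed in Section 3. The C sketch below (array names are illustrative, not from any particular library) shows why a dwarf is a pattern rather than a program: the computation is a simple dot product per row, while the defining communication pattern is the irregular, indirect access to the source vector.

/* Illustrative sketch only: sparse matrix-vector multiply (y = A*x) in
 * compressed sparse row (CSR) form, one instance of the sparse linear
 * algebra dwarf. */
void spmv_csr(int nrows,
              const int *row_ptr,   /* nonzeros of row i live at indices row_ptr[i]..row_ptr[i+1]-1 */
              const int *col_idx,   /* column index of each stored nonzero */
              const double *val,    /* value of each stored nonzero */
              const double *x,      /* dense input vector */
              double *y)            /* dense output vector */
{
    for (int i = 0; i < nrows; i++) {        /* rows are independent: data-level parallelism */
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* indirect, irregular reads of x: the
                                                communication pattern that defines the dwarf */
        y[i] = sum;
    }
}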
Since real-world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.
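The toy autotuner below illustrates the idea behind empirical tuning systems such as ATLAS and FFTW: instead of trusting a compiler’s static performance model, generate several variants of a kernel, time each on the actual machine, and keep the fastest. The kernel and its candidate block sizes here are hypothetical stand-ins chosen only to show the search loop.

/* Illustrative sketch only: the empirical search at the core of an autotuner. */
#include <stdio.h>
#include <time.h>

#define N 1024
static double a[N], b[N];

/* Hypothetical kernel parameterized by a tuning knob (block factor). */
static void scale_blocked(int block)
{
    for (int i = 0; i < N; i += block)
        for (int j = i; j < i + block && j < N; j++)
            a[j] = 2.0 * b[j];
}

int main(void)
{
    const int candidates[] = { 1, 4, 16, 64, 256 };
    int best = candidates[0];
    double best_time = 1e30;

    for (int c = 0; c < 5; c++) {
        clock_t start = clock();
        for (int rep = 0; rep < 10000; rep++)     /* repeat to get a stable measurement */
            scale_blocked(candidates[c]);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed < best_time) { best_time = elapsed; best = candidates[c]; }
    }
    printf("selected block size: %d\n", best);    /* the variant to use in production */
    return 0;
}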
1.0 Introduction
The computing industry changed course in 2005 when Intel followed the lead of IBM’s Power 4 and Sun Microsystems’ Niagara processor in announcing that its high performance microprocessors would henceforth rely on multiple processors or cores. The new industry buzzword “multicore” captures the plan of doubling the number of standard cores per die with every semiconductor process generation. Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance. Hence, multicore is unlikely to be the ideal answer.
A diverse group of U.C. Berkeley researchers from many backgrounds—circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis—met between February 2005 and June 2006 to discuss parallelism from these many angles. We intended to borrow the good ideas regarding parallelism from different disciplines, and this report is the result. We concluded that sneaking up on the problem of parallelism via multicore solutions was likely to fail, and that we desperately need a new solution for parallel hardware and software.
Figure 1 shows our seven critical questions for parallel computing. We don’t claim to have the answers in this report, but we do offer non-conventional and provocative perspectives on some questions and state seemingly obvious but sometimes neglected perspectives on others.
Compatibility with old binaries and C programs is valuable to industry, so some researchers are trying to help multicore product plans succeed. We’ve been thinking bolder thoughts, however. Our aim is thousands of processors per chip for new applications, and we welcome new programming models and new architectures if they simplify the efficient programming of such highly parallel systems. Rather than multicore, we are focused on “manycore.” Successful manycore architectures and supporting software technologies could reset microprocessor hardware and software roadmaps for the next 30 years.
Note that there is a tension between embedded and high performance computing, which surfaced in many of our discussions. We argue that these two ends of the computing spectrum have more in common looking forward than they did in the past. First, both are concerned with power, whether it’s battery life for cell phones or the cost of electricity and cooling in a data center. Second, both are concerned with hardware utilization. Embedded systems are always sensitive to cost, but you also need to use hardware efficiently when you spend $10M to $100M for high-end servers. Third, as the size of embedded software increases over time, the fraction of hand tuning must be limited and so the importance of software reuse must increase. Fourth, since both embedded systems and high-end servers now connect to networks, both need to prevent unwanted accesses and viruses. Thus, the need is increasing for some form of operating system in embedded systems, for protection as well as for resource sharing and scheduling.
Perhaps the biggest difference between the two targets is the traditional emphasis on real-time computing in embedded, where the computer and the program need to be just fast enough to meet the deadlines, and there is no benefit to running faster. Running faster is almost always valuable in server computing. As server applications become more media-oriented, real time may become more important for server computing as well.
This report borrows many ideas from both embedded and high performance computing. While we are not sure it can be accomplished, it would be desirable if common programming models and architectures worked for both the embedded and server communities.
The organization of the report follows the seven questions of Figure 1. Section 2 documents the reasons for the switch to parallel computing by providing a number of guiding principles. Section 3 reviews the left tower in Figure 1, which represents the new applications for parallelism. It describes the “seven dwarfs”, which we believe will be the computational kernels of many future applications. Section 4 reviews the right tower, which is hardware for parallelism, and we separate the discussion into the classical categories of processor, memory, and switch. Section 5 covers programming models and other systems software, which is the bridge that connects the two towers in Figure 1. Section 6 discusses measures of success and describes a new hardware vehicle for exploring parallel computing. We conclude with a summary of our perspectives. Given the breadth of topics we address in the report, we provide about 75 references for readers interested in learning more.