The Landscape of Seven Questions and Seven Dwarfs for Parallel Computing Research: A View from Berkeley




4.0 Hardware


Now that we have given our views of applications and dwarfs for parallel computing in the left tower of Figure 1, we are ready to examine the right tower. Section 2 describes the constraints of present and future semiconductor processes, but those processes also present many opportunities.
We split our observations on hardware into three components first used to describe computers more than 30 years ago: processor, memory, and switch [Bell and Newell 1970].

4.1 Processors: Small is Beautiful


In the development of many modern technologies, such as steel manufacturing, there were prolonged periods during which bigger equated to better. These periods are easy to identify: the demonstration of one tour de force of engineering was superseded only by an even greater one. In time, due to diminishing economies of scale or other economic factors, the development of these technologies inevitably hit an inflection point that forever changed their course. We believe that the development of general-purpose microprocessors is hitting just such an inflection point. New Wisdom #4 in Section 2 states that the size of module that we can successfully design and fabricate is shrinking. New Wisdoms #1 and #2 in Section 2 state that power is proving to be the dominant constraint of present and future generations of processing elements. To support these assertions we note that several next-generation processors, such as the Tejas Pentium 4 processor from Intel, were canceled or redefined due to power consumption issues [Wolfe 2004]. Even representatives from Intel, a company generally associated with the “higher clock-speed is better” position, have warned that traditional approaches to maximizing performance through maximizing clock speed have been pushed to their limit [Borkar 1999] and [Gelsinger 2001]. In this section we look past the inflection point to ask: What processor is the best building block with which to build future multiprocessor systems?
There are numerous advantages to building future microprocessor systems out of smaller processor building blocks:

  • Parallelism is a power-efficient way to achieve performance [Chandrakasan et al 1992].

  • A larger number of smaller processing elements allows fine-grained dynamic voltage scaling and power-down. Each processing element is easy to control through both software and hardware mechanisms.

  • A small processing element is an economical element that is easy to shut down in the face of catastrophic defects and easier to reconfigure in the face of large parametric variation. The Cisco Metro chip [Eatherton 2005] adds four redundant processors to each die, and Sun sells 4-processor, 6-processor, or 8-processor versions of Niagara based on the yield of a single 8-processor design. Graphics processors are also reported to be using redundant processors in this way.

  • A small processing element with a simple architecture is easier to functionally verify. In particular it is more amenable to formal verification techniques than complex architectures with out-of-order execution.

  • Smaller hardware modules are individually more power efficient, and their performance and power characteristics are easier to predict within existing electronic design-automation flows [Sylvester and Keutzer 1998] [Sylvester and Keutzer 2001] [Sylvester, Jiang, and Keutzer 1999].

While the above arguments indicate that we should look to smaller processor architectures for our basic building block, they do not indicate precisely what circuit size or processor architecture will serve us best. We noted above that at certain inflection points the development of a technology must move away from a simplistic “bigger is better” approach; the greater challenge associated with these inflection points is that the objective forever changes from maximizing a single variable (e.g. clock speed) to a complex multi-variable optimization. In short, while bigger, faster processors no longer imply “better” processors, neither does “small is beautiful” imply “smallest is best.”

We return, then, to the question: What is the natural hardware building block for future multiprocessor systems? To address it, we first consider the physical constraints of semiconductor processing and then examine the architectural implications of those constraints.

4.1.1 What are we optimizing?


In the multi-variable optimization problem associated with determining the best processor building blocks of the future, it is clear that power will be a key constraint. Power dissipation is a critical element of system cost because packaging and cooling costs rise as a steep step function of the amount of power to be dissipated. The precise value of the maximum power that can be dissipated by a multiprocessor system (e.g. 1W, 10W, 100W) is application dependent; consumer applications are certain to be more cost and power sensitive than server applications. However, we believe that power will be a fixed and demanding constraint across the entire spectrum of system applications for future multiprocessor systems.
If power is a key constraint in this multi-variable optimization, then we anticipate that energy per computation (e.g. energy per instruction or energy per benchmark) will be the dominant objective function to be minimized. When power is a limiting factor, rather than maximizing the speed with which a computation can be performed, it seems natural to minimize the overall amount of energy used to perform it. Minimizing energy also maximizes battery life, and longer battery life is a primary product feature in most of today’s mobile applications. Finally, while SPEC benchmarks [Spec 2006] have been the most common benchmarks for measuring computational and energy efficiency, we believe that future benchmark sets must evolve to reflect a more representative mix of applications, and the Dwarfs should influence this benchmark set.
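
As a concrete illustration of such an objective function, the short sketch below computes energy per benchmark run and energy per instruction from average power, runtime, and dynamic instruction count. The numbers are hypothetical, chosen only to make the arithmetic visible; they do not describe any real processor.

    # Hypothetical figures for illustration only (not measurements of any real processor).
    avg_power_watts = 2.0          # average power drawn while running the benchmark
    runtime_seconds = 50.0         # wall-clock time for one benchmark run
    instructions_executed = 100e9  # dynamic instruction count for the run

    energy_joules = avg_power_watts * runtime_seconds        # E = P * t
    energy_per_instruction = energy_joules / instructions_executed

    print(f"energy per benchmark run: {energy_joules:.1f} J")
    print(f"energy per instruction:   {energy_per_instruction * 1e9:.2f} nJ")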

In examining the physical constraints imposed by semiconductor processing, we look first at power. Power dissipation in an integrated circuit has both static and dynamic components. Static power dissipation is most directly related to leakage, which is in turn governed by the threshold voltage: the minimum voltage that must be reached for a transistor to turn on. The lower the threshold voltage of a circuit element, the higher its speed; however, static power dissipation grows exponentially as the threshold voltage decreases. Dynamic power is linearly related to the capacitance charged and discharged in circuit elements and wiring and to the clock and logical switching frequencies, but it grows quadratically with the supply voltage.
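
These relationships are captured by the standard textbook first-order CMOS power models (they are generic approximations, not results specific to the processes discussed here):

    P_{dynamic} = \alpha \, C \, V_{dd}^{2} \, f
    P_{static}  = V_{dd} \, I_{leak}, \qquad I_{leak} \propto e^{-V_{th} / (n \, v_{T})}

Here alpha is the switching activity factor, C the switched capacitance, f the clock frequency, V_dd the supply voltage, V_th the threshold voltage, n a process-dependent subthreshold slope factor, and v_T the thermal voltage.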


Leading-edge micro-architectures have been built from monolithic processing elements with increasingly deep pipelines, high clock frequencies, and relatively high supply voltages. “Relatively high” means that as processing geometries scaled (typically by a factor of 0.7 per generation), supply voltages were either not scaled at all (known as fixed-voltage scaling) or were scaled by smaller factors (known as general voltage scaling) [Rabaey et al 2003]. [Borkar 1999] and [Gelsinger 2001] warned that these approaches to maximizing clock speed had been pushed to their limit. As a result, the trend toward monolithic, complex processor micro-architectures is reversing. As we reverse this trend, it is natural to ask: Where do we stop? In other words, what micro-architecture for a processing element constitutes a power-speed sweet spot?
To examine rising interconnect delay in application-specific integrated circuits (ASICs), researchers examined a related question: What is a sweet spot in power and delay for modules in ASICs? Using a variety of process and circuit models, [Sylvester and Keutzer 1998] suggests that a block size of 50K gates is a natural sweet spot for integrated circuits. The authors later adapted their results to micro-architecture [Sylvester and Keutzer 2001]; while they stopped short of offering detailed suggestions on micro-architecture design, they noted that a 50K-gate block could accommodate a simple RISC processor or DSP processor core (without caches). The assumptions of these papers can still be tested with the tool BACPAC [Sylvester, Jiang, and Keutzer 1999].
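
To see why the voltage-scaling choices described above matter, the following first-order sketch contrasts ideal voltage scaling with fixed-voltage scaling across several hypothetical process generations. It uses only the classical 0.7x-per-generation factor, ignores leakage and wiring effects, and is illustrative rather than a model of any particular process.

    # First-order scaling sketch (illustrative only; ignores leakage and wire effects).
    # Per generation: linear dimensions x0.7, so switched capacitance x0.7,
    # gate delay x0.7 (frequency x1/0.7), and area of a fixed design x0.49.

    S = 0.7          # per-generation scaling factor
    generations = 5

    def power_density_trend(scale_voltage: bool):
        """Relative dynamic power density (C * V^2 * f per unit area) over generations."""
        c, v, f, area = 1.0, 1.0, 1.0, 1.0
        trend = [1.0]
        for _ in range(generations):
            c *= S                   # switched capacitance shrinks
            f /= S                   # shorter gate delays allow a higher frequency
            area *= S * S            # the same design occupies less area
            if scale_voltage:
                v *= S               # ideal (Dennard-style) voltage scaling
            trend.append((c * v * v * f) / area)
        return trend

    print("ideal voltage scaling :", [round(x, 2) for x in power_density_trend(True)])
    print("fixed-voltage scaling :", [round(x, 2) for x in power_density_trend(False)])

Under ideal scaling the power density stays constant; under fixed-voltage scaling it roughly doubles every generation, which is the power problem described above.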

4.1.2 What processing element is optimum?


One key point of this section is that determining the optimum processing element for a computation entails solving, or at least approximating the solution of, a multivariable optimization problem that depends on the application, the deployment environment, the workload, and the constraints of the target market. Nevertheless, we maintain that existing data are sufficient to indicate that simply using the semiconductor efficiencies supplied by Moore’s Law to replicate existing complex microprocessor architectures will not give an energy-efficient solution. In the following we indicate some general future directions for power- and energy-efficient architectures.

Increasing parallelism increases capacitance (in the logic of the parallel structures), which in turn increases dynamic power; however, [Chandrakasan et al 1992] pointed out that the linear increase in power due to parallelism can be more than compensated for by reducing the supply voltage. In other words, parallelism is a power-efficient way to get performance, which is CW #10 in Section 2.
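
The sketch below illustrates this argument with the first-order dynamic-power model: duplicating a functional unit and running both copies at half the clock rate keeps aggregate throughput constant, and because the lower rate permits a lower supply voltage, total power can drop even though switched capacitance doubles. The reduced voltage value is a hypothetical illustration, not a measurement.

    # First-order illustration of parallelism as a power-efficiency technique.
    # Dynamic power: P = C * Vdd^2 * f; throughput ~ number_of_units * f.

    def dynamic_power(c, vdd, f):
        return c * vdd**2 * f

    C, VDD, F = 1.0, 1.0, 1.0                  # normalized baseline unit
    baseline = dynamic_power(C, VDD, F)        # throughput = 1.0

    # Two copies at half the frequency: same aggregate throughput (2 * 0.5 = 1.0).
    # Running slower permits a lower supply voltage (hypothetical value below).
    VDD_LOW = 0.6
    parallel = 2 * dynamic_power(C, VDD_LOW, F / 2)

    print(f"baseline power : {baseline:.2f}")  # 1.00
    print(f"parallel power : {parallel:.2f}")  # 0.36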

The effect of microarchitecture on power and performance was studied in [Gonzalez and Horowitz 1996]. Using the power-delay product as a metric, the authors determined that simple pipelining is significantly beneficial to performance while only moderately increasing power. On the other hand, superscalar features adversely affected the power-delay product: the performance benefits did not outweigh the power overhead of the additional hardware. Because instruction-level parallelism is limited, microarchitectures that attempt to gain performance from techniques such as wide issue and speculative execution achieve modest increases in performance at the cost of significant power overhead.
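
To make the power-delay product criterion concrete, the sketch below evaluates it for two hypothetical design changes. The percentages are invented for illustration and are not taken from [Gonzalez and Horowitz 1996].

    # Power-delay product comparison with hypothetical numbers.
    # A microarchitectural feature "pays off" under this metric only if its
    # relative performance gain exceeds its relative power cost.

    def power_delay_product(power, delay):
        return power * delay

    base_power, base_delay = 1.0, 1.0

    # Hypothetical pipelining step: 40% faster for 15% more power.
    pipelined = power_delay_product(base_power * 1.15, base_delay / 1.40)

    # Hypothetical wide-issue/speculation step: 20% faster for 80% more power.
    superscalar = power_delay_product(base_power * 1.80, base_delay / 1.20)

    print(f"baseline    PDP: {power_delay_product(base_power, base_delay):.2f}")  # 1.00
    print(f"pipelined   PDP: {pipelined:.2f}")    # ~0.82 (improves)
    print(f"superscalar PDP: {superscalar:.2f}")  # ~1.50 (worsens)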


Analysis of empirical data on existing architectures gathered by Horowitz [Horowitz 2006], Paulin [Paulin 2006], and our own investigations indicates that shallow pipelines (5 to 9 stages) with in-order execution have proven to be the most power efficient.
Given these physical and microarchitectural factors, we believe that a sweet spot for the basic building block of future multiprocessors is a simple, modestly pipelined (5- to 9-stage) processing element of roughly 50K gates: simple CPUs, FPUs, vector, and SIMD processing elements. For applications with data and bit-level parallelism, we believe vector units and special-purpose hardware will serve power and performance well by requiring fewer operations to do the same computational work. Note that these constraints fly in the face of the old conventional wisdom of simplifying parallel programming by using the largest processors available so that fewer are needed, and this brings us to another important point of this section: such a significant reduction in the size and complexity of the basic processor building block means that many more cores can be economically implemented on a single die. Rather than the number of processors per die scaling with Moore’s Law starting from systems of two processors (i.e. 2, 4, 8, and 16), we can imagine systems ramping up at the same rate but starting from a significantly larger number of cores (128, 256, 512, and 1024). Research on programming environments for multiprocessor systems must be reinvigorated in order to enable the most power- and energy-efficient processor solutions of the future.
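
A rough dynamic instruction count makes the "fewer operations" point for data-parallel hardware concrete. The sketch below compares a scalar loop with a vector loop over the same data; the vector length and loop-overhead figures are hypothetical.

    # Rough instruction-count comparison for an N-element fused multiply-add,
    # illustrating why data-parallel hardware requires fewer operations.
    # The vector length and overhead figures are hypothetical.

    N = 4096                 # elements to process
    VECTOR_LENGTH = 8        # hypothetical SIMD width (elements per vector instruction)
    LOOP_OVERHEAD = 2        # hypothetical bookkeeping instructions per iteration

    scalar_instructions = N * (1 + LOOP_OVERHEAD)                    # one FMA + overhead per element
    vector_instructions = (N // VECTOR_LENGTH) * (1 + LOOP_OVERHEAD) # one vector FMA per 8 elements

    print(f"scalar : {scalar_instructions} instructions")   # 12288
    print(f"vector : {vector_instructions} instructions")   # 1536
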
Finally, a simple 50K-gate processor core makes a natural primitive element for providing redundancy, as noted above for the Cisco Metro chip [Eatherton 2005] and Sun’s Niagara design.

4.1.3 Does one size fit all?


We now briefly consider whether multiprocessors of the future will be built as collections of identical processors or assembled from diverse, heterogeneous processing elements. Existing multiprocessors, such as the Intel IXP network processing family, keep at least one general-purpose processor on the die to support various housekeeping functions and to provide the hardware base for more general (e.g. Linux) operating-system support. Finally, keeping a more conventional processor on chip may help to maximize speed on “inherently sequential” code segments; failure to maximize speed on these segments may result in significant degradation of performance due to Amdahl’s Law. Aside from these considerations, a single replicated processing element has many advantages; in particular, it offers ease of silicon implementation and a regular software environment. We believe that software environments of the future will need to schedule and allocate tens of thousands of tasks onto thousands of processing elements. Managing heterogeneity in such an environment may make a difficult problem impossible. The simplifying value of homogeneity here is similar to the value of orthogonality in instruction-set architectures.
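
As a toy illustration of why homogeneity simplifies the software problem, the sketch below assigns tasks to cores greedily. The core names, task names, and capability sets are entirely made up; the point is only that with identical cores any ready task can go to any lightly loaded core, whereas with heterogeneous cores the scheduler must also match each task's requirements against per-core capabilities.

    # Toy greedy scheduler contrasting homogeneous and heterogeneous cores.
    # All names and capabilities are hypothetical, for illustration only.

    def schedule(tasks, cores):
        """Greedily give each task to the least-loaded core that can run it."""
        load = {core: 0.0 for core in cores}
        placement = {}
        for name, work, needs in tasks:
            candidates = [c for c in cores if needs <= cores[c]]
            if not candidates:
                raise RuntimeError(f"no core can run task {name}")
            target = min(candidates, key=lambda c: load[c])
            placement[name] = target
            load[target] += work
        return placement, load

    # Homogeneous: every core offers the same capabilities, so any task fits anywhere.
    homogeneous = {f"core{i}": set() for i in range(4)}
    tasks = [("fft", 3.0, set()), ("decode", 2.0, set()), ("blend", 1.0, set())]
    print(schedule(tasks, homogeneous))

    # Heterogeneous: only some cores have a SIMD unit, so placement is constrained
    # and the scheduler must track per-core capabilities as well as load.
    heterogeneous = {"big0": {"simd"}, "big1": {"simd"}, "little0": set(), "little1": set()}
    tasks = [("fft", 3.0, {"simd"}), ("decode", 2.0, set()), ("blend", 1.0, {"simd"})]
    print(schedule(tasks, heterogeneous))
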
On the other hand, heterogeneous processor solutions can show significant advantages in power, delay, and area, but they carry additional costs in software development and silicon implementation; managing a library of processors promises to be much more complicated than managing a library of logic cells. Processor instruction-set configurability [Killian et al 2001] is one approach to realizing the benefits of processor heterogeneity while minimizing these costs, but per-instance silicon customization of the processor is still required to realize the performance benefit, and this is only economically justifiable for large markets.
Implementing customized soft processors in pre-defined reconfigurable logic is another way to realize heterogeneity in a homogeneous implementation fabric; however, current area (40X), power (10X), and delay (3X) overheads [Kuon and Rose 2006] appear to make this approach prohibitively expensive for general-purpose processing. A promising approach that supports processor heterogeneity is to add a reconfigurable coprocessor [Hauser and Wawrzynek 1997] [Arnold 2005], which obviates the need for per-instance silicon customization. Current data are insufficient to determine whether such approaches can provide power- and energy-efficient solutions.
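
A back-of-the-envelope combination of the overhead factors cited above from [Kuon and Rose 2006] suggests why soft processors look expensive for general-purpose work; the sketch below composes the penalties into common cost metrics under the simplifying assumption that they act independently.

    # Back-of-the-envelope comparison of a soft processor in reconfigurable logic
    # versus a hard implementation, using the overhead factors cited above.
    # Assumes the penalties compose independently, which is a simplification.

    AREA_OVERHEAD = 40.0   # soft / hard area ratio
    POWER_OVERHEAD = 10.0  # soft / hard power ratio
    DELAY_OVERHEAD = 3.0   # soft / hard delay ratio

    energy_overhead = POWER_OVERHEAD * DELAY_OVERHEAD           # energy = power * time
    area_delay_overhead = AREA_OVERHEAD * DELAY_OVERHEAD        # area-delay product
    energy_delay_overhead = energy_overhead * DELAY_OVERHEAD    # energy-delay product

    print(f"energy overhead       : {energy_overhead:.0f}x")        # 30x
    print(f"area-delay overhead   : {area_delay_overhead:.0f}x")    # 120x
    print(f"energy-delay overhead : {energy_delay_overhead:.0f}x")  # 90x
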
In brief, will the possible power and area advantages of heterogeneous-ISA multicores win out over the flexibility and software advantages of homogeneous ISAs? Or to put it another way: in the multiprocessor of the future, will a simple pipelined processor be like a transistor: a single building block that can be woven into arbitrarily complex circuits? Or will a processor be more like a NAND gate in a standard-cell library: one instance of a family of hundreds of closely related but unique devices? In this section we do not claim to have resolved these questions. Rather, our point is that resolving them is certain to require significant research and experimentation, and the need for this research is more imminent than a multicore multiprocessor roadmap (e.g. 2, 4, 8, and 16 processors) would otherwise indicate.

