The left tower of Figure 1 is applications. In addition to traditional desktop, server, scientific, and embedded applications, the importance of consumer products is increasing.
The conventional path to guide and evaluate architecture innovation is to study a benchmark suite based on existing programs, such as EEMBC or SPEC. A problem for innovation in parallelism is that it is unclear today how best to express a parallel computation. Hence, it seems unwise to let a set of existing programs drive an investigation into parallel computing: their requirements are too application-specific. There is a need for a higher level of abstraction for reasoning about parallel application requirements.
We decided to mine the parallelism experience of the high-performance computing community to see if there are lessons we can learn for a broader view of parallel computing. The hypothesis is not that traditional scientific computing is the future of parallel computing; it is that the body of knowledge created in building programs that run well on massively parallel computers may prove useful elsewhere. Furthermore, many of the authors from other areas, such as embedded computing, were surprised at how closely future applications in their domain mapped to problems in scientific computing.
For instance, computer gaming has become a strong driver pushing desktop computing performance toward the teraflop scale. While the race to improve realism has pushed graphics processing unit (GPU) performance into the teraflop range (in single precision), graphical realism is not just a matter of drawing polygons and textures onto the screen. Rather, modeling the physical processes that govern the behavior of these graphical objects requires many of the same computational methods used in large-scale scientific simulations (http://graphics.stanford.edu/~fedkiw/). The same is true for many applications in computer vision and media processing that form the core of the "applications of the future" driving the technology roadmaps of hardware vendors.
Our goal is to delineate application requirements in a manner that is not overly application-specific, so that we can draw broader conclusions about hardware requirements. Our approach, described below, is to define a number of "dwarfs," each of which captures a pattern of computation and communication common to a class of important applications. In broad terms, the dwarfs are equivalence classes: applications in the same class exhibit similar patterns of computation and communication. Our investigation led us to consider a number of different ways of defining these equivalence classes, which are described below.
3.1 Seven Dwarfs
We were inspired by the work of Phil Colella, who identified seven numerical methods that he believed will be important for science and engineering for at least the next decade [Colella 2004]. The idea is that the programs that implement these numerical methods may change, but the methods themselves will remain important. The Seven Dwarfs, introduced in Figure 3, constitute equivalence classes where membership in a class is defined by similarity in computation and data movement (where data movement includes both the memory access pattern and the communication pattern). The dwarfs are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of applications. Programs that are members of a particular class can be implemented differently, and the underlying numerical methods may change over time, but the claim is that the underlying patterns have persisted through generations of change and will remain important into the future.
Figure 3 describes Colella's Seven Dwarfs, with examples of programs that implement them, the corresponding NAS benchmarks, and examples of computers that run each dwarf well.
Some evidence for the existence of this particular set of equivalence classes can be found in the numerical libraries that have been built around them (e.g., FFTW for spectral methods, LAPACK/ScaLAPACK for dense linear algebra, and OSKI for sparse linear algebra), which we list in Figure 3. Computer architectures that have been purpose-built for particular dwarfs (e.g., GRAPE for particle/N-body methods [Tokyo 2006], vector architectures for linear algebra [Russell 1976], and FFT accelerators [Zarlink 2006]) provide further, if weaker, evidence for these equivalence classes from the hardware perspective. We were also surprised to find that the inter-processor communication patterns exhibited by members of a dwarf when running on a parallel machine were closely related (also shown in Figure 3) [Vetter and McCracken 2001] [Vetter and Yoo 2002] [Vetter and Mueller 2002] [Kamil et al 2005]. The communication pattern is closely related to the memory access pattern that takes place locally on each processor.
The dwarfs present a method of reasoning about the common requirements of classes of applications that is reasonably divorced from individual implementations and from system concurrency. If the dwarfs prove broadly applicable, we can use them to guide innovation and the evaluation of new prototypes: ideally, if new architectures and programming models run all the dwarfs well, then the applications of the future will run efficiently on manycore systems.
Although the nomenclature of the dwarfs comes from Phil Colella's discussion of scientific computing applications, our plan to apply them to a broader array of computational methods led us to investigate the following questions:
- How well do the seven dwarfs of high performance computing capture the computation and communication of a broader range of computing?
- What dwarfs need to be added to cover the missing important areas beyond high performance computing?
As long as the final list contains no more than two or three dozen dwarfs, architects and programming model designers can use them to measure success. For comparison, SPEC2006 has 29 benchmarks and EEMBC has 41.
Each entry in Figure 3 gives the dwarf, a description, the communication pattern observed when it runs on a parallel machine, and the corresponding NAS benchmark and example hardware.
1. Structured Grids (e.g., Cactus or Lattice-Boltzmann Magnetohydrodynamics)
Description: Represented by a regular grid; points on the grid are conceptually updated together. It has high spatial locality. Updates may be in place or between two versions of the grid. The grid may be subdivided into finer grids in areas of interest ("Adaptive Mesh Refinement"), and the transition between granularities may happen dynamically.
Communication pattern: That of Cactus, a PDE solver using a 7-point stencil on 3D block-structured grids.
NAS benchmark / example hardware: Multi-Grid, Scalar Penta-diagonal / QCDOC [Edinburgh 2006], BlueGene/L
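To make the structured-grid pattern concrete, the following is a minimal sketch of a 7-point stencil sweep between two versions of a 3D grid. The function name, grid dimensions, and coefficients are illustrative placeholders, not code from Cactus or the NAS benchmarks.

```c
#include <stddef.h>

/* Hypothetical 7-point stencil sweep over an NX x NY x NZ grid.
 * Reads from 'in', writes to 'out' (an update between two versions of the grid). */
#define NX 64
#define NY 64
#define NZ 64
#define IDX(i, j, k) (((size_t)(i) * NY + (j)) * NZ + (k))

void stencil_sweep(const double *in, double *out, double c0, double c1)
{
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            for (int k = 1; k < NZ - 1; k++)   /* unit-stride innermost loop: high spatial locality */
                out[IDX(i, j, k)] = c0 * in[IDX(i, j, k)]
                    + c1 * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)]
                          + in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)]
                          + in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
}
```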
2. Unstructured Grids (e.g., ABAQUS or FIDAP)
Description: An irregular grid where data locations are selected, usually by underlying characteristics of the application. Data point location and connectivity of neighboring points must be explicit. The points on the grid are conceptually updated together. Updates typically involve multiple levels of memory-reference indirection, as an update to any point requires first determining the list of neighboring points and then loading values from those neighboring points.
NAS benchmark / example hardware: Unstructured Adaptive / Vector computers with gather/scatter, Tera Multi-Threaded Architecture [Berry et al 2006]
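The sketch below illustrates, under simplifying assumptions, the two levels of indirection typical of this dwarf: each point first reads its neighbor list and then gathers the neighbors' values. The CSR-style neighbor arrays and the averaging update are hypothetical, chosen only to show the access pattern.

```c
/* Hypothetical unstructured-mesh update: each point gathers values from an
 * explicit neighbor list, so every update goes through two levels of
 * indirection (point -> neighbor indices -> neighbor values). */
void unstructured_update(int npoints,
                         const int *nbr_start,   /* offsets into nbr_index, length npoints + 1 */
                         const int *nbr_index,   /* concatenated neighbor lists */
                         const double *in, double *out)
{
    for (int p = 0; p < npoints; p++) {
        double acc = 0.0;
        int deg = nbr_start[p + 1] - nbr_start[p];
        for (int n = nbr_start[p]; n < nbr_start[p + 1]; n++)
            acc += in[nbr_index[n]];             /* indexed (gather) load via the neighbor list */
        out[p] = deg > 0 ? acc / deg : in[p];    /* average of neighbor values */
    }
}
```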
3. Spectral Methods (e.g., FFT)
Description: Data are in the frequency domain, as opposed to time or spatial domains. Typically, spectral methods use multiple butterfly stages, which combine multiply-add operations with a specific pattern of data permutation, with all-to-all communication for some stages and strictly local communication for others.
Communication pattern: That of PARATEC: the 3D FFT requires an all-to-all communication to implement a 3D transpose, which requires communication between every link. The diagonal stripe describes the BLAS-3-dominated linear-algebra step required for orthogonalization.
NAS benchmark / example hardware: Fourier Transform / DSPs, Zarlink PDSP [Zarlink 2006]
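The following minimal sketch of a recursive radix-2 Cooley-Tukey FFT shows the butterfly stages described above; it is an illustration of the computational pattern, not the algorithm used by PARATEC or a tuned library such as FFTW.

```c
#include <complex.h>

/* Minimal recursive radix-2 Cooley-Tukey FFT (n must be a power of two).
 * Each level performs "butterfly" multiply-adds on pairs of elements after a
 * data permutation; on a parallel machine some levels are strictly local while
 * others require all-to-all communication. VLAs keep the sketch short; this is
 * not a production FFT. */
void fft(double complex *x, int n)
{
    if (n <= 1) return;
    double complex even[n / 2], odd[n / 2];
    for (int i = 0; i < n / 2; i++) {
        even[i] = x[2 * i];                      /* permutation: split into even/odd samples */
        odd[i]  = x[2 * i + 1];
    }
    fft(even, n / 2);
    fft(odd, n / 2);
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < n / 2; k++) {
        double complex t = cexp(-2.0 * I * pi * k / n) * odd[k];
        x[k]         = even[k] + t;              /* butterfly: one multiply-add pair */
        x[k + n / 2] = even[k] - t;
    }
}
```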
4. Dense Linear Algebra (e.g., BLAS or MATLAB)
Description: Data are dense matrices or vectors. (BLAS Level 1 = vector-vector; Level 2 = matrix-vector; Level 3 = matrix-matrix.) Generally, such applications use unit-stride memory accesses to read data from rows and strided accesses to read data from columns.
Communication pattern: That of MadBench, which makes heavy use of ScaLAPACK for parallel dense linear algebra and is typical of a much broader class of numerical algorithms.
NAS benchmark / example hardware: Block Tridiagonal Matrix, Lower-Upper Symmetric Gauss-Seidel / Vector computers, array computers
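As an illustration of the access pattern, here is a minimal BLAS Level 2 style matrix-vector product for a row-major matrix; the function name and layout assumption are ours, not part of BLAS or the NAS benchmarks.

```c
#include <stddef.h>

/* Hypothetical dense matrix-vector product y = A*x (BLAS Level 2 style) for a
 * row-major m x n matrix. Walking along a row is unit-stride; walking down a
 * column of the same matrix would touch memory with stride n. */
void matvec_rowmajor(int m, int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double acc = 0.0;
        const double *row = A + (size_t)i * n;   /* start of row i */
        for (int j = 0; j < n; j++)
            acc += row[j] * x[j];                /* unit-stride accesses */
        y[i] = acc;
    }
}
```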
5. Sparse Linear Algebra (e.g., SpMV, OSKI, or SuperLU)
Description: Data sets include many zero values. Data is usually stored in compressed matrices to reduce the storage and bandwidth required to access all of the nonzero values. Because of the compressed formats, data is generally accessed with indexed loads and stores.
Communication pattern: That of SuperLU, which uses the BCSR method for implementing sparse LU factorization.
NAS benchmark / example hardware: Conjugate Gradient / Vector computers with gather/scatter
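The sketch below shows a sparse matrix-vector product over a compressed sparse row (CSR) matrix, one common compressed format; it illustrates the indexed loads described above and is not taken from OSKI or SuperLU (which uses a blocked variant, BCSR).

```c
/* Hypothetical sparse matrix-vector product y = A*x with A in compressed
 * sparse row (CSR) form: only nonzeros are stored, and the stored column
 * indices force indexed (gather) loads from x. */
void spmv_csr(int nrows,
              const int *row_ptr,     /* length nrows + 1: where each row starts in val/col */
              const int *col,         /* column index of each nonzero */
              const double *val,      /* nonzero values */
              const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double acc = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += val[k] * x[col[k]];           /* indexed load */
        y[i] = acc;
    }
}
```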
6. Particle Methods (e.g., GTC, the Gyrokinetic Toroidal Code, or Barnes-Hut and Fast Multipole methods)
Description: Depends on interactions between many discrete points. Variations include a) particle-particle methods, where every point depends on all others, leading to an O(N²) calculation; and b) particle-in-cell, where all of the particles interact via a regular grid, leading to simpler, more regular calculations. Hierarchical particle methods combine forces or potentials from multiple points to reduce the computational complexity to O(N log N) or O(N).
Communication pattern: GTC presents the typical communication pattern of a particle-in-cell (PIC) code; PMEMD's communication pattern is that of a particle-mesh Ewald calculation.
NAS benchmark / example hardware: (no benchmark) / GRAPE [Tokyo 2006], MD-GRAPE [IBM 2006]
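A minimal sketch of the O(N²) particle-particle variant follows; the structure layout, softening constant, and force accumulation are illustrative assumptions rather than code from GTC or GRAPE.

```c
#include <math.h>

/* Hypothetical all-pairs (particle-particle) gravitational force accumulation:
 * every particle interacts with every other, which is the O(N^2) variant
 * described above; Barnes-Hut or fast multipole methods reduce this cost.
 * Assumes 'force' is zero-initialized by the caller. */
typedef struct { double x, y, z, mass; } Particle;

void accumulate_forces(int n, const Particle *p, double (*force)[3], double G)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            double dx = p[j].x - p[i].x, dy = p[j].y - p[i].y, dz = p[j].z - p[i].z;
            double r2 = dx * dx + dy * dy + dz * dz + 1e-12;  /* softening avoids divide-by-zero */
            double s  = G * p[i].mass * p[j].mass / (r2 * sqrt(r2));
            force[i][0] += s * dx;
            force[i][1] += s * dy;
            force[i][2] += s * dz;
        }
}
```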
7. Monte Carlo
Description: Calculations depend on statistical results of repeated random trials. Considered embarrassingly parallel (see the sketch following the figure).
Communication pattern: Communication is typically not dominant in Monte Carlo methods.
NAS benchmark / example hardware: Embarrassingly Parallel / NSF TeraGrid
Figure 3. Seven dwarfs, their descriptions, corresponding NAS benchmarks, and example computers. The matrices describing the communication patterns show areas of high communication volume between processors in white.
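To illustrate the Monte Carlo dwarf (entry 7 in Figure 3), here is a minimal, self-contained sketch that estimates pi from repeated random trials; the trial count, seed, and use of rand() are arbitrary choices for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal Monte Carlo example: estimate pi from repeated random trials.
 * Each trial is independent, so the trials can be split across processors
 * with essentially no communication ("embarrassingly parallel").
 * rand() is used only to keep the sketch self-contained. */
int main(void)
{
    const long trials = 10000000;
    long hits = 0;
    srand(12345);                                  /* fixed seed for reproducibility */
    for (long t = 0; t < trials; t++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)                  /* point landed inside the quarter circle */
            hits++;
    }
    printf("pi is approximately %f\n", 4.0 * (double)hits / trials);
    return 0;
}
```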