Programming and Compilation for ATAC

Given the multicore trend, all computer systems will be parallel systems; all programs will be parallel programs. Yet, it is still unclear how programmers will make the shift from sequential programming to efficient parallel programming. Accordingly, a key goal of the ATAC project is to enable programmers to easily and efficiently harness the system’s computational power by establishing a clear, high-level programming interface that makes use of the broadcast capability in the short term, and a language based on broadcast in the longer term.

Our programming and compilation effort will have two facets. The first will develop an API that exposes broadcast-related primitives directly to the user. We will use our experience with this API to design a language (in the spirit of Thinking Machines' C*) in which the basic broadcast primitive drives the language design. We have experience designing APIs and languages for novel architectures; for example, in the Raw effort, we designed a streaming API called libStream, followed by the development of the StreamIt language [34].

The ATAC programming model and APIs will allow users to easily write programs employing global operations such as broadcast, scatter/gather, and reduction via the optical broadcast network. Likewise, programmers will be able to use the electrical mesh network through the high-level programming interface as well. In fact, the interface will be able to transparently choose which network is used depending on the type of communication (e.g., single point-to-point message vs. chip-wide broadcast operation).
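To make the interface concrete, the following is a minimal sketch of what such an API might look like. All names here (AtacRuntime, send, broadcast, reduce) are illustrative assumptions, not the actual ATAC interface; the point is that the runtime transparently routes point-to-point messages over the electrical mesh and chip-wide collectives over the optical broadcast network.

```python
# Hypothetical ATAC-style API sketch. Names are illustrative only.
# The runtime picks a network per operation: point-to-point messages
# use the electrical mesh; chip-wide collectives use the optical
# broadcast network.

class AtacRuntime:
    def __init__(self, num_cores):
        self.num_cores = num_cores

    def send(self, src, dst, value):
        # Single point-to-point message: routed over the electrical mesh.
        return ("mesh", value)

    def broadcast(self, src, value):
        # Chip-wide broadcast: one optical transmission, every core
        # receives a copy.
        return ("optical", [value] * self.num_cores)

    def reduce(self, values, op):
        # Reduction over per-core values (e.g., a global sum).
        acc = values[0]
        for v in values[1:]:
            acc = op(acc, v)
        return ("optical", acc)

rt = AtacRuntime(num_cores=64)
net, copies = rt.broadcast(src=0, value=42)
assert net == "optical" and len(copies) == 64
net, total = rt.reduce(list(range(64)), op=lambda a, b: a + b)
assert total == 2016
```

The key design point is that the programmer expresses the collective operation once, and the network choice is an implementation detail.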

The second facet will provide ways of implementing existing architectural models that are known to be reasonably easy to program but difficult to implement, such as large-scale coherent snooping caches along with a shared memory programming model, or the Message Passing Interface (MPI) along with multicast.

We will also experiment with hybrid approaches in which the programming model supports both shared memory and message passing modes, both built upon the underlying on-chip networks and in-core caches. The shared memory model will build upon novel cache coherence algorithms that use the broadcast network to efficiently communicate shared cache state. The message passing model will build upon existing message passing models such as MCAPI and MPI, with an emphasis on ease of programming and the addition of a set of broadcast primitives.
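One way to see why broadcast helps coherence is to count the messages needed to invalidate sharers on a write. The sketch below is purely illustrative (not the actual ATAC protocol): a mesh needs one point-to-point invalidate per sharer, while a single optical broadcast reaches all sharers at once.

```python
# Illustrative message-count comparison for a write invalidation.
# Not the actual ATAC coherence protocol; just the broadcast intuition.

def mesh_invalidate_msgs(sharers):
    # One point-to-point invalidate per sharing core on the mesh.
    return len(sharers)

def broadcast_invalidate_msgs(sharers):
    # A single chip-wide optical broadcast reaches every sharer at once.
    return 1 if sharers else 0

sharers = {3, 17, 42, 63}
assert mesh_invalidate_msgs(sharers) == 4
assert broadcast_invalidate_msgs(sharers) == 1
```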

Measuring Programming Ease


Given that programming efficiency is such a key driver of the ATAC project, developing appropriate metrics for programming efficiency is important. As such, this project will develop metrics that assess “programming efficiency” and “ease of programming” by taking into account the difficulty of implementing a variety of algorithms with the ATAC programming model as well as baseline programming models such as a standard tiled multicore processor (e.g., Raw) or an MPI-based parallel system.

There are several possible ways of measuring the difficulty of implementing algorithms. One measure that has been used in the past is “lines of code”. Our implementations of applications in both the ATAC style and existing styles will directly yield the lines of code metric.

However, studies have shown [40] that lines of code is not a particularly good measure of programming ease, because not all lines of code require the same amount of effort to write. For example, code that involves communication or synchronization between processors is generally more difficult to write than a simple computational loop. Lines of code also fails to capture the effort the programmer expends deciding how to partition and spatially distribute an application. To measure the true benefit to ATAC programmers, more sophisticated measures will need to be created, such as a comparison of the actual time spent developing a given application in each programming model.

Our team has previous experience with experiments designed to measure programmer productivity. In a study conducted using the StreamIt programming language [33], different programmers were given a variety of application development and debugging problems, and the time it took them to reach a solution was measured. Similar studies might be a good way to measure the programming advantages of an ATAC system more accurately. One problem with this approach is that when programmers are asked to implement the same problem in both programming models, the second model tends to appear easier, since the programmer is already familiar with the problem the second time around.



A particularly appealing approach, which normalizes out this learning bias, is to divide the programmer group into two subgroups that implement the application in the two programming models in opposite orders, evening out the bias. Programming time measured in this manner tends to be a good measure of programming ease.

Proof of Concept, Early Results and Current Status


To assess the performance characteristics of an ATAC multicore processor, we compare its performance and programmability to a leading-edge processor based on an electrical mesh network design similar to the MIT Raw processor. In general, the performance of the electrical processor is expected to be only slightly lower than the ATAC system if programming effort is not an issue. However, for applications and programming models that benefit from a fast broadcast capability, ATAC will yield a performance benefit. We estimate performance using an analytical model and also measure programming ease using lines of code.




                                                    64-core System             4096-core System
                                                    Mesh         ATAC          Mesh          ATAC
Theoretical Peak Performance                        64 GFLOPS    64 GFLOPS     4 TFLOPS      4 TFLOPS
Actual Performance                                  7.3 GFLOPS   38 GFLOPS     0.22 TFLOPS   2.3 TFLOPS
Chip Power                                          24 W         22.7 W        140 W         155 W
Total System Power (CPU + DRAM + Optical Supply)    40 W         28 W          225 W         232 W
Total System Actual Power Efficiency                0.2 GFLOPS/W 1.4 GFLOPS/W  1.0 GFLOPS/W  9.9 GFLOPS/W

Table 1: Comparison of performance, power, and efficiency of ATAC and electrical-mesh processors. Results are presented for 64-core and 4096-core chips.

The ATAC processor is essentially the same as the baseline processor with the addition of an optical network to optimize global communication. Both processors have the same number of cores running at the same frequency and therefore have the same theoretical peak performance. However, the theoretical peak is unachievable for most applications, particularly those that require large amounts of communication or coordination between cores. A better way to compare performance is to count useful operations performed while executing an actual application: dividing the number of operations performed by the total time required to complete the application yields the effective performance. The effective performance numbers shown in Table 1 are based on a study of an abstracted coherent shared memory application (described in more detail in the Applications Performance section below).

The actual performance of the ATAC processor is better than the baseline processor due to the increased efficiency of the global operations necessary to implement coherent shared memory. The processing cores in ATAC spend less time waiting for global communication operations to complete and therefore spend a greater fraction of their time doing useful work. Note that the difference between ATAC and the baseline is greater with a larger number of cores due to the distance-based communication costs on the electrical mesh. Even though the peak power consumption of the ATAC processor can be somewhat higher (due to the addition of the optical network), the actual power efficiency of an ATAC system is substantially better. This is due to two factors: 1) less energy is wasted waiting for global communication operations and 2) the availability of the broadcast allows a value fetched from DRAM to be sent to all the cores as opposed to having all the cores access DRAM for that value. This greatly reduces the number of DRAM accesses.
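As a quick sanity check, the efficiency figures in the bottom row of Table 1 follow directly from dividing actual performance by total system power (the TFLOPS entries are converted to GFLOPS; the table reports values to one decimal place):

```python
# Recompute Table 1's "Total System Actual Power Efficiency" row:
# efficiency = actual performance / total system power.

def efficiency_gflops_per_watt(actual_gflops, system_watts):
    return actual_gflops / system_watts

# 64-core systems (values from Table 1).
assert round(efficiency_gflops_per_watt(7.3, 40), 1) == 0.2   # Mesh
assert round(efficiency_gflops_per_watt(38, 28), 1) == 1.4    # ATAC

# 4096-core systems (0.22 TFLOPS = 220 GFLOPS, 2.3 TFLOPS = 2300 GFLOPS).
assert round(efficiency_gflops_per_watt(220, 225), 1) == 1.0  # Mesh
assert round(efficiency_gflops_per_watt(2300, 232), 1) == 9.9 # ATAC
```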

Applications Performance

ATAC particularly helps applications with large amounts of global communication or irregular communication patterns, because electrical networks scale poorly in both communication bandwidth and coordination predictability.

ATAC performs well on global operations because of its highly efficient broadcast, which is an order of magnitude faster than on a mesh-based multicore. Mesh-based multicores with 64 or more cores do not perform global operations efficiently because they have large, non-uniform core-to-core communication latencies, and they exhibit significant contention during broadcast operations. ATAC has neither problem: broadcast latency is not distance-dependent but is roughly 3 ns for all communication, regardless of the endpoints, and the network's WDM nature enables contention-free broadcasts within a region, allowing many simultaneous broadcast operations. Furthermore, unlike typical mesh multicores, processor cores can consume messages immediately when they arrive; ATAC's novel WDM and buffering scheme obviates message sorting and reassembly at the receiving core.
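The distance dependence can be sketched with a back-of-envelope latency model. The per-hop cost below is an assumed placeholder, not a measured value; the point is only that worst-case mesh latency grows with the network diameter while ATAC's broadcast stays flat at roughly 3 ns.

```python
# Back-of-envelope broadcast latency comparison (assumed per-hop cost).
import math

def mesh_broadcast_latency_ns(num_cores, hop_ns=1.0):
    # Worst-case corner-to-corner path on a sqrt(n) x sqrt(n) mesh.
    side = int(math.sqrt(num_cores))
    return 2 * (side - 1) * hop_ns

ATAC_BROADCAST_NS = 3.0  # distance-independent, per the text above

assert mesh_broadcast_latency_ns(64) == 14.0     # 8x8 mesh
assert mesh_broadcast_latency_ns(4096) == 126.0  # 64x64 mesh
# The mesh gap widens with core count; ATAC stays at ~3 ns either way.
```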

These features of ATAC translate into significant performance improvements. Eight applications were analyzed by estimating their running time on both a mesh-based multicore and ATAC for both 64 cores and 4096 cores. As seen in Figure 6, application performance improves by up to a factor of 80.

Figure 6: Comparison of performance between ATAC and electrical mesh for selected applications.

Application speedups are calculated using analytical models based on application characteristics. These models include estimates of time spent in three categories: normal computation, core-to-core communication, and memory accesses. For all applications except Snooping Shared Memory, the models calculate the exact number of operations in each category required to execute the application on a specific problem size. The estimated costs of each operation are then summed to calculate the total run time. Care was taken to accurately model the operation of the communication networks, including any end-point contention. A range of problem sizes was examined and representative examples were chosen for the results shown.
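A minimal version of this cost model can be sketched as follows. The operation counts and per-operation costs below are placeholder values chosen for illustration, not the study's actual parameters; the structure (sum of the three modeled categories) is what matters.

```python
# Minimal analytical cost model: total run time is the sum over the three
# modeled categories of (operation count) x (per-operation cost).
# All numbers below are placeholders, not the study's parameters.

def total_runtime(counts, costs):
    return sum(counts[k] * costs[k] for k in ("compute", "comm", "memory"))

counts = {"compute": 1_000_000, "comm": 10_000, "memory": 50_000}
mesh_costs = {"compute": 1.0, "comm": 50.0, "memory": 100.0}  # cycles (assumed)
atac_costs = {"compute": 1.0, "comm": 5.0, "memory": 100.0}   # cheaper broadcast

# With identical compute and memory costs, cheaper communication on ATAC
# directly lowers the modeled run time.
assert total_runtime(counts, mesh_costs) > total_runtime(counts, atac_costs)
```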

The Snooping Shared Memory application represents an abstract application that performs computation and makes memory accesses that sometimes result in cache coherence operations. The model includes probabilities that each instruction of an application will induce different types of coherence traffic. Modeled operations include reads and writes to a local region of cores as well as the entire chip. The time required to resolve these operations is added to the time required to execute the instructions.
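The expected-cost structure of this probabilistic model can be sketched as below. The probabilities and cycle costs are placeholders, not the values used in the study; the model simply adds the probability-weighted cost of each coherence operation to the base instruction cost.

```python
# Sketch of the probabilistic coherence model: expected per-instruction
# time is the base cost plus probability-weighted coherence costs.
# Probabilities and cycle counts below are placeholder values.

def expected_cycles_per_instr(base, coherence_events):
    # coherence_events: list of (probability, cost_in_cycles) pairs.
    return base + sum(p * cost for p, cost in coherence_events)

events = [
    (0.01, 20),    # read resolved within a local region of cores
    (0.005, 100),  # read requiring a chip-wide coherence operation
    (0.002, 150),  # write invalidating sharers across the chip
]
t = expected_cycles_per_instr(base=1.0, coherence_events=events)
assert abs(t - 2.0) < 1e-9  # 1.0 + 0.2 + 0.5 + 0.3
```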



Ease of Programming

The primary goal of the ATAC project is to develop ways in which high performance can be achieved on ATAC with only modest programmer effort.


Figure 7: Lines of code required to implement four algorithms on ATAC vs. a mesh-based multicore.

"Lines of code" is one possible measure of programmer effort that has been used in the past, and we present some early results based on that metric. Although it is not the best measure of programming ease, it was easier to obtain for our initial results than the more sophisticated methods described above, such as measured programming time. Lines of code measures the amount of code a programmer must write to implement a particular application. Figure 7 compares ATAC with a mesh-based multicore architecture on four applications: vector addition, Jacobi relaxation, leader election, and barrier synchronization. Even on these relatively simple applications, ATAC's code size is smaller than that of the same programs written for a mesh.

