Parallel programming is a challenging problem [41]. For existing sequential codes to take advantage of multicores, programming tools and novel architectures are needed to ease the transition from sequential to multicore codes [42]. Much academic and industry research has tried to reduce the effort required to run codes on parallel computers, including parallel programming APIs, domain-specific languages, automatic parallelizing compilers, parallel performance tools, and incremental architectural enhancements, but with little impact. We believe bold architectural approaches are needed to address this pressing issue.
There are many extensions to sequential programming languages that provide parallel capabilities. Examples include threads [43], OpenMP [44], MapReduce [45], and MPI [46]. These methods work well for some applications, but none has shown itself capable of tackling all forms of parallelism. Many of them, such as threads, do not scale beyond a few cores. Some researchers, including Lee [47], also argue that these extensions make it easy for the programmer to introduce errors. MPI is difficult to program, and its high operation overhead squanders the advantage of multicore. All-to-all broadcast using optical interconnect addresses both the scalability and the programmability issues.
Domain-specific languages also attempt to address multicore programmability. StreamIt [48] and Brook [49] are programming languages primarily focused on signal processing and stream processing.
Parallelism can also be extracted automatically from sequential codes. This lets programmers leave their sources unmodified while still realizing performance improvements on parallel machines. The gains are typically modest and unlikely to scale beyond 10 or 20 cores. Examples of distributed ILP (DILP) compiler efforts include Mahlke’s work [50], GREMIO [51], and our own effort on RawCC [9].
To ease multicore programming, hardware has also been added to support simpler programming models. One example is transactional memory [52, 53]. Transactional memory systems allow multiple threads to access shared memory inside a transaction. If multiple threads access the same piece of data, the system rolls back the conflicting transactions so that only one modifies the shared data at a time. It is conjectured that this model is easier to program and less error prone than threaded programs with locks, but this needs further investigation.
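The rollback behavior described above can be illustrated with a small software sketch. This is not how the hardware systems in [52, 53] are implemented; it is a minimal optimistic-concurrency model in Python, and the names `VersionedCell`, `atomically`, and `run_demo` are hypothetical, introduced only for illustration. Each transaction works on a private snapshot and commits only if no other transaction committed in the meantime; otherwise it discards its work and retries.

```python
import threading


class VersionedCell:
    """A shared value plus a version counter used to detect conflicts."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        # Guards only the short snapshot and commit steps, not the user's work.
        self._commit_lock = threading.Lock()

    def snapshot(self):
        """Return a consistent (value, version) pair."""
        with self._commit_lock:
            return self.value, self.version

    def try_commit(self, seen_version, new_value):
        """Commit only if nobody else committed since the snapshot was taken."""
        with self._commit_lock:
            if self.version != seen_version:
                return False  # conflict: the caller must roll back and retry
            self.value = new_value
            self.version += 1
            return True


def atomically(cell, update):
    """Run update(old) -> new as a transaction; return the number of retries."""
    retries = 0
    while True:
        old, seen = cell.snapshot()
        new = update(old)  # speculative work, invisible to other threads
        if cell.try_commit(seen, new):
            return retries
        retries += 1  # "roll back": drop the speculative result and start over


def run_demo(num_threads=4, increments=1000):
    """Increment one shared counter from several threads without explicit locks."""
    cell = VersionedCell(0)

    def worker():
        for _ in range(increments):
            atomically(cell, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value
```

In this sketch the programmer writes only the update function, and conflict detection and retry are handled by `atomically`, which is the programmability argument usually made for transactional memory.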
Different architectures attempt to solve the problem of organizing and connecting parallel resources on a single chip. One way to do this is with processors designed to exploit instruction-level parallelism. The Itanium processor [54] and the Multiflow work [55] are examples. Streaming processors approach the problem by optimizing for applications with little temporal locality. The prototypical stream processor is the Imagine processor [56]. Another organization of parallel resources is to build an SMP on a chip. The Piranha project [57] is an example of an SMP on a chip for commercial workloads. Finally, there are processors that support multiple types of parallelism, such as TRIPS [58] and Raw, which can support stream processing, thread-level parallelism, and ILP.
Although other methods of using photonics for on-chip interconnect are being developed, their limited gains over electrical interconnect rarely justify their added cost and complexity. Free-space optics has attracted significant interest because it offers flexibility in terms of hybrid components. Its downsides are limited CMOS compatibility and the difficulty of fabricating reliable optical components. Another approach replaces the electrical bus with an optical bus. Unfortunately, contention still limits the optical bus. We believe that our approach, which combines WDM, CMOS process compatibility, and broadcast, uniquely leverages the strengths of photonics for the specific goal of enhancing multicore programmability in an area where electrical interconnect falls short, thereby justifying its use.
Research Questions
The aim of the ATAC project is to create a multicore computer system that can scale to thousands of cores, both in terms of performance and ease of programming. ATAC attempts to achieve this goal by integrating an optical broadcast network into a tiled multicore processor with an electrical mesh network. ATAC will also provide the programmer with high-level APIs to efficiently and effectively exploit ATAC’s hardware resources. The research questions for the ATAC project center around how to best achieve and balance the two goals of performance and programmability. More specifically, our research will attempt to answer the following questions:
- What is the best way to interface an optical broadcast network with a basic tiled multicore processor?
- What is the right balance in the power budget between energy used in the electrical mesh network and energy used in the optical broadcast network? Furthermore, what is the right balance between the power consumption of such communication networks, the computational part of each core, and the on-chip and off-chip memory resources?
- What is the best API to provide programmers to take advantage of both the optical broadcast network and the electrical mesh? Should the APIs allow the programmer to observe or control the spatial location where a particular piece of computation is run, or should the API handle this behind the scenes? Are there high-level language constructs that would help programmer productivity?
- What is the best way to measure ease of programming and programmer productivity?
- How do performance and programming effort compare between the baseline tiled multicore architecture and the architecture with the optical network?
- To what degree can the broadcast network make programming easier for a given level of performance? Or conversely, to what degree can performance be increased with a broadcast network for a given level of programming effort? This result will provide the justification for taking the next step of building a physical prototype of the ATAC processor.
- To what extent can a pipelined broadcast be implemented in software on a traditional electrical network, and what performance can it achieve?
- Which application classes will best take advantage of ATAC’s broadcast network?
- What is the best way to simulate 1000 cores at reasonable speeds and with sufficient flexibility?
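The question about a software pipelined broadcast on an electrical mesh can be made concrete with a toy latency model. The sketch below is an illustration only, not the project's evaluation methodology. It assumes (hypothetically) that each link carries one flit per cycle, that a tile can forward a flit to all of its neighbors in the same cycle, and that there is no contention. The function names are invented for this example.

```python
from collections import deque


def mesh_depth(width, height, source):
    """Hops from `source` to the farthest tile of a width x height mesh (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in dist:
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    return max(dist.values())


def store_and_forward_cycles(depth, flits):
    """Each hop waits for the whole message before forwarding it."""
    return depth * flits


def pipelined_cycles(depth, flits):
    """The first flit takes `depth` cycles; each later flit trails by one cycle."""
    return depth + flits - 1


if __name__ == "__main__":
    depth = mesh_depth(4, 4, (0, 0))  # broadcast from a corner tile
    for flits in (1, 8, 64):
        print(flits, store_and_forward_cycles(depth, flits),
              pipelined_cycles(depth, flits))
```

Under these assumptions, pipelining hides most of the per-hop cost for long messages, but the broadcast still pays the full mesh depth for the first flit, and that depth grows with the number of tiles. Real routers add per-hop overhead and contention that this model ignores.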