The ATAC approach incorporates an optical broadcast network into a tiled multicore processor architecture. The optical network is enabled by recent advances in electronic-photonic integration using standard CMOS. ATAC also seamlessly integrates an electrical mesh interconnect for high-bandwidth near-neighbor communication. Programmers will interact with ATAC via high-level APIs that leverage the system’s underlying resources. The following sections discuss details of the ATAC architecture, optical network, and software infrastructure.
The Basic ATAC Architecture
Figure 3: ATAC architecture for a 4096-core microprocessor chip.
The ATAC processor architecture is a tiled multicore architecture that combines the best of today’s scalable electrical interconnects with cutting-edge on-chip optical communication networks. The tiled layout uses a 2-D array of simple processing cores, each containing a multiple-issue, in-order RISC pipeline with an FPU and local memories. Each tile contains an L1 cache and a portion of the distributed L2 and L3 caches. The L1 and L2 caches are SRAMs, while the L3 cache is embedded DRAM. Chip resources are divided approximately evenly among computation, communication, and memory (one-third each), a split that has been shown to be near-optimal for multicore processors [32].
One of the important and appealing aspects of our design is that the processors, the electrical network, and the optical network all run at a modest 1 GHz clock. Although optical networks can be clocked at much higher speeds, the power consumed by the endpoint transducers and interface circuitry can become prohibitive. Because our optical network is used only for broadcast, its bandwidth requirements are modest and we do not need clock speeds much higher than the base processor frequency.
Each core is connected to its four nearest neighbors by point-to-point electrical links. Together, these links form a complete mesh network (the “ENET” indicated in Figure 3) capable of transferring data between any two cores over multiple hops. On top of this state-of-the-art electrical substrate, ATAC adds an integrated photonic communications network to improve the performance and efficiency of operations that are costly on the electrical mesh. These operations include broadcast/multicast communication, as well as point-to-point communication between cores that are physically far apart.
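To make the cost of multi-hop communication concrete, the following sketch models ENET distance as Manhattan hops between tile coordinates. The grid dimensions and the simple XY-style routing are our own illustrative assumptions, not a description of the actual ENET router.

```python
# Illustrative sketch (not the actual ATAC ENET router): estimate the hop
# count for a point-to-point message on a 2-D mesh with XY routing.
# The 64x64 grid size below is an assumption chosen for illustration.

def mesh_hops(src, dst):
    """Manhattan distance between two (x, y) tile coordinates."""
    (sx, sy), (dx, dy) = src, dst
    return abs(dx - sx) + abs(dy - sy)

if __name__ == "__main__":
    # Neighboring tiles cost one hop; opposite corners of a 64x64 grid cost 126.
    print(mesh_hops((0, 0), (0, 1)))    # -> 1
    print(mesh_hops((0, 0), (63, 63)))  # -> 126
```

Far-apart cores pay a hop cost proportional to their physical separation, which is exactly the traffic the optical network is intended to absorb.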
The heart of the on-chip optical interconnect is the all-to-all network (or “ANET”). The ANET provides a low-latency, contention-free connection between a set of optical endpoints, as depicted for 64 tiles in the center of Figure 3. This highly interconnected topology is achieved using a set of optical waveguides that visit every endpoint and loop around on themselves to form continuous rings. Further, as illustrated on the right side of Figure 3, each endpoint can place data onto a waveguide using an optical modulator (shown as a yellow circle on each of the waveguides) and receive data from the other endpoints using optical filters and photodetectors (shown as red circles). Because the data waveguide forms a loop, a signal sent from any endpoint quickly reaches all of the other endpoints. Thus, every transmission on the ANET has the potential to be a fast, efficient broadcast. To keep these broadcasts from interfering with one another, the ANET uses wavelength division multiplexing (WDM). The processor cores in the ATAC architecture have a 32-bit word size, making it desirable to send a full 32-bit word on each clock cycle. This is accomplished using a set of parallel waveguides, each carrying one bit. In a baseline ATAC processor there would be 32 waveguides, each transmitting data at the same frequency as the processor core. If chip real estate for optical components is limited (as might be the case when scaling to thousands of cores), serialization can be used to decrease the number of waveguides: the count can be reduced to 16, 8, or even 4, and a 32-bit word is then sent as multiple sub-word transfers.
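The serialization trade-off described above can be sketched as follows: a 32-bit word is split across a smaller number of bit lanes (waveguides) and sent over several cycles. The lane-to-bit mapping here is an assumption chosen for clarity, not a specification of the ATAC modulator layout.

```python
# Minimal sketch of waveguide serialization: a 32-bit word is split across
# `num_waveguides` bit lanes and transmitted over 32/num_waveguides cycles.

WORD_BITS = 32

def serialize(word, num_waveguides):
    """Return one list of bits per cycle; each list drives the parallel waveguides."""
    assert WORD_BITS % num_waveguides == 0
    cycles = WORD_BITS // num_waveguides
    return [[(word >> (c * num_waveguides + lane)) & 1
             for lane in range(num_waveguides)]
            for c in range(cycles)]

def deserialize(frames):
    """Reassemble the 32-bit word from the per-cycle frames."""
    word = 0
    num_waveguides = len(frames[0])
    for c, frame in enumerate(frames):
        for lane, bit in enumerate(frame):
            word |= bit << (c * num_waveguides + lane)
    return word

if __name__ == "__main__":
    w = 0xDEADBEEF
    # 32 waveguides -> 1 cycle; 8 waveguides -> 4 cycles; 4 waveguides -> 8 cycles.
    for n in (32, 8, 4):
        frames = serialize(w, n)
        assert deserialize(frames) == w
        print(f"{n} waveguides -> {len(frames)} cycle(s) per word")
```

Halving the number of waveguides doubles the cycles per word, so the choice is a straightforward trade between optical real estate and per-message latency.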
In addition to the primary data waveguides, there are several special-purpose waveguides. First, there is an optical “power supply” waveguide that provides the light source for the modulators. Second, there is a clock waveguide that senders use to transmit the clock along with the data. Third, there is a backward flow-control waveguide used to throttle a sender when a receiver is overwhelmed. Finally, we are exploring the use of several “metadata” waveguides to indicate a message type (e.g., cache read, cache write, barrier, ping, or raw data) or a message tag (for disambiguating multiple message streams from the same sender).
In the WDM design, all the modulators on a given sender are tuned to transmit at a unique wavelength. To receive data from any sender at any time, each receiving endpoint must contain sets of filters trimmed to the wavelengths of each of the other endpoints’ modulators. Each set of filters feeds into a separate FIFO (first-in, first-out buffer), allowing the data from each sender to be buffered separately. This spares the processor core the extra step of examining each message to determine the sender and find the message it needs. Since not every receiver is interested in every message sent on the network, special-purpose hardware pre-screens messages and forwards only messages of interest to the FIFOs. This avoids the energy cost of buffering and handling unwanted data, allows the FIFOs to be kept smaller, and frees the processing core from sorting through messages in software. Messages can be screened by sender (i.e., wavelength) or by the metadata transmitted with each message. This novel buffering and filtering scheme is an area of active research for this project.
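A hedged software sketch of this receive path is shown below: one FIFO per sender wavelength, with a pre-screen on message metadata before anything is buffered. The message fields, type names, and subscription interface are illustrative assumptions rather than the actual ATAC hardware interface.

```python
# Sketch of per-sender buffering with metadata-based pre-screening.
from collections import deque, namedtuple

Message = namedtuple("Message", ["sender", "msg_type", "tag", "payload"])

class OpticalReceiver:
    def __init__(self, num_senders, interesting_types):
        # One FIFO per sender wavelength keeps each stream separated for the core.
        self.fifos = [deque() for _ in range(num_senders)]
        self.interesting_types = set(interesting_types)

    def on_message(self, msg):
        # Pre-screen on metadata before buffering, saving FIFO space and energy.
        if msg.msg_type in self.interesting_types:
            self.fifos[msg.sender].append(msg)

    def pop(self, sender):
        """Let the core read the next buffered message from a given sender."""
        return self.fifos[sender].popleft() if self.fifos[sender] else None

if __name__ == "__main__":
    rx = OpticalReceiver(num_senders=64, interesting_types={"cache_read", "barrier"})
    rx.on_message(Message(sender=5, msg_type="cache_read", tag=0, payload=0x1234))
    rx.on_message(Message(sender=7, msg_type="ping", tag=0, payload=0))  # screened out
    print(rx.pop(5))   # buffered message from sender 5
    print(rx.pop(7))   # None: the uninteresting message never reached the FIFO
```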
The design of a single ANET optical link scales to approximately 64 endpoints. A 64-core chip (feasible in a 90nm or 65nm CMOS process) could therefore simply place one core at each ANET endpoint. Scaling beyond this point requires some number of cores to share a single optical endpoint; the set of cores sharing one endpoint is referred to as a “region.” For chip designs requiring only a small number of cores in each region, electrical circuits can be used to negotiate access and distribute incoming data, preserving the illusion that all cores are optically connected. As regions grow larger, it is preferable to use a ring-of-rings optical architecture, as illustrated on the left-hand side of Figure 3. In this scheme, an optical network is used within each region, creating a two-level hierarchy: a top-level ANET (ANET1) connects together multiple regional ANETs (ANET0). On each ANET0, every region consists of a single core, and the cores connected by an ANET0 together form one region on the ANET1. Our analysis indicates that this two-level design will scale to 4096 nodes at the 11nm CMOS technology point.
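A back-of-the-envelope sketch of this hierarchy follows, assuming 64 endpoints per ANET and a simple linear core numbering. Both assumptions are ours, made only to illustrate how a global core ID maps onto the two levels.

```python
# Illustrative mapping from a global core ID to its place in the two-level
# ANET hierarchy. The linear numbering is an assumption, not the ATAC layout.

CORES_PER_ANET0 = 64   # each core gets its own ANET0 endpoint
ANET0S_PER_CHIP = 64   # each ANET0 is one region (endpoint) on the ANET1

def locate(core_id):
    """Map a global core ID to (ANET1 endpoint, ANET0 endpoint)."""
    anet1_endpoint = core_id // CORES_PER_ANET0   # which regional network
    anet0_endpoint = core_id % CORES_PER_ANET0    # position within that region
    return anet1_endpoint, anet0_endpoint

if __name__ == "__main__":
    print(CORES_PER_ANET0 * ANET0S_PER_CHIP)  # -> 4096 cores total
    print(locate(0))                          # -> (0, 0)
    print(locate(4095))                       # -> (63, 63)
```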
As described in more detail in a later section, the ANET0 and ANET1 networks are connected by an interface tile using conventional electronics. Our design allows a pair of values to be transmitted simultaneously between the two levels, so at any given instant two broadcasts can span all 4096 cores. In addition, 64 simultaneous broadcasts can take place within each 64-core regional network (ANET0).
Seamless connections to external DRAM or I/O devices are made by replacing some processing cores with memory or I/O controllers. Processing cores access off-chip resources by sending messages to these gateways. Memory cores receive requests on the ATAC network, interpret them electrically, and then send messages to DRAM over a separate waveguide that goes off-chip. We clock this waveguide at 2 GHz, and data is transmitted on it using 64 different wavelengths, sending 64 bits at a time. Replacing 4 cores in each 64-core region with memory controllers yields a memory-bandwidth-to-computation ratio of 1 byte/FLOP (assuming the 2 GHz clock for the off-chip waveguides). A 4096-core chip would require a reasonable 256 memory connections (supplying 4 TB/s of bandwidth) to achieve the same ratio.
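The bandwidth figures quoted above can be checked with a short calculation. The assumption of one FLOP per core per 1 GHz cycle is ours, used only to reproduce the 1 byte/FLOP ratio; the remaining numbers come directly from the text.

```python
# Arithmetic check of the off-chip memory bandwidth figures.

MEM_CLOCK_HZ     = 2e9   # off-chip waveguide clock (from the text)
BITS_PER_CYCLE   = 64    # 64 wavelengths carry 64 bits at a time
CTRLS_PER_REGION = 4     # memory controllers per 64-core region
CORES_PER_REGION = 64
CORE_FLOPS       = 1e9   # assumed: 1 FLOP per core per 1 GHz cycle

bw_per_ctrl = MEM_CLOCK_HZ * BITS_PER_CYCLE / 8          # bytes/s per controller
region_bw   = CTRLS_PER_REGION * bw_per_ctrl             # bytes/s per region
ratio       = region_bw / (CORES_PER_REGION * CORE_FLOPS)

print(f"{bw_per_ctrl / 1e9:.0f} GB/s per memory controller")     # -> 16
print(f"{ratio:.2f} bytes/FLOP")                                  # -> 1.00
print(f"{(4096 // 64) * CTRLS_PER_REGION} memory connections")    # -> 256
print(f"{256 * bw_per_ctrl / 1e12:.1f} TB/s aggregate")           # -> 4.1
```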
However, because the massive on-chip bandwidth of the combined electrical and optical networks encourages communication-centric rather than memory-centric computing, the traditional rule-of-thumb ratio of 1 byte/FLOP is excessive. Communication-centric computing allows processes to exchange values directly rather than storing them in memory as an intermediary. In addition, ATAC’s efficient broadcasts act as DRAM bandwidth multipliers, allowing each value fetched from DRAM to be received by multiple cores. Together, these effects can lower memory bandwidth requirements significantly.
The area required to implement the computational portion of 4096 cores (including L1 and L2 caches) is approximately 400 mm². Using an additional 200 mm² to implement an L3 cache in embedded DRAM allows for over 8 GB of on-chip memory. Because communication-centric computing reduces pressure on all levels of the memory hierarchy, this amount should be sufficient for most application domains. If additional on-chip capacity is desired, 3D integration can be used to stack regular DRAM above each tile, providing even more total “on-chip” memory.