Figure 5.29 Activity of hardware assists over the simulation time


5.5 Related Work on Hardware Assists for DBT


In Section 5.4 we discussed on-the-fly profiling, which is an important part of VM DBT systems both for identifying hotspot code and for assisting with certain optimizations. This section discusses related work on hardware assists for binary translation/optimization.

The Transmeta Efficeon designers implemented an execute instruction that allows native VLIW instructions to be constructed and executed on the fly [83]. This capability was added to improve the performance of the CMS interpreter; two execution units were added at the pipeline backend, but details of their design have not been published. In contrast, we propose special hardware assists that accelerate BBT translation and then save the translated code in a code cache for reuse.

The fill unit [50, 95] is one of the early hardware proposals for runtime translation or optimization. A fill unit is a non-architected, transparent module that constructs a translation unit (typically a trace) and then performs limited optimization. The trace cache in the Intel Pentium 4 processor [58] is an implementation of a similar hardware scheme.

The Instruction Path Coprocessor [25] is a programmable coprocessor that optimizes a core processor’s instructions to improve execution efficiency. The coprocessor has been demonstrated to perform several common dynamic optimizations such as trace formation, trace scheduling, register-move optimization, and prefetching. A later proposal, a PipeRench-based implementation of the instruction path coprocessor [26], reduces the complexity of the coprocessor design.

Dynamic strands [110] employ a hardware engine to form ILDP strands [77] for a strand-aware execution engine. CCG [27] and several other research proposals [2, 104, 114] also assume a hardware module that performs runtime code optimization.

The rePLay [104] and PARROT [2] projects employ a hardware hotspot detector to find program hotspots. Once a hotspot is detected, it is optimized by hardware and stored in a small on-chip frame/trace cache for optimized hotspot execution. The hardware optimizer is a coprocessor-like module located after the retirement stage of the main processor pipeline. As hotspot instructions are detected and collected in a hotspot buffer, the optimizer executes a special program, perhaps stored in an on-chip ROM similar to micro-code stores [58, 74]. This special program is written in the optimizer’s small instruction set, which consists of highly specialized operations such as pattern matching and dependence-edge tracking.

In our research, we explore hardware assists that are integrated into the co-designed processor pipeline. This requires simpler hardware than a full-blown coprocessor or hardware optimizer, yet the software translator retains the full programmability of the main processor’s ISA. The translator’s runtime overhead is mitigated by the combination of strategies discussed in Chapter 3 and the simple hardware assists discussed in this chapter. Furthermore, in our VM system, translated code is held in a large main-memory code cache.

Chapter 6

Putting It All Together: A Co-Designed x86 VM


In the previous three chapters, we explored and achieved promising results for some major components of a co-designed VM system, namely, a DBT translation strategy, translation software algorithms, and translation hardware assists. However, individual component designs do not necessarily combine to produce an efficient overall system design. The most convincing demonstration of a processor design paradigm is a complete, integrated processor design that provides superior performance and efficiency.

In this chapter, I explore system-level design through the synergistic integration of the technologies explored in the previous chapters. The detailed example design must tackle many thorny challenges faced by contemporary processors. Section 6.1 discusses high-level trade-offs for the co-designed x86 processor architecture. Section 6.2 details the microarchitecture of the macro-op execution engine. Section 6.3 evaluates the integrated, hardware/software co-designed x86 processor within the limits of our x86vm framework. Section 6.4 compares this design with related real-world processor designs and research proposals.


6.1 Processor Architecture


There are a few especially difficult issues that pose challenges to future processor designs. First, processor efficiency (performance per transistor) is decreasing. Although performance is improving, the improvements are not proportional to the amount of hardware resources and complexity invested. Additionally, the complexity of the design process itself is increasing; time-to-market tends to be longer today than it was a decade ago.

Second, power consumption has become critical. Power consumption not only affects energy efficiency but also many other cost aspects of system design, for example, the extra design complexity of the power distribution system, cooling, and packaging. High power consumption also leads to thermal issues that affect system reliability and lifetime.

Third, legacy features are making processor designs less efficient. As the x86 instruction set has become the de facto standard ISA for binary software distribution, efficient x86 designs are becoming more important. Moreover, as with most long-lived standard interfaces, the x86 ISA contains legacy features that complicate the hardware design if implemented directly in hardware. Many of these legacy features are rarely used in modern software distributions.

The primary objective for the example co-designed x86 processor presented here is an efficient design that delivers higher performance at lower complexity. The overall design should tackle the design challenges listed above. Furthermore, for practical reasons, it is especially valuable for a design team to consider a design that carries minimal cost and risk for an initial attempt at a new implementation paradigm.

Clearly, an enhanced and more efficient dynamic superscalar pipeline design is an attractive direction to explore. On one hand, dynamic superscalar is the best-performing and most widely used microarchitecture for general-purpose computing. On the other hand, dynamic superscalar processors are challenged by bottlenecks in several pipeline stages: for example, the fetch mechanism, the issue logic, and the execution/result-forwarding datapath.

I propose and explore the macro-op execution microarchitecture (detailed in Section 6.2) as an efficient execution engine. It simplifies and streamlines the critical pipeline stages listed in the preceding paragraph. It is not only an enhanced but also a simplified dynamic superscalar microarchitecture, one that implements the fusible ISA for efficiency. Because it is a simplified superscalar pipeline, its hardware design is quite similar to current processor designs. The remaining major processor design trade-off then centers on the dynamic binary translation that maps the x86 ISA onto the fusible ISA.
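
To make the fusing concept concrete, the C sketch below is an illustrative toy only: the struct fields, register numbers, and adjacent-pair-only policy are assumptions for exposition, not the fusible ISA's actual encoding or the fusing algorithm of Chapter 4. It marks an adjacent dependent pair of micro-ops as the head and tail of one macro-op.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative micro-op record: one destination and up to two source
 * registers. The field layout is an assumption for this sketch, not the
 * fusible ISA's actual format. */
typedef struct {
    const char *op;
    int dst, src1, src2;   /* register numbers; -1 means unused */
    bool head, tail;       /* the two halves of a fused macro-op */
} MicroOp;

/* Greedy adjacent-pair fusing: if the next micro-op consumes the current
 * micro-op's result, mark the pair as one macro-op so later pipeline
 * stages can treat it as a single unit. The real heuristics (Chapter 4)
 * impose further restrictions, e.g. on which operations may be heads. */
static void fuse_adjacent(MicroOp *u, int n) {
    for (int i = 0; i + 1 < n; i++) {
        if (u[i].head || u[i].tail || u[i + 1].head || u[i + 1].tail)
            continue;
        if (u[i + 1].src1 == u[i].dst || u[i + 1].src2 == u[i].dst) {
            u[i].head = true;
            u[i + 1].tail = true;
        }
    }
}

int main(void) {
    /* A cracked three-micro-op sequence with one dependent ALU pair. */
    MicroOp u[] = {
        { "add",   1, 2, 3, false, false },  /* r1 <- r2 + r3 */
        { "xor",   4, 1, 5, false, false },  /* r4 <- r1 ^ r5 */
        { "store", -1, 4, 6, false, false }, /* mem[r6] <- r4 */
    };
    int n = (int)(sizeof u / sizeof u[0]);
    fuse_adjacent(u, n);
    for (int i = 0; i < n; i++)
        printf("%-6s %s\n", u[i].op,
               u[i].head ? "(macro-op head)" : u[i].tail ? "(macro-op tail)" : "");
    return 0;
}

Here the add and the dependent xor become one macro-op, while the store stays a singleton because its producer has already been consumed as a tail.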






Figure 6.1 HW/SW Co-designed DBT Complexity/Overhead Trade-off

Figure 6.1 illustrates the trade-off between overall system complexity and DBT runtime overhead. A state-of-the-art VM design uses the software solution for DBT to emphasize simplicity, which suits CPU-intensive workloads. A conventional high-performance x86 processor design might select the hardware solution to avoid the bad-case scenarios mentioned in Chapter 1.
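
One rough way to reason about this trade-off is a simple amortization model, sketched below in C. The cycle counts are hypothetical placeholders, not measurements from this thesis: the overhead observed per executed instruction is the one-time translation cost divided by how many times the translated code is reused, so hardware assists shrink the cost term while code caching and hotspot reuse raise the divisor.

#include <stdio.h>

/* Back-of-envelope DBT overhead model with illustrative numbers: the cost
 * of translating one x86 instruction is amortized over every subsequent
 * execution of its translated image in the code cache. */
static double overhead_per_exec(double translate_cycles, double reuse) {
    return translate_cycles / reuse;
}

int main(void) {
    double software_bbt = 1000.0;  /* hypothetical software-only translation cost */
    double hw_assisted  = 100.0;   /* hypothetical cost with hardware assists      */

    for (double reuse = 10.0; reuse <= 100000.0; reuse *= 10.0)
        printf("reuse %8.0f: software %7.2f  hw-assisted %7.2f cycles/exec\n",
               reuse,
               overhead_per_exec(software_bbt, reuse),
               overhead_per_exec(hw_assisted, reuse));
    return 0;
}

At low reuse counts (program startup, the focus of Chapter 5) the gap between the two columns dominates; at high reuse counts both overheads become negligible, which is why heavyweight software optimization is reserved for hotspots.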

The preceding chapters have developed the key elements of a hardware/software co-designed DBT system. In fact, there are multiple possible hardware-assisted DBT designs, and this is reflected in Figure 6.1. For example, Chapter 5 proposed and evaluated two design points: dual-mode decoders at the pipeline front-end and special functional unit(s) at the pipeline backend. Either can provide competitive startup performance, as illustrated in Chapter 5.

For the example x86 processor design, we choose the dual-mode decoders because they provide startup performance that is very competitive with conventional x86 processor designs. Furthermore, they provide a smoother transition from conventional designs to the VM paradigm. The downside is that this design point does not remove as much hardware complexity as the backend functional unit(s) solution. However, it provides the same hotspot optimization capability as the other VM design points, and these runtime optimizations would be very expensive if implemented in hardware.

Because the dual-mode decoders support both x86 and fusible ISA instructions, the macro-op execution pipeline is conceptually different in x86 mode and macro-op mode (Figure 6.2). The difference is due to the extra x86 vertical decode stages, which first identify individual x86 instructions, decode them, and crack them into fusible ISA micro-ops. The VMM runtime controls the switch between the two modes. Initially, all x86 software runs through the x86-mode pipeline. Once hotspots are detected, the VMM transfers control to the optimized macro-op code for efficiency.






Figure 6.2 Macro-op execution pipeline modes: x86 mode and macro-op mode
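
The toy C sketch below models this dual-mode control flow; the region granularity, counter placement, and threshold value are illustrative assumptions rather than the actual VMM mechanisms described in earlier chapters. Guest code starts on the x86-mode pipeline, profile counters accumulate, and once a region becomes hot its subsequent executions are redirected to translated macro-op code.

#include <stdio.h>

#define NREGIONS      4
#define HOT_THRESHOLD 3   /* tiny illustrative threshold so the demo flips modes quickly */

static int profile[NREGIONS];     /* execution counts gathered in x86 mode        */
static int translated[NREGIONS];  /* nonzero once optimized macro-op code exists  */

/* Execute one guest region, choosing the pipeline mode the way the VMM would. */
static void run_region(int r) {
    if (translated[r]) {
        printf("region %d: macro-op mode (optimized code cache)\n", r);
    } else {
        printf("region %d: x86 mode (dual-mode decoders)\n", r);
        if (++profile[r] >= HOT_THRESHOLD) {
            translated[r] = 1;  /* VMM translates and optimizes the hotspot */
            printf("region %d: hotspot detected, translated to macro-ops\n", r);
        }
    }
}

int main(void) {
    /* A loopy guest: region 1 dominates and becomes hot; region 3 never does. */
    int trace[] = {0, 1, 1, 1, 2, 1, 1, 3, 1, 1};
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        run_region(trace[i]);
    return 0;
}

Only region 1 ever crosses the threshold, so only it pays for translation and optimization; the rest of the guest keeps running on the x86-mode pipeline, mirroring the staged strategy described above.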

The x86-mode pipeline is very similar to that of current high-performance x86 processors, except that the instruction scheduler is also pipelined for higher clock speed and reduced scheduler complexity. Compared with the macro-op mode pipeline, the additional x86-mode pipeline stages consume extra power for the x86 “vertical” decode and add penalty cycles to branch mispredictions.

The macro-op mode pipeline is an enhanced dynamic superscalar processor designed for performance and efficiency. The VMM runtime software and the code cache, which holds translated and optimized macro-op code for hotspots, occupy and conceal a small amount of physical memory from the rest of the system. However, because the dual-mode co-designed processor does not need to handle startup translations, there is no need for a BBT code cache, which would be much larger than the hotspot code cache. The details of the macro-op execution pipeline are expanded upon in the next section.


