University of Wisconsin-Madison


Chapter 5 Hardware Accelerators for x86 Binary Translation






As discussed in Chapter 3, the x86vm translation strategy includes new primitives and assists that accelerate the critical parts of the VM runtime software, especially BBT and hotspot detection. In this chapter, we propose two new hardware accelerators for BBT, one at the pipeline front-end (Section 5.1) and the other at the pipeline backend (Section 5.2). Hardware assists for hotspot detection and profiling are described in related work. In Section 5.3, we discuss how these assists help co-designed VM systems in particular. The proposed hardware assists are evaluated in Section 5.4. Section 5.5 surveys related work.

5.1 Dual-mode x86 Decoder


In conventional x86 processor designs, x86 instructions are decoded into RISC-style operations called micro-ops (uops). Although this hardware translation cannot match the translation and optimizations we propose for hotspots, it performs the simple translation that is sufficient for program startup phases.

In contrast, most state-of-the-art co-designed VM systems run a slow software interpreter or a simple BBT translator to get through program startup phases, and then count on optimized hotspot code to compensate for the startup performance loss. This extra software runtime overhead may not always be paid back (at least for some of the scenarios discussed in Chapter 1).

We propose a method that combines the advantages of both types of systems: the solid startup performance of conventional x86 processors and the flexible, advanced software translation used for hotspot performance optimization. The key is a mechanism that seamlessly combines these two execution modes (dual mode).

The key to the method lies in the decoder stage. For a macro-op execution pipeline, fusible ISA instructions can be decoded by simple RISC-style decoders. However, for x86 (CISC) processor pipelines, the x86 instructions go through a two-level decoding process. The first level decoder identifies x86 instruction boundaries and cracks x86 instructions into “vertical” RISC-style micro-ops [119]. Then, a second level decoder generates the “horizontal” decoded signals and controls used by the backend pipeline. The second level decoder is in fact very similar to the RISC-style decoders in our macro-op execution microarchitecture. A two-level decoder is especially suited to a CISC ISA because complex CISC instructions must both be decomposed (cracked) into RISC-style micro-ops and be decoded into pipeline control signals.
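The two-level decoding process described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `MicroOp` fields, the textual instruction syntax, the `tmp` register name, and the control-signal names are all hypothetical and are not taken from the x86vm design or from any real decoder.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MicroOp:
    """A 'vertical' RISC-style micro-op with compact, encoded fields (hypothetical format)."""
    opcode: str
    dst: str
    src: str


def crack(x86_insn: str) -> List[MicroOp]:
    """First level: crack one x86-style instruction into micro-ops.

    Assumes a toy 'mnemonic dst, src' syntax; real boundary detection
    on variable-length x86 encodings is far more involved.
    """
    mnemonic, operands = x86_insn.split(maxsplit=1)
    dst, src = [s.strip() for s in operands.split(",")]
    if src.startswith("["):
        # A memory source operand is split into a load plus an ALU op.
        return [MicroOp("load", "tmp", src), MicroOp(mnemonic, dst, "tmp")]
    return [MicroOp(mnemonic, dst, src)]


def decode_controls(uop: MicroOp) -> Dict[str, object]:
    """Second level: expand a micro-op into 'horizontal' control signals."""
    return {
        "use_alu": uop.opcode in ("add", "sub"),
        "use_mem_port": uop.opcode == "load",
        "write_reg": uop.dst,
        "read_src": uop.src,
    }


if __name__ == "__main__":
    for uop in crack("add eax, [ebx]"):
        print(uop, decode_controls(uop))
```

In this model the first level does the CISC-specific work (decomposition), while the second level is the same kind of decoder a RISC pipeline would need, which is the property the dual-mode design relies on.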






Figure 5.1 Dual mode x86 decoder

The two-level decoder was first published for the microcode engine used in the Motorola 68000 and has been deployed by most modern CISC processors. Our macro-op pipeline leverages this two-level decoding approach and employs a dual-mode (two-level) decoder (Figure 5.1) that targets CISC ISAs in particular. The first level of the dual-mode decoder identifies x86 instruction boundaries and cracks x86 instructions into “vertical” RISC-style micro-ops. However, the dual-mode aspect dictates that these RISC micro-ops use the same 16-bit/32-bit micro-op format as the fusible ISA. The second level of the dual-mode decoder then generates conventional “horizontal” decoded control signals. To this structure, we add a bypass path (Figure 5.1) around the first-level decoder, which enables the decoder to operate in two modes (x86 and fusible ISA).

The two modes of the dual-mode decoder are named x86-mode and native-mode. In x86-mode, x86 instructions are fetched from memory, and both decode levels are used. In native-mode, translated hotspot instructions in the implementation (fusible) ISA are fetched from the code cache. These fusible ISA instructions bypass the first-level decoder and go through only the second-level decoder. With dual-mode decoders, both architected ISA (x86) code and implementation ISA instructions can be processed by the same pipeline. The ability to support x86-mode eliminates the need for BBT, along with its translation overhead and any side effects on the memory hierarchy.
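The bypass path can be pictured with a minimal model of the two modes. The sketch below assumes a toy textual instruction syntax and uses hypothetical stand-ins (`first_level`, `second_level`, `Mode`) for the two decode levels; it shows only the routing, not any real encoding.

```python
from enum import Enum


class Mode(Enum):
    X86 = "x86-mode"
    NATIVE = "native-mode"


def first_level(x86_insn):
    """Hypothetical stand-in for cracking: returns micro-ops as strings."""
    mnemonic, operands = x86_insn.split(maxsplit=1)
    dst, src = [s.strip() for s in operands.split(",")]
    if src.startswith("["):
        return [f"load tmp, {src}", f"{mnemonic} {dst}, tmp"]
    return [f"{mnemonic} {dst}, {src}"]


def second_level(uop):
    """Hypothetical stand-in for generating pipeline control signals."""
    return {"uop": uop, "decoded": True}


def dual_mode_decode(insn, mode):
    """Route an instruction through one or both decode levels."""
    # In native-mode the instruction is already in micro-op format,
    # so it takes the bypass path around the first level.
    uops = first_level(insn) if mode is Mode.X86 else [insn]
    return [second_level(u) for u in uops]


if __name__ == "__main__":
    print(dual_mode_decode("add eax, [ebx]", Mode.X86))
    print(dual_mode_decode("add eax, tmp", Mode.NATIVE))
```

Because both modes converge on the same second level, everything downstream of the decoder sees a single micro-op format regardless of where the instruction came from.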






Figure 5.2 Dual mode x86 decoders in a superscalar pipeline

As the processor runs, it switches back and forth between x86-mode and native-mode under the control of VMM software. The x86-mode is entered whenever a piece of x86 code has no translation in the code cache, as is the case when a program starts up. As the program runs, some parts become hotspots. Once a hotspot has been translated and optimized, the VMM software switches to native-mode to take advantage of its efficiency and performance.
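The mode-switching policy above amounts to a lookup: if a translation exists, run it in native-mode; otherwise run the original code in x86-mode. A minimal sketch follows, assuming the code cache can be modeled as a dictionary from x86 addresses to code cache addresses; the function names and the addresses are hypothetical.

```python
def select_mode(pc, code_cache):
    """Return (mode, fetch_address) for the next fetch.

    code_cache is a hypothetical map: x86 PC -> address of the
    translated hotspot code in the code cache.
    """
    if pc in code_cache:
        return ("native-mode", code_cache[pc])
    return ("x86-mode", pc)


def install_translation(code_cache, x86_pc, native_addr):
    """Model the VMM placing an optimized superblock in the code cache."""
    code_cache[x86_pc] = native_addr


if __name__ == "__main__":
    cache = {}
    print(select_mode(0x401000, cache))   # no translation yet
    install_translation(cache, 0x401000, 0x9000)
    print(select_mode(0x401000, cache))   # translated hotspot
```

At startup the dictionary is empty, so every fetch falls into x86-mode, which matches the behavior described for program startup.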

When executing in x86-mode, instructions pass through both decode levels (Figure 5.2). In this case, the dual-mode decoders generate micro-ops with a code quality similar to that of conventional x86 decoders. Furthermore, the macro-op execution pipeline is an enhanced superscalar that processes non-fused micro-ops in much the same way as a conventional superscalar, except that the scheduler is pipelined. The pipelined scheduler loses the back-to-back issue capability for dependent micro-ops that are not fused; however, it also enables higher clock speeds. Therefore, in x86-mode, performance will be similar to that of a conventional superscalar x86 processor.

When executing in native-mode, fused RISC-style micro-ops pass through only the second level of the decoder (Figure 5.2), which shortens the pipeline front-end and therefore reduces the branch misprediction penalty. The complex first level of the dual-mode decoder can be turned off.

A side effect of using the dual-mode approach is that profiling software cannot be embedded into BBT-generated code, because there is no BBT code. As a consequence, the design should employ profiling hardware similar to that used by Merten et al. [96, 98]. This hardware’s sole function is to detect hotspots. When a hotspot is detected, the hardware invokes the VMM software, which can then organize the hotspot code into superblock(s), translate and optimize it, and place the optimized superblock(s) in the code cache.
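The detect-then-invoke flow can be sketched with a simple counter-based model. Note that this is not the mechanism of Merten et al.; it is a deliberately simplified stand-in that assumes one execution counter per branch target and a fixed threshold. The class name, the threshold value, and the callback are all hypothetical.

```python
from collections import Counter


class HotspotDetector:
    """Toy model of profiling hardware whose only job is to detect hotspots."""

    def __init__(self, threshold, on_hotspot):
        self.threshold = threshold      # hypothetical execution-count threshold
        self.on_hotspot = on_hotspot    # stands in for invoking the VMM software
        self.counts = Counter()
        self.reported = set()

    def observe(self, branch_target):
        """Record one execution of a branch target; report it once when hot."""
        self.counts[branch_target] += 1
        if (self.counts[branch_target] >= self.threshold
                and branch_target not in self.reported):
            self.reported.add(branch_target)
            self.on_hotspot(branch_target)


if __name__ == "__main__":
    detector = HotspotDetector(threshold=3, on_hotspot=lambda pc: print(hex(pc)))
    for pc in [0x400, 0x500, 0x400, 0x400, 0x400]:
        detector.observe(pc)
```

In the design described above, the callback would correspond to the VMM forming superblocks, translating and optimizing them, and placing the result in the code cache.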

Dual-mode decoders are fast and fit well in a conventional superscalar design. Replacing a single-level decode table with a two-level decoder is a good hardware tradeoff that results in fewer transistors, as explained by the Motorola 68000 designers [119]. When extended to dual-mode operation, this approach adds relatively little hardware to a conventional two-level CISC decoder implementation: only the bypass path around the first-level decoder.

