The front-end dual mode decoder described in the previous subsection modifies a critical part of the processor pipeline and must be able to decode instructions at full bandwidth. Furthermore, it must be designed to implement the complete architected ISA (x86). However, dynamic binary translation software may fit better with flexible, programmable and more complexity-effective hardware assists.
An alternative approach is to implement the hardware assist in the form of a programmable functional unit at the pipeline back-end. A functional unit at the execution stage is less intrusive than the front-end dual mode decoder, does not need to provide the high bandwidth of the front-end decoders, and can target only the common cases, not all cases.
As previously illustrated, during initial emulation, BBT introduces the major runtime overhead, and the dominant part of BBT is to decode and crack x86 instructions into micro-ops. According to our measurements, on average, about 90 out of the 106 µops overhead for translating each x86 instruction in our BBT system is related to instruction decoding and cracking. Therefore, a functional unit that performs these operations should significantly speed up BBT.
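As a rough check on this claim, the measured breakdown bounds the achievable gain. The sketch below applies an Amdahl's-law style estimate to the 90-of-106 µop figures quoted above. The function name `bbt_speedup` and the residual "assisted" costs are illustrative assumptions, not measurements from this work, and µop counts are used only as a proxy for time.

```python
def bbt_speedup(total_uops: float, decode_uops: float, assisted_uops: float) -> float:
    """Speedup of the per-instruction BBT cost when the decode/crack portion
    shrinks from decode_uops to assisted_uops (all values in micro-ops)."""
    if not 0 <= decode_uops <= total_uops:
        raise ValueError("decode_uops must lie within total_uops")
    if assisted_uops < 0:
        raise ValueError("assisted_uops must be non-negative")
    remaining = total_uops - decode_uops + assisted_uops
    return total_uops / remaining


if __name__ == "__main__":
    # 106 and 90 come from the text; the assisted costs are hypothetical.
    for assisted in (0, 5, 10):
        speedup = bbt_speedup(106, 90, assisted)
        print(f"assisted cost {assisted:2d} uops -> {speedup:.2f}x")
```

Under these assumptions, even a free decode/crack step caps the BBT speedup at 106/16, or about 6.6x, because the remaining 16 µops of bookkeeping are untouched.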
Table 5.3 Hardware Accelerator: XLTx86
New Instruction | Brief Description
XLTx86 Fsrc, Fdst | Decode an x86 instruction aligned at the beginning of the 128-bit Fsrc register, and generate RISC-style 16b/32b uops into the Fdst register. This instruction affects the CSR status register.
HAloop:
    LD      Fsrc, [Rx86pc]
    XLTx86  Fdst, Fsrc
    Jcpx    complex_x86code
    Jcti    branch_handler
    ST      Fdst, [Rcode$]
    MOV     Rt0, CSR
    AND     Rt1, Rt0, 0x0f :: ADD Rx86pc, Rt1
    AND.x   Rt2, Rt0, 0xf0 :: ADD Rcode$, Rt2
    JMP     HAloop
(a). Code for the HW assisted fast BBT loop
(b). The control and status register (CSR) format for XLTx86
Figure 5.24 HW accelerated basic block translator kernel loop
We propose such a backend functional unit that is accessed through a new instruction in the implementation ISA. Table 5.1 briefly describes the new instruction: XLTx86. XLTx86 accesses the 128-bit F registers that are architected for mapping the x86 FP/media states. Additionally, XLTx86 operates on a special flag status register CSR that is explained below.
Figure 5.3a illustrates the kernel loop used by the VMM for hardware accelerated BBT (in the implementation ISA assembly language). Rx86pc is an implementation register that holds the architected x86 PC value; this register points to an instruction in the x86 instruction memory.
More specifically, x86 instructions are fetched by a load operation into register Fsrc. Because x86 instructions are from one byte to seventeen bytes long (and very few are more than eleven bytes in real code), the 16-byte Fsrc register almost always holds at least one complete x86 instruction. The fetched x86 instruction is aligned at the beginning of the Fsrc register. The next instruction, XLTx86, then decodes and cracks the x86 instruction into uop(s). The input to XLTx86 is the Fsrc register. The output uops are placed in the Fdst register, and flags are set in the CSR status register. The format of the flag status register CSR is shown in Figure 5.3b. The 4-bit x86_ilen field returns the length of the x86 instruction. The 4-bit uops_bytes field returns the length of the generated uop(s) in the implementation ISA.
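The CSR fields described here can be sketched as a small decoder. The two masks, 0x0f for x86_ilen and 0xf0 for uops_bytes, are taken from the kernel loop. Everything else is an assumption made for illustration: the bit positions of Flag_cmplx and Flag_cti, the right shift applied to uops_bytes, and the names `Csr` and `decode_csr`.

```python
from typing import NamedTuple

# Masks taken from the kernel loop (AND ... 0x0f and AND.x ... 0xf0).
X86_ILEN_MASK = 0x0F
UOPS_BYTES_MASK = 0xF0
UOPS_BYTES_SHIFT = 4      # assumed: the field is shifted down before use
FLAG_CMPLX = 1 << 8       # assumed bit position
FLAG_CTI = 1 << 9         # assumed bit position


class Csr(NamedTuple):
    x86_ilen: int     # length of the decoded x86 instruction in bytes
    uops_bytes: int   # length of the generated uops in bytes
    cmplx: bool       # instruction too complex for the hardware decoder
    cti: bool         # instruction is a control transfer


def decode_csr(value: int) -> Csr:
    """Split a raw CSR value into its fields under the assumed layout."""
    return Csr(
        x86_ilen=value & X86_ILEN_MASK,
        uops_bytes=(value & UOPS_BYTES_MASK) >> UOPS_BYTES_SHIFT,
        cmplx=bool(value & FLAG_CMPLX),
        cti=bool(value & FLAG_CTI),
    )


if __name__ == "__main__":
    print(decode_csr(0x263))
```

Note that a 4-bit field can encode values only up to 15, so under this layout a result that fills all 16 bytes of Fdst would need either a different encoding or the complex-instruction escape.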
The Flag_cmplx bit is set if the x86 instruction being decoded is too complex for the hardware decoder. This escape mechanism keeps the hardware assist simple and fast by off-loading the complicated cases to software, for example when the x86 instruction is longer than 16 bytes (the size of the Fsrc register). The Flag_cti bit is set if the x86 instruction being processed is a control transfer instruction. After decoding, most x86 instructions are cracked into uops that occupy no more than 16 bytes. Note that the 16-bit/32-bit fusible implementation ISA design implies that the 128-bit Fdst is too short to hold the resulting uops only in a few rare cases; this is another case that is flagged as a complex instruction. Native uops in Fdst are written back to the code cache by a store operation. The rest of the loop does bookkeeping.
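The control flow of the kernel loop can be modeled in software as follows. This is a minimal sketch, not the VMM's implementation: `toy_xltx86` is a hypothetical stand-in for the XLTx86 functional unit that recognizes only two one-byte x86 opcodes, and the uop bytes it emits are placeholders rather than real implementation-ISA encodings.

```python
from typing import Callable, Tuple

# (generated uop bytes, x86 instruction length, Flag_cmplx, Flag_cti)
XltResult = Tuple[bytes, int, bool, bool]


def toy_xltx86(fsrc: bytes) -> XltResult:
    """Hypothetical stand-in for XLTx86; handles two one-byte opcodes."""
    opcode = fsrc[0]
    if opcode == 0x90:                        # x86 NOP
        return b"\x00\x00", 1, False, False   # placeholder 16-bit uop
    if opcode == 0xC3:                        # x86 near RET, a control transfer
        return b"", 1, False, True
    return b"", 0, True, False                # all other cases escape to software


def translate_block(
    x86_mem: bytes, pc: int, xlt: Callable[[bytes], XltResult] = toy_xltx86
) -> Tuple[bytes, int, str]:
    """Run the fast loop until an instruction needs a software handler.

    Returns the uops written so far, the x86 PC of the instruction that
    ended the loop, and which handler would take over.
    """
    code_cache = bytearray()
    while True:
        fsrc = x86_mem[pc:pc + 16].ljust(16, b"\x00")   # LD Fsrc, [Rx86pc]
        uops, ilen, cmplx, cti = xlt(fsrc)              # XLTx86
        if cmplx:                                       # Jcpx complex_x86code
            return bytes(code_cache), pc, "complex_x86code"
        if cti:                                         # Jcti branch_handler
            return bytes(code_cache), pc, "branch_handler"
        code_cache += uops                              # ST Fdst, [Rcode$]
        pc += ilen                                      # ADD Rx86pc, x86_ilen
        # ADD Rcode$, uops_bytes is implied by appending to the bytearray.


if __name__ == "__main__":
    print(translate_block(bytes([0x90, 0x90, 0xC3]), 0))
```

As in the listing, both exits are taken before the store, so the uops for a control transfer are left to the branch handler rather than written by the fast loop.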
For architected state mapping, the CSR register can be mapped either to the same FP/media status register for x86 SIMD instructions, or to a separate implementation status register. Fsrc and Fdst are mapped to FP/media temporary registers F24 through F31.
For microarchitecture design, the new functional unit is located in the FP/media part of the processor core because it uses F registers to hold long x86 instructions and multiple uops (Figure 5.4). If implemented in superscalar-style microarchitectures such as macro-op execution, the XLTx86 instruction would be dispatched to the FP/media instruction queue(s) and issued to the new functional unit via an FP/media issue port. XLTx86 can take multiple cycles to execute, as do many other FP/media instructions. In our research, we assume XLTx86 takes four cycles. The x86 instruction bytes are supplied to the functional unit via a streaming buffer, and the generated uops are written back to memory directly without going through the data cache.
For circuit design, the functional unit for XLTx86 is essentially a simplified, one-instruction-wide x86 decoder relocated to the execution stage of the FP/media core. For cost effectiveness, XLTx86 only needs to handle simple common cases. Frequent x86 instructions are handled by the decoding functional unit; complicated and rare x86 instructions set the Flag_cmplx flag in the CSR register to escape to VM software handling.
Figure 5.25 Hardware Accelerator microarchitecture design
The new instruction, XLTx86, speeds up BBT by accelerating its dominant part, the fetch, decode and crack of x86 instructions, from tens of cycles for a software-only translator to only a few cycles with the hardware assist. Meanwhile, because it provides a primitive operation from the translator's perspective, it offers the VMM flexibility and programmability beyond what the dual mode decoders embedded in the pipeline front-end can offer.
Compared with dual mode decoders, the functional unit performs the BBT translation only once for each basic block, as long as the translation is not replaced. Although the BBT translation is still an extra overhead, the generated native code does not invoke further complex CISC decoding. Therefore, for non-hotspot emulation, the translations generated by BBT will likely perform similarly to the x86 mode enabled by the dual mode decoders. The BBT translation overhead, although significantly reduced by the XLTx86 assist, will likely still be visible when bursts of translations occur.
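The translate-once argument can be made concrete with a simple amortization estimate. In the sketch below, the 106 µop software-only cost comes from the measurements quoted earlier. The assisted cost of 20 µops and the execution counts are hypothetical values chosen only to show the trend, and `amortized_overhead` is an illustrative name.

```python
def amortized_overhead(translate_cost: float, executions: int) -> float:
    """One-time translation cost per x86 instruction, spread over the
    number of times its translation runs before being replaced."""
    if executions < 1:
        raise ValueError("a translated block executes at least once")
    return translate_cost / executions


if __name__ == "__main__":
    SOFTWARE_ONLY = 106.0   # uops per x86 instruction, from the text
    ASSISTED = 20.0         # hypothetical cost with the XLTx86 assist
    for runs in (1, 10, 100):
        sw = amortized_overhead(SOFTWARE_ONLY, runs)
        hw = amortized_overhead(ASSISTED, runs)
        print(f"{runs:4d} executions: software {sw:7.2f}, assisted {hw:6.2f} uops")
```

The overhead matters most for code that runs only a few times after translation, which is why bursts of newly translated code remain visible even with the assist.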
The hardware complexity of the backend decoding functional unit is also lower than that of the front-end dual mode decoders. As noted above, the XLTx86 instruction handles only the frequent common cases; complex and rare cases are handled by software. Furthermore, just one such functional unit can achieve most of the translation performance boost, and a backend functional unit has a very localized impact on the processor pipeline design.
The decoders for CISC instructions are power-hungry circuits [54], and it is the complex first-level decoding logic that consumes most of the power. In conventional x86 processors, these decoders need to turn on both levels and consume power whenever x86 instructions are fetched from the memory hierarchy. In contrast, dual mode decoders save energy by turning off the complex first-level decode stage when native code is executing during steady state. The decoder as a backend functional unit, on the other hand, only consumes power when a new piece of code is executed for the first time. Hence, the backend functional unit has energy implications similar to those of software-only VM systems.