The mapping from an architected ISA to an implementation ISA is performed by either hardware or software in real processor designs.
Both Intel and AMD x86 processors [37, 51, 53, 58, 74] translate from the x86 instruction set to the internal RISC-style micro-ops (implementation ISA instructions) via hardware decoders. As already pointed out, the advantage of hardware decoders is very fast startup performance. The disadvantage is extra hardware complexity at the pipeline front-end and limited capability for translation/optimization due to context-free decoders. Regarding native code quality, it has been observed that suboptimal internal code [114] is a major issue for these hardware-intensive approaches.
Transmeta x86 processors, from the early Crusoe [54, 82] to the later Efficeon [83, 122], perform ISA mapping using a dynamic binary translation system called CMS (Code Morphing Software). This software translation system eliminates the x86 hardware decoding circuits that would otherwise run continuously. CMS uses a staged, adaptive translation strategy so that the optimization effort spent on each part of the program code matches how frequently that code executes. It performs runtime hotspot optimization cost-effectively and with more integrated intelligence. Although there is no published data about CMS runtime translation overhead, it is projected to be quite significant for benchmarks or workloads such as Windows applications [15, 82, 83]. Transmeta Efficeon processors also introduced some hardware assists [83] for the CMS interpreter, but the details are not published.
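The staged, adaptive idea can be illustrated with a minimal sketch. This is not the CMS implementation, whose internals are unpublished; the class name `StagedTranslator`, the stage labels, and the threshold value are all assumptions made for illustration. A block is interpreted while it is cold, translated and optimized once its execution count reaches a threshold, and run from the code cache afterwards.

```python
from collections import defaultdict

# Hypothetical threshold chosen only for illustration; real systems tune this.
HOT_THRESHOLD = 3


class StagedTranslator:
    """Toy model of a staged translation policy (not the actual CMS design)."""

    def __init__(self, hot_threshold=HOT_THRESHOLD):
        self.hot_threshold = hot_threshold
        self.exec_counts = defaultdict(int)  # block address -> times interpreted
        self.code_cache = {}                 # block address -> translated code

    def execute(self, block_addr):
        """Return the stage used for this execution of the block."""
        if block_addr in self.code_cache:
            return "native"                  # run previously translated code
        self.exec_counts[block_addr] += 1
        if self.exec_counts[block_addr] >= self.hot_threshold:
            # The block became hot: pay the translation cost once.
            self.code_cache[block_addr] = f"optimized<{block_addr:#x}>"
            return "translate"
        return "interpret"                   # cold code: cheap, slow emulation


translator = StagedTranslator()
stages = [translator.execute(0x400) for _ in range(5)]
print(stages)
```

The design choice being illustrated is that translation cost is paid only for code that has demonstrated reuse, which is why startup behavior depends heavily on how the threshold and the cold-code stage are chosen.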
There are also prior research efforts in the co-designed VM paradigm. The IBM co-designed VMs DAISY [41] and BOA [3] use DBT software to map PowerPC binaries to a VLIW hardware engine. Startup performance is not explicitly addressed, and the translation overhead is projected to be at least similar to that of the Transmeta CMS systems [41, 83].
A characteristic property of VM systems is that they usually feature translation/optimization software and a code cache. The code cache resides in a region of physical memory that is completely hidden from all conventional software. In effect the code cache [13, 41] is a very large trace cache. The software is implementation-specific and is developed along with the hardware design.
All the related co-designed VM systems discussed above employ in-order VLIW pipelines. As such, considerably heavier software optimization is required for translating and reordering instructions. In this thesis, we explore an enhanced superscalar microarchitecture, which is capable of dynamic instruction scheduling and dataflow graph collapsing for better ILP.
The ILDP project [76, 77] implements a RISC ISA (Alpha) with a co-designed VM. Because the underlying ILDP implementation ISA and microarchitecture are superscalar-like and reorder instructions dynamically, the DBT translation is much simpler than mapping to VLIW engines. However, the startup issue was not addressed [76, 78].
This thesis explicitly addresses the startup issue, as well as the quality of the native code generated by DBT. The approach taken in this research carries the co-designed hardware/software philosophy further by exploring simple hardware assists for DBT. The evaluation experiments are conducted for a prominent CISC architected ISA, the x86.
1.5 Overview of the Thesis Research
The major contributions in this thesis research are the following.
- Performance modeling of DBT systems. A methodology for modeling and analyzing dynamic translation overhead is proposed. The new approach enables an understanding of VM runtime behavior by modeling VM system performance from a memory hierarchy perspective. Major sources of overhead and potential solutions are then easily identified.
- Hardware/software co-designed DBT systems. A hardware/software co-designed approach is explored for improving dynamic binary translation systems. The results support enhancing the VMM by applying a more balanced software translation strategy and by adding simple hardware assists. The enhanced DBT systems demonstrate VM startup performance that is very competitive with conventional hardware translation schemes. Meanwhile, an enhanced VM system can achieve hardware simplicity and translation/optimization capabilities similar to software translation systems.
- Macro-op execution microarchitecture (joint work with Kim and Lipasti [63]). An enhanced superscalar pipeline, named macro-op execution, is proposed and studied to implement the x86 instruction set. The new microarchitecture shows superior steady-state performance and efficiency by first cracking x86 instructions into RISC-style micro-ops and then fusing dependent micro-op pairs into macro-ops that are streamlined for the processor pipeline. Macro-ops are treated and processed as single entities throughout the entire pipeline. Processor efficiency is improved because the fused dependent pairs not only reduce inter-instruction communication and instruction-level management, but also collapse the dataflow graph to improve ILP.
- An example co-designed x86 virtual machine system. To evaluate the significance of the above individual contributions, we design an example co-designed x86 virtual machine system that features the efficient macro-op execution engine. The overall approach is to integrate the discovered valuable software strategies and hardware designs into a synergetic VM system. Compared with conventional x86 processor designs, the example VM system demonstrates overall superior steady-state performance and competitive startup performance. The example VM design also inherits the complexity-effectiveness of the VM paradigm.
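As a rough illustration of the fusing step named in the macro-op execution contribution above, the following sketch greedily pairs each micro-op with the first nearby later micro-op that consumes its result. It is a simplified stand-in, not the fusing algorithm developed in this thesis: the `MicroOp` type, the `fuse_pairs` function, the register names, and the `window` parameter are all hypothetical, and legality checks such as operand-count limits, head-operation latency, and memory ordering are deliberately ignored.

```python
from typing import List, NamedTuple, Optional, Tuple


class MicroOp(NamedTuple):
    """Hypothetical RISC-style micro-op: one optional destination, several sources."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def fuse_pairs(ops: List[MicroOp], window: int = 2) -> List[Tuple[int, int]]:
    """Return (head_index, tail_index) pairs of dependent micro-ops.

    Greedy, in program order: a head is paired with the first unfused
    micro-op within `window` slots that reads the head's destination.
    """
    fused = set()
    pairs = []
    for i, head in enumerate(ops):
        if i in fused or head.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(ops))):
            tail = ops[j]
            if j not in fused and head.dest in tail.srcs:
                pairs.append((i, j))
                fused.update((i, j))
                break
            if tail.dest == head.dest:
                break  # head's value is overwritten; no later consumer exists
    return pairs


ops = [
    MicroOp("add", "t0", ("eax", "ebx")),
    MicroOp("load", "t1", ("t0",)),       # consumes t0 -> fuses with add
    MicroOp("sub", "t2", ("ecx", "edx")),
    MicroOp("cmp", None, ("t2", "t1")),   # consumes t2 -> fuses with sub
]
print(fuse_pairs(ops))  # [(0, 1), (2, 3)]
```

Each resulting pair would then be handled as a single entity by the pipeline, which is where the reduction in inter-instruction communication comes from.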
The rest of the dissertation is organized as follows.
Chapter 2 introduces the x86vm framework that serves as the primary vehicle for conducting this research. A baseline co-designed x86 virtual machine is then proposed for further investigation of the VM system. The baseline VM represents a state-of-the-art VM that employs software-only DBT. Finally, the three major VM components are described: the new microarchitecture, the co-designed VM software, and the implementation ISA.
Chapter 3 addresses the translation strategy. It presents a performance modeling methodology for VM systems from a memory hierarchy perspective. The dynamics of translation-based systems are explored within this model. Then, an overall translation strategy for reducing VM runtime overhead is proposed.
Chapter 4 addresses the translation software that determines the efficiency of translated native code in the proposed VM system. I discuss the major technical issues, such as the translation and optimization algorithms that generate efficient native code for the macro-op execution microarchitecture. These algorithms are also designed with translation efficiency in mind, in order to reduce overhead.
Chapter 5 addresses the translation hardware support. I propose simple hardware assists for binary translators. This chapter discusses the hardware assists from architecture, microarchitecture, and circuit perspectives, along with some analysis of their complexity. I also discuss other related hardware assists that are not explicitly studied in this thesis.
Chapter 6 emphasizes balanced synergetic integration of all VM aspects addressed in the thesis via a complete example co-designed x86 virtual machine system. The complete VM system is evaluated and analyzed with respect to the specific challenges architects are facing today or will face in the near future. Evaluations are conducted via microarchitecture timing simulation.
Chapter 7 summarizes and concludes the thesis research.
Because co-designed virtual machine systems involve many aspects of hardware and software, I evaluate individual thesis features and discuss the related work in each chapter. That is, evaluation and related work are distributed among the chapters.