Abstract ii
-
Introduction 1
1.1 The Dilemma: Legacy Code and Novel Architectures 2
1.2 Answer: The Co-Designed Virtual Machine Paradigm 4
1.3 Enabling Technology: Efficient Dynamic Binary Translation 6
1.4 Prior Work on Co-Designed VMs 10
1.5 Overview of the Thesis Research 12
-
The x86vm Experimental Infrastructure 15
2.1 The x86vm Framework 16
2.2 Evaluation Methodology 23
2.3 x86 Instruction Characterization 25
2.4 Overview of the Baseline x86vm Design 30
2.4.1 Fusible Implementation ISA 31
2.4.2 Co-Designed VM Software: the VMM 33
2.4.3 Macro-Op Execution Microarchitecture 34
2.5 Related Work on x86 Simulation and Emulation 37
-
Modeling Dynamic Binary Translation Systems 40
3.1 Model Assumptions and Notation 41
3.2 Performance Dynamics of Translation-Based VM Systems 43
3.3 Performance Modeling and Strategy for Staged Translation 48
3.4 Evaluation of the Translation Modeling and Strategy 53
3.5 Related Work on DBT Modeling and Strategy 58
-
Efficient Dynamic Binary Translation Software 60
4.1 Translation Procedure 61
4.2 Superblock Formation 62
4.3 State Mapping and Register Allocation for Immediate Values 63
4.4 Macro-Op Fusing Algorithm 65
4.5 Code Scheduling: Grouping Dependent Instruction Pairs 72
4.6 Simple Emulation: Basic Block Translation 75
4.7 Evaluation of Dynamic Binary Translation 76
4.8 Related Work on Binary Translation Software 88
-
Hardware Accelerators for x86 Binary Translation 94
5.1 Dual-mode x86 Decoder 94
5.2 A Decoder Functional Unit 98
5.3 Hardware Assists for Hotspot Profiling 103
5.4 Evaluation of Hardware Assists for Translation 105
5.5 Related Work on Hardware Assists for DBT 113
-
Putting It All Together: A Co-Designed x86 VM 116
6.1 Processor Architecture 117
6.2 Microarchitecture Details 120
6.2.1 Pipeline Front-End: Macro-Op Formation 121
6.2.2 Pipeline Back-End: Macro-Op Execution 124
6.3 Evaluation of the Co-Designed x86 Processor 129
6.4 Related Work on CISC (x86) Processor Design 142
-
Conclusions and Future Directions 149
7.1 Research Summary and Conclusions 150
7.2 Future Research Directions 153
7.3 Reflections 157
Bibliography 162
Table 2.1 Benchmark Descriptions 24
Table 2.2 CISC (x86) application characterization 26
Table 3.1 Benchmark Characterization: miss events per million x86 instructions 56
Table 4.2 Comparison of Dynamic Binary Translation Systems 91
Table 5.3 Hardware Accelerator: XLTx86 99
Table 5.4 VM Startup Performance Simulation Configurations 106
Table 6.5 Microarchitecture Configurations 130
Table 6.6 Comparison of Co-Designed Virtual Machines 146
Figure 1.1 Co-designed virtual machine paradigm 5
Figure 1.2 Relative performance timeline for VM components 8
Figure 2.3 The x86vm Framework 17
Figure 2.4 Staged Emulation in a Co-Designed VM 21
Figure 2.5 Dynamic x86 instruction length distribution 29
Figure 2.6 Fusible ISA instruction formats 31
Figure 2.7 The macro-op execution microarchitecture 36
Figure 3.8 VM startup performance compared with a conventional x86 processor 47
Figure 3.9 Winstone2004 instruction execution frequency profile 50
Figure 3.10 BBT and SBT overhead via simulation 53
Figure 3.11 VM performance trend versus hot threshold settings 54
Figure 4.12 Two-pass fusing algorithm in pseudo code 67
Figure 4.13 Dependence Cycle Detection for Fusing Macro-ops 69
Figure 4.14 An example to illustrate the two-pass fusing algorithm 70
Figure 4.15 Code scheduling algorithm for grouping dependent instruction pairs 73
Figure 4.16 Macro-op Fusing Profile 78
Figure 4.17 Fusing Candidate Pairs Profile (Number of Source Operands) 80
Figure 4.18 Fused Macro-ops Profile 82
Figure 4.19 Macro-op Fusing Distance Profile 84
Figure 4.20 BBT Translation Overhead Breakdown 86
Figure 4.21 Hotspot (SBT) Translation Overhead Breakdown 87
Figure 5.22 Dual mode x86 decoder 96
Figure 5.23 Dual mode x86 decoders in a superscalar pipeline 97
Figure 5.24 HW accelerated basic block translator kernel loop 99
Figure 5.25 Hardware Accelerator microarchitecture design 102
Figure 5.26 Startup performance: Co-Designed x86 VMs compared w/ Superscalar 108
Figure 5.27 Breakeven points for individual benchmarks 108
Figure 5.28 BBT translation overhead and emulation cycle time 110
Figure 5.29 Activity of hardware assists over the simulation time 113
Figure 6.30 HW/SW Co-designed DBT Complexity/Overhead Trade-off 118
Figure 6.31 Macro-op execution pipeline modes: x86-mode and macro-op mode 120
Figure 6.32 The front-end of the macro-op execution pipeline 121
Figure 6.33 Datapath for Macro-op Execution (3-wide) 126
Figure 6.34 Resource requirements and execution timing 128
Figure 6.35 IPC performance comparison (SPEC2000 integer) 132
Figure 6.36 IPC performance comparison (WinStone2004) 134
Figure 6.37 Contributing factors for IPC improvement 137
Figure 6.38 Code cache footprint of the co-designed x86 processors 141