To evaluate the proposed co-designed VM, we use the currently dominant processor design, the superscalar microarchitecture, as the reference (baseline) system. Ideally, the reference system would accurately model the best-performing x86 processor. However, for practical reasons (not least intellectual property issues), such a reference system is not available: the internal micro-ops and key design details and trade-offs of real x86 processors are not public. Consequently, the reference x86 processor design in this research is an amalgam of the AMD K7/K8 [37,74] and Intel Pentium M [51] designs. It is based on published machine configuration parameters such as pipeline widths, issue buffer sizes, and branch predictor table sizes. The reference configuration is described in detail in the specific evaluation sections.
Performance evaluation is conducted via detailed timing simulation. The simulation models for the different processor designs are derived from the x86vm framework. For the reference x86 processors, modified BOCHS 2.2 [84] x86 decode/semantic routines are used for functional simulation. RISC micro-ops are then generated from the x86 instructions and fed to the reference x86 timing simulator. The reference timing model is configured to be similar to the AMD K7/K8 and Intel Pentium M designs. For the co-designed VM designs, dynamic binary translators are implemented as part of the concealed co-designed virtual machine software. A simulation model of the x86vm pipeline is used to model the detailed design of the various co-designed processor cores.
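The flow for the reference processor (functional simulation, micro-op generation, then timing simulation) can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the names `X86Inst`, `MicroOp`, `crack`, `functional_sim`, and `timing_sim`, the two-way cracking rule, and the stall-free timing model are assumptions and are not taken from x86vm or BOCHS.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class X86Inst:
    """Hypothetical record for one decoded x86 instruction."""
    pc: int
    mnemonic: str
    has_mem_operand: bool  # True if the instruction reads a memory operand


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical RISC-style micro-op."""
    pc: int
    kind: str  # "load" or "alu" in this toy model


def crack(inst: X86Inst) -> List[MicroOp]:
    """Toy cracking rule (an assumption): memory-operand instructions
    become a load followed by an ALU micro-op; others become one ALU op."""
    uops = []
    if inst.has_mem_operand:
        uops.append(MicroOp(inst.pc, "load"))
    uops.append(MicroOp(inst.pc, "alu"))
    return uops


def functional_sim(program: Iterable[X86Inst]) -> Iterator[MicroOp]:
    """Stand-in for functional simulation: walk the instruction stream
    and emit micro-ops as each instruction is processed."""
    for inst in program:
        yield from crack(inst)


def timing_sim(uops: Iterable[MicroOp], width: int = 3) -> int:
    """Toy timing model (an assumption): issue up to `width` micro-ops
    per cycle with no stalls; returns the total cycle count."""
    count = sum(1 for _ in uops)
    return -(-count // width)  # ceiling division


program = [
    X86Inst(0x1000, "add", has_mem_operand=True),
    X86Inst(0x1003, "sub", has_mem_operand=False),
    X86Inst(0x1005, "and", has_mem_operand=True),
]
cycles = timing_sim(functional_sim(program), width=3)
```

The generator-based interface reflects only the staging described above, in which the timing model consumes a micro-op stream produced during functional simulation; a real timing simulator would of course model buffers, dependences, caches, and branch prediction.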
The SPEC2000 integer benchmarks and the Winstone2004 Business Suite are selected as the simulation workload. A brief description of the benchmarks is given in Table 2.1.
Benchmark binaries for the SPEC2000 integer benchmarks are generated by the Intel C/C++ v7.1 compiler with SPEC2000 -O3 base optimization. Except for 253.perlbmk, which uses a small reference input data set, the SPEC2000 benchmarks use the test input data set to reduce simulation time. SPEC2000 binaries are loaded by x86vm into its memory space and emulated by the extracted BOCHS 2.2 code to generate “dynamic traces” for the rest of the simulation infrastructure. The adapted BOCHS code can also generate micro-ops while performing the functional simulation.
Winstone2004 is distributed in binary format with an embedded data set. Full-system traces are collected randomly for all the Windows applications running on top of the Windows XP operating system. A colleague, W. Chang, installed Windows XP with the SP2 patch inside SimICS [91] and set up the Winstone2004 benchmark. This system was then used to collect traces that serve as x86 trace input streams to the x86vm framework. When processing these x86 trace files, the x86vm infrastructure does not need to perform functional emulation.
There are two important performance measurements: steady-state performance and startup performance. For steady-state performance evaluation, long simulations are run to ensure that steady state is reached. The SPEC2000 CPU benchmark runs are primarily targeted at measuring steady-state performance. All programs in SPEC2000 are simulated from start to finish; together, the benchmark suite executes more than 35 billion x86 instructions. For startup performance measurement, short simulations that stress startup transient behavior are used. Because Windows applications are deemed to be challenging for startup performance, especially for binary-translation-based systems, we focus on the Windows benchmarks for the startup performance study.
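The distinction between the two measurements can be made concrete with a small sketch. It assumes, purely for illustration, that performance is reported as x86 instructions per cycle (IPC) and that the simulator emits periodic samples of cumulative instruction and cycle counts; the metric, the sample format, the function names, and the numbers below are all hypothetical and are not results from this study.

```python
from typing import List, Tuple

# Hypothetical sample format: (cumulative x86 instructions, cumulative cycles)
Sample = Tuple[int, int]


def ipc_between(samples: List[Sample], start: int, end: int) -> float:
    """IPC over the interval between two sample indices."""
    insts0, cycles0 = samples[start]
    insts1, cycles1 = samples[end]
    if cycles1 <= cycles0:
        raise ValueError("interval must span at least one cycle")
    return (insts1 - insts0) / (cycles1 - cycles0)


def startup_and_steady_ipc(samples: List[Sample],
                           startup_insts: int) -> Tuple[float, float]:
    """Split the run at the first sample that reaches `startup_insts`
    instructions; report IPC before (startup) and after (steady state).
    The choice of split point is an assumption of this sketch."""
    split = next(k for k, (insts, _) in enumerate(samples)
                 if insts >= startup_insts)
    startup = ipc_between(samples, 0, split)
    steady = ipc_between(samples, split, len(samples) - 1)
    return startup, steady


# Made-up samples: a slow start followed by a faster steady phase.
samples = [(0, 0), (100, 400), (200, 600), (400, 800), (600, 1000)]
startup_ipc, steady_ipc = startup_and_steady_ipc(samples, startup_insts=200)
```

In this made-up run the aggregate IPC over the first 200 instructions is much lower than over the remainder, which is the kind of transient that short, startup-focused simulations are meant to expose and that long simulations amortize away.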