In a co-designed VM, a major component of the VMM is the dynamic binary translation (DBT) system, which maps architected ISA binaries to the implementation ISA, and it is this ISA mapping that causes the major runtime overhead. Efficient DBT is therefore the key enabling technology for the co-designed VM paradigm.
Since a co-designed VM system is intended to enable an innovative, efficient microarchitecture, it is implied that the translated native code executes more efficiently than it would on a conventional processor design. The efficiency advantage comes both from the new microarchitecture design and from the effectiveness, or quality, of the DBT system co-designed with the new microarchitecture. Once the architected ISA code has been translated, the processor reaches a steady state in which it executes only native code.
Before the VM system can reach steady state, however, it must first invoke DBT to map between ISAs, thereby incurring an overhead. This process is defined as the startup phase of the VM system. The translation overhead (per architected ISA instruction) of a full-blown optimizing DBT is quite heavy, on the order of thousands of native instructions per translated instruction. For example, DAISY [41] takes more than four thousand native operations to translate and optimize one PowerPC instruction for its VLIW engine. Translation (per Alpha instruction) to the superscalar-like ILDP ISA takes about one thousand Alpha instructions [76, 78]. To reduce this heavy DBT overhead, VM systems typically exploit the fact that for most applications, only a small fraction of static instructions execute frequently (the hotspot code). An adaptive/staged translation strategy can therefore reduce overall DBT overhead: staged emulation uses a light-weight interpreter or simple straightforward translator to emulate infrequent code (cold code), avoiding the extra optimization overhead. The reduced optimization overhead for cold code comes at the cost of inferior VM performance while emulating cold code. Both hotspot DBT optimization time and inferior cold code emulation performance contribute to the so-called slow startup problem for VM systems. Slow startup has long been a major concern regarding the co-designed VM paradigm because it can easily offset any performance gains achieved while executing translated native code.
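As a concrete illustration, the following C sketch shows one way such a staged emulation policy might be structured. The per-block counter, the HOT_THRESHOLD value, and the helper routines are hypothetical, illustrative assumptions rather than the mechanism of any particular VM system.

```c
#include <stdint.h>

#define HOT_THRESHOLD 1000   /* illustrative promotion threshold */

/* Hypothetical helper routines provided by the rest of the VMM. */
extern void  interpret(uint64_t pc);               /* light-weight cold path  */
extern void  execute_native(void *code);           /* run translated hotspot  */
extern void *translate_and_optimize(uint64_t pc);  /* heavy optimizing DBT    */

/* One entry per static basic block of architected-ISA code. */
struct block_profile {
    uint64_t exec_count;    /* times this block has been emulated  */
    void    *native_code;   /* optimized translation; NULL if cold */
};

/* Staged dispatch: interpret cold code cheaply, and invoke the
 * expensive optimizing translator only once a block proves hot. */
void emulate_block(struct block_profile *b, uint64_t pc)
{
    if (b->native_code) {
        execute_native(b->native_code);            /* steady state */
    } else if (++b->exec_count >= HOT_THRESHOLD) {
        b->native_code = translate_and_optimize(pc);
        execute_native(b->native_code);
    } else {
        interpret(pc);                             /* cold code */
    }
}
```

The design point is that the thousands-of-instructions translation cost is paid only for the small fraction of blocks that cross the hotness threshold; everything else stays on the cheap interpretive path.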
Figure 1.2 illustrates startup overheads using benchmarks and architectures described in more detail in Section 3.2. The figure compares the startup performance of a well-tuned, state-of-the-art VM model with that of a conventional superscalar processor running a set of Windows application benchmarks. The x-axis shows time in terms of cycles on a logarithmic scale. The IPC performance shown on the y-axis is normalized to the steady-state performance that a conventional superscalar processor can achieve, and the horizontal line across the top of the graph shows the VM's steady-state IPC performance (superior to the baseline superscalar). The graphed IPC performance is the aggregate IPC, i.e., the total number of instructions executed up to that point in time divided by the total time. At a given point in time, the aggregate IPCs reflect the total numbers of instructions executed, making it easy to visualize the relative overall performance up to that time.
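Stated as a formula, if N(t) denotes the total number of instructions retired by cycle t, the quantity plotted in the figure is simply:

```latex
\mathrm{IPC}_{\mathrm{agg}}(t) = \frac{N(t)}{t}
```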
The relative performance curves illustrate how slowly the VM system starts up compared with the baseline superscalar. An interesting measure of startup overhead is the time it takes for a co-designed VM to “catch up” with the baseline superscalar processor, that is, the time at which the co-designed VM has executed the same total number of instructions (as opposed to the time at which the instantaneous IPCs are equal, which happens much earlier). In this example, the crossover, or breakeven, point occurs at around 200 million cycles (or 100 milliseconds for a 2.0 GHz processor core).
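The cycle-to-time conversion is direct:

```latex
t_{\text{breakeven}} \approx \frac{200 \times 10^{6}\ \text{cycles}}{2.0 \times 10^{9}\ \text{cycles/s}} = 0.1\ \text{s} = 100\ \text{ms}
```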
Figure 1.2 Relative performance timeline for VM components
Clearly, long-running applications with small, stable instruction working sets can benefit from the co-designed VM paradigm with startup overheads of this magnitude. However, there are important cases where slow startup can put a co-designed VM at a disadvantage when compared with a conventional processor.
Example cases include:
- Workloads consisting of many short-running programs or fine-grained cooperating tasks: execution may finish before the performance lost to slow startup can be compensated for.
- Real-time applications: real-time constraints can be compromised if any real-time code is not translated in advance and has to go through the slow startup process.
- Multitasking server-like systems that run large working-set jobs: the slow startup process can be further exacerbated by frequent context switches among resource-competing tasks. A limited code cache size causes hotspot translations for a switched-out task to be replaced; once the victim task is switched back in, the slow startup has to be repeated.
- OS boot-up or shut-down: OS boot-up/shut-down performance is important to many client-side platforms such as laptops and mobile devices.
It is clear that the co-designed VM paradigm can provide a complexity-effective solution if the dynamic binary translation system can be made efficient. Therefore, the major objective of this research is to address two complementary aspects of efficient binary translation: an efficient dynamic binary translation process, and efficient execution of the native code generated by that process.
An efficient dynamic binary translation process speeds up the startup phase by reducing runtime translation overhead. Hardware translation results in practically zero runtime overhead at the cost of extreme complexity, whereas software translation provides simplicity and flexibility at the cost of runtime overhead. The objective here, therefore, is to find hardware/software co-designed solutions that ideally demonstrate overheads (nearly) as low as purely hardware solutions while featuring the same level of simplicity and flexibility as software solutions. The feasibility of such an approach relies on applying more advanced translation strategies and adding only simple hardware assists (primitives) that accelerate the critical parts of the translation process. In this thesis, we search for a comprehensive solution that combines efficient software translation algorithms, simple hardware accelerators, and a new adaptive translation strategy that balances hotspot performance advantages against translation overhead.
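As a sketch of what such a hardware/software division of labor might look like, consider a hypothetical hw_decode() primitive that cracks one variable-length architected (x86) instruction into a fixed-format micro-op record, while software retains the flexible, higher-level optimization work. All names and the struct layout here are illustrative assumptions, not a specific proposed design.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware-assist primitive: decodes (cracks) one
 * variable-length x86 instruction at `pc` into a fixed-format
 * micro-op record and returns the instruction's byte length.
 * In a real co-designed processor this would be a special
 * instruction, not a C function. */
struct uop {
    uint8_t opcode, dst, src1, src2;
    int32_t imm;
};
extern size_t hw_decode(const uint8_t *pc, struct uop *out);

/* Software side of the translator: walk a basic block, letting the
 * hardware assist perform the costly decode step while software keeps
 * the flexible, higher-level work (optimization, scheduling, code
 * cache management) to itself. */
size_t crack_block(const uint8_t *pc, const uint8_t *end,
                   struct uop *buf, size_t max)
{
    size_t n = 0;
    while (pc < end && n < max)
        pc += hw_decode(pc, &buf[n++]);  /* decode accelerated in hardware */
    /* ...software optimizer then schedules/fuses buf[0..n)... */
    return n;
}
```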
Efficient native code execution affects VM performance mainly for program hotspots: the faster the translated native code runs, the greater the efficiency and benefit the VM system achieves. To serve as a research vehicle illustrating how efficient microarchitectures can be enabled cost-effectively by the VM paradigm, we explore a specific co-designed x86 virtual machine in detail. This example VM features macro-op execution [63] to show that a co-designed virtual machine can provide elegant solutions for a real-world architected ISA such as the x86.