2.5 Related Work on x86 Simulation and Emulation


There are many x86 simulation and emulation systems. However, most of these systems are either proprietary infrastructure that is not available for public access, or are released only in binary form, which supports only limited exploration of the x86 design space. For example, the Virtutech Simics [91] supports full-system x86 emulation and some limited timing simulation; the AMD SimNow! [112] is another full-system x86 simulator that runs under GNU/Linux, and its simulated system runs both 64-bit and 32-bit x86 operating systems and applications.

Fortunately, BOCHS [84] is an open-source x86 emulation project. We extracted the x86 instruction decode and instruction semantics routines from BOCHS 2.2 and customized them to decode and crack x86 instructions into our own implementation instruction set. The customized source code is integrated into our x86vm framework (Figure 2.1), providing full flexibility to support the desired simulations and experiments.

Dynamic optimization for x86 is an active research topic. For example, an early version of rePLay [104] and, more recently, the Intel PARROT [2] explored using hardware to detect and optimize hot x86 code sequences. The technical details of these related projects will be discussed in Chapter 5, where they are compared with our hardware assists for DBT. Merten et al. [98] proposed a framework for detecting hot x86 code with a BBB (Branch Behavior Buffer) and invoking software handlers to optimize the x86 code. However, because the generated code remains in the x86 ISA, the issue of sub-optimal internal code is not addressed.

The x86vm framework features a two-stage binary translation system: simple basic block translation for all code when it is first executed, and superblock translation for optimizing hot superblocks. The Intel IA-32 EL [15] employs a similar translation framework. However, there are many variations on this scheme. For example, IBM DAISY/BOA [3, 41, 42] and the Transmeta Code Morphing Software (CMS) [36] use an interpreter before invoking binary translators for more frequently executed code. The Transmeta Efficeon CMS features a four-stage translation framework [83] that uses interpretation to filter out code executed fewer than (roughly) 50 times; more frequently executed code invokes progressively more advanced translators based on its execution frequency.
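
To make the staged scheme concrete, the following is a minimal C++ sketch of a two-stage translation dispatcher in this spirit. The function names, data structures, and the promotion threshold are illustrative assumptions, not the actual x86vm or CMS implementation; the declared functions stand in for the translators and the execution engine.

    // Hypothetical sketch of a two-stage translation dispatcher.
    // Names and the promotion threshold are illustrative only.
    #include <cstdint>
    #include <unordered_map>

    struct Translation;                          // opaque handle to translated code
    Translation* bbt_translate(uint64_t pc);     // fast basic-block translation
    Translation* sbt_translate(uint64_t pc);     // optimizing superblock translation
    void execute(Translation* t);                // run translated code to the next exit

    constexpr unsigned kHotThreshold = 50;       // assumed hot-promotion threshold

    struct CodeCacheEntry {
        Translation* code = nullptr;
        unsigned     exec_count = 0;
        bool         optimized = false;
    };

    std::unordered_map<uint64_t, CodeCacheEntry> code_cache;

    void dispatch(uint64_t pc) {
        CodeCacheEntry& e = code_cache[pc];
        if (e.code == nullptr) {
            e.code = bbt_translate(pc);          // stage 1: translate on first execution
        } else if (!e.optimized && ++e.exec_count >= kHotThreshold) {
            e.code = sbt_translate(pc);          // stage 2: optimize once the block is hot
            e.optimized = true;
        }
        execute(e.code);
    }

A four-stage framework such as the Efficeon CMS would interpret below the threshold instead of translating immediately, and would add further optimization levels for the hottest code.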

The primary translation unit adopted in x86vm is the superblock [65]. A superblock is a sequence of basic blocks along a particular execution path. Its single-entry, multiple-exit property makes it amenable to dataflow analysis during translation. Superblocks are adopted as translation units in many systems, such as Dynamo(RIO) [22] and IA-32 EL [15]. However, some translation/compilation systems for VLIW machines that support predication use tree regions [3, 41, 42] or other units larger than basic blocks as the translation unit.
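
As an illustration of the single-entry, multiple-exit structure, the sketch below grows a superblock by following the more frequently taken successor of each basic block until it reaches a size limit or a repeated block. The profile fields and the size limit are hypothetical and are not the actual x86vm heuristics.

    // Illustrative superblock formation along the hottest path.
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    struct BasicBlock {
        uint64_t    start_pc;
        BasicBlock* taken;               // successors and edge counts gathered by profiling
        BasicBlock* fallthrough;
        uint64_t    taken_count;
        uint64_t    fallthrough_count;
        unsigned    num_insts;
    };

    // The superblock is entered only at its first block; every conditional
    // branch inside it remains a potential side exit.
    std::vector<BasicBlock*> form_superblock(BasicBlock* entry, unsigned max_insts = 200) {
        std::vector<BasicBlock*> sb;
        std::unordered_set<BasicBlock*> seen;    // stop when the path revisits a block
        unsigned total = 0;
        BasicBlock* b = entry;
        while (b != nullptr && !seen.count(b) && total + b->num_insts <= max_insts) {
            sb.push_back(b);
            seen.insert(b);
            total += b->num_insts;
            b = (b->taken_count >= b->fallthrough_count) ? b->taken : b->fallthrough;
        }
        return sb;
    }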

Chapter 3

Modeling Dynamic Binary Translation Systems


The essence of the co-designed VM paradigm is synergistic hardware and software that together implement an architected ISA. In contrast, conventional processor designs rely solely on hardware resources to provide the interface to conventional software. It is therefore critical to explore the dynamics of co-designed hardware and software to gain a clear understanding of VM runtime behavior. The resulting insight should help improve the efficiency and complexity-effectiveness of VM system designs. However, there are few publications that explicitly address the dynamics of translation-based co-designed VM systems.

In this chapter, we first develop an analytical model of staged translation systems from a memory hierarchy perspective. This model captures the first-order quantitative relationships between the major components of a VM system. We then use the model to analyze VM runtime behavior and to seek an overall translation strategy that balances the translation work assigned to the different parts of the VM translation system.


3.1 Model Assumptions and Notation


As discussed in Section 2.2 (evaluation methodology), it is easier to appreciate a new design by comparing it with the current best designs. Therefore, we are especially interested in comparing the following two processor design paradigms.

  • A reference superscalar paradigm -- the most successful general-purpose microarchitecture scheme in current processor designs. It dominates the server, desktop, and laptop markets and serves as our baseline. In conventional superscalar processors, limited translation is performed in the pipeline front-end every time an instruction is fetched.

  • The co-designed VM paradigm -- a hardware/software co-designed scheme that relies on a software dynamic translator to map instructions from the source architected ISA (x86) to the target implementation ISA. The hardware engine can then better realize microarchitecture innovations.

Note that, among the many dynamic binary translation/optimization systems and proposals, we model systems that use software translation and code caching. In these systems, runtime software translation overhead is a major concern. The code cache in a co-designed VM system is typically configured to occupy 10MB to 100MB of a 512MB to multi-gigabyte main memory. For example, the Transmeta CMS [36, 82] allocates 16MB for its laptop and mobile device workloads; the IBM DAISY/BOA VMM [3, 41, 42] allocates more than 100MB for server workloads.

There are also dynamic binary translation/optimization proposals that incur negligible performance overhead by devoting substantial hardware resources to hotspot optimization. These proposals, for example the instruction path coprocessor [25, 26], rePLay [45, 104], and PARROT [2], are mainly designed for dynamic optimization within the implementation ISA. They do not address code outside of hotspots, and the generated translations are placed in a small on-chip trace cache or frame cache. Therefore, these proposals are not designed for the cost-effective, cross-paradigm translation that we are particularly interested in.

The software binary translation that we consider can be modeled either as a whole, i.e., a black-box approach, or as a structured object, i.e., a white-box approach that distinguishes the major components inside. Which modeling approach to use in a given circumstance depends on the desired level of abstraction.

In our x86vm framework, the DBT system of the VM scheme either maps one basic block at a time in a straightforward way (BBT) for fast startup, or performs optimizations on hot superblocks (SBT) for superior steady-state performance. A simple and straightforward way to model runtime translation overhead is to treat binary translation as memory hierarchy miss behavior: an invocation of the DBT translator is treated as a miss in the code cache, and the miss event is handled by the VM translation software. From this memory hierarchy perspective, we introduce the following notation to model a staged translation system.



M_DBT denotes the total number of static instructions translated by the DBT system. For the x86vm framework, M_BBT represents the number of static instructions touched by a dynamic program execution, which hence must first be translated by BBT, and M_SBT represents the number of static instructions identified as hotspot code and thus optimized by SBT. This notation closely parallels that for memory misses: assuming I instructions have been executed, the miss rate is m_lev = M_lev / I, where the level lev can be BBT, SBT, or DBT, just as the memory hierarchy has caches, main memory, and disk.

Δ_DBT stands for the average translation overhead per (translated) architected ISA instruction. In particular, Δ_BBT and Δ_SBT represent the per-x86-instruction translation overhead for BBT and SBT, respectively. Translation overhead can be measured either in cycles (time) or in the number of implementation ISA instructions in which the translators are coded. In this thesis, we use the measure in terms of implementation ISA instructions; for comparison purposes, this number is sometimes converted to the equivalent number of architected ISA instructions.
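
Restating the definitions above, and combining them in one illustrative way (the combined form is an inference from the notation, not a formula quoted from the text), the miss rates and a first-order estimate of the translation overhead accrued per executed instruction can be written as:

\[
  m_{lev} = \frac{M_{lev}}{I}, \qquad lev \in \{\mathrm{BBT},\, \mathrm{SBT},\, \mathrm{DBT}\}
\]
\[
  \text{translation overhead per executed instruction} \;\approx\; m_{BBT}\,\Delta_{BBT} \;+\; m_{SBT}\,\Delta_{SBT}
\]

The first equation is simply the miss-rate definition given above; the second weights each stage's per-instruction translation cost by how often that stage is invoked per executed instruction.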

