Chapter 7
The co-designed virtual machine paradigm has a long history, dating back at least to the IBM System/38 [17] and AS/400 [12] systems. From its conception, the co-designed VM paradigm has been intended to handle machine interface complexities through its versatility, and it motivated pioneering projects at IBM, Transmeta, and a few other companies. However, due to technical challenges and non-technical inertia, application of the paradigm is still quite limited.
As the two fundamental forces in computer architecture (ever-expanding applications and ever-evolving implementation technology) continue to advance, it may no longer be possible to avoid the challenge of running a huge body of standard-ISA software on future novel architectures. The co-designed VM paradigm has the potential to mitigate this fundamental tension and enable architecture innovation.
In this thesis, I have followed the pioneers' insight and attempted to tackle the challenges they faced. The objective is a systematic study of the co-designed VM paradigm. During the thesis research, I overcame a number of obstacles and made a number of findings. In this final chapter, I summarize the findings and draw conclusions in Section 7.1. The encouraging conclusions suggest continued research effort, so in Section 7.2 I discuss future directions that justify further investment in, and application of, the paradigm. Finally, Section 7.3 offers some reflections.
7.1 Research Summary and Conclusions
To evaluate a design paradigm for computer systems, three aspects are fundamental: capability, performance, and complexity (or cost) effectiveness. The thesis research is summarized and evaluated along these three dimensions.
Capability
It is software that ultimately provides solutions to computing problems. Binary compatibility therefore practically implies capability: the ability to run the huge body of available software distributed in standard binary formats, usually a widely used legacy ISA such as the x86. The pioneers at IBM and Transmeta have already demonstrated the fundamental point that the hardware/software co-designed VM paradigm can maintain 100% binary compatibility.
In this thesis, I did not discuss this most important issue extensively. However, in the experimental study, I did not encounter any significant obstacle regarding compatibility. Perhaps the real question boils down to emulation efficiency via ISA mapping; an informal discussion of this consideration is included in Section 7.3.
System functionality integration and dynamic upgrading are another manifestation of capability. In conventional processor designs, many desirable functionalities are omitted for practical reasons, chiefly hardware complexity. Examples include dynamic hardware resource management, runtime adaptation of critical system parameters, advanced power management, and security checks, among many others. Furthermore, because a hardware bug cannot be fixed without recalling delivered products, exhaustive and expensive verification is required to guarantee 100% correctness of the integrated functionalities. Moreover, features implemented on a chip cannot be easily upgraded.
In contrast, the co-designed VM paradigm takes advantage of flexible and relatively cheap software to implement certain functionality; software components can be patched and upgraded at a much more affordable price. This practical capability for functionality integration and dynamic updates enables many desirable processor features involving runtime code transformation and inspection for performance, security, reliability, etc.
In this thesis, I implemented the x86 to fusible ISA mapping via co-designed software. The fusing algorithms are often beyond the complexity-effective hardware design envelope. More runtime optimizations can be added, revised, or enhanced later, and upgrading the VM software is simply a firmware download similar to a BIOS update. Rigorous verification is still needed, but at a much lower cost than pure hardware requires.
Performance
Once solutions become available, performance is usually the next concern. The eventual goal of the co-designed VM paradigm is to enable novel computer architectures free of legacy concerns. By its nature, therefore, the VM should have superior performance and efficiency at least in steady state, which is the dominant runtime phase for most applications. VM startup is affected the most by ISA translation overhead and has consequently long been a concern. This thesis demonstrates that by combining a balanced translation strategy, efficient translation software algorithms, and primitive hardware assists, VM startup behavior can be improved to the point where it is very competitive with conventional processor designs. Overall, the VM paradigm promises significant future performance potential.
In this thesis, I proposed an example co-designed x86 virtual machine that features a macro-op execution microarchitecture. One purpose of the co-designed x86 processor is to illustrate an efficient, high-performance VM design. By fusing dependent instruction pairs and processing the fused macro-ops throughout the entire pipeline, processor resources are better utilized and efficiency is improved. Fused macro-ops reduce instruction management overhead and inter-instruction communication; they also collapse the dynamic dataflow graph so that more ILP can be extracted. Overall, performance is improved via both higher ILP and faster clock speed. Primitive hardware assists, such as dual-mode decoders or translation-assist functional unit(s), ensure that the VM can keep up with conventional high-performance superscalar designs during the program startup phase. After the DBT translator detects hotspots and fuses macro-ops for the macro-op execution engine, steady-state IPC performance improves by 18% and 8% over a comparable superscalar design for the benchmarked SPEC2000 and Windows workloads, respectively. Another significant performance boost comes from the faster clock speed potential enabled by the pipelined instruction scheduler and the simplified result forwarding network.
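The core idea of macro-op fusion (pairing a producer micro-op with a dependent consumer so the pair occupies one scheduler slot) can be sketched as follows. This is a simplified, hypothetical illustration with a greedy pairing heuristic, not the thesis's actual fusing algorithm; the `MicroOp` representation and all names are assumptions for the sketch.

```python
# Simplified sketch of dependent-pair fusion: a producer micro-op is paired
# with the nearest later consumer of its result, so the fused pair is managed
# and scheduled as a single macro-op. Illustrative only; the real algorithm
# applies additional pairing constraints (e.g., single-cycle ALU ops only).
from dataclasses import dataclass

@dataclass
class MicroOp:
    op: str
    dest: str
    srcs: tuple

def fuse(trace):
    """Greedily pair each op with the nearest later op that reads its result."""
    fused, used = [], set()
    for i, head in enumerate(trace):
        if i in used:
            continue
        partner = None
        for j in range(i + 1, len(trace)):
            tail = trace[j]
            if j in used:
                continue
            if head.dest in tail.srcs:   # tail consumes head's result
                partner = j
                break
            if head.dest == tail.dest:   # result overwritten before use
                break
        if partner is None:
            fused.append((head,))                  # left unfused
        else:
            fused.append((head, trace[partner]))   # one macro-op, one slot
            used.add(partner)
        used.add(i)
    return fused

trace = [
    MicroOp("add", "r1", ("r2", "r3")),
    MicroOp("shl", "r4", ("r1",)),   # depends on the add: fusion candidate
    MicroOp("mov", "r5", ("r6",)),
]
pairs = fuse(trace)   # [(add, shl), (mov,)]
```

In this sketch, three micro-ops collapse into two schedulable units, which is the mechanism by which fusion reduces instruction management overhead and tightens the dataflow graph.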
Complexity
Modern processors, especially high-end designs, have become extremely complex. This extreme complexity has several implications, including longer time-to-market, higher power consumption, and lower system reliability, among many others.
The co-designed VM paradigm enables novel, efficient system designs by shifting functionality to software components when appropriate. This approach fundamentally reduces the overall complexity of a computer system. Implementing less functionality in transistors not only reduces product design complexity, but also implies, if properly engineered, reduced power consumption and a lower probability of undetected reliability flaws.
In this thesis, the example co-designed processor also demonstrates complexity effectiveness, from the pipeline front-end to the backend: the x86 decoders can be removed if a simplified backend functional unit is provided for translation; the instruction scheduler can be pipelined for simplified wakeup and select logic; and the unique coupling of a pipelined two-cycle scheduler with 3-1 collapsed ALU(s) removes the need for a complex ALU-to-ALU operand forwarding/bypass network. Additionally, if the design goal is to maintain performance comparable to conventional designs, the improved pipeline efficiency allows a reduced pipeline width for better complexity effectiveness. The example VM design requires only minimal revisions to current mature and well-proven superscalar designs, so time-to-market and reliability should remain at least comparable to current product levels. Power consumption should decrease further because fused macro-ops reduce instruction traffic through the pipeline even more than the Pentium M's micro-op fusion does.
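The reason a 3-1 collapsed ALU eliminates the ALU-to-ALU bypass path can be shown with a small functional sketch: a fused pair of dependent single-cycle operations reads all three source operands up front and produces one result, so the intermediate value never crosses a forwarding network. The operator set and function names below are illustrative assumptions, not the thesis's hardware specification.

```python
# Functional sketch of a 3-1 collapsed ALU: two dependent single-cycle ops,
# (a op1 b) then (intermediate op2 c), execute as one 3-input, 1-output
# operation. The intermediate value stays inside the ALU, so no ALU-to-ALU
# operand forwarding/bypass is needed. Illustrative only.
OPS = {
    "add": lambda x, y: x + y,
    "and": lambda x, y: x & y,
    "shl": lambda x, y: x << y,
}

def collapsed_alu(op1, op2, a, b, c):
    """Execute a fused dependent pair as a single three-input operation."""
    intermediate = OPS[op1](a, b)      # never leaves the ALU
    return OPS[op2](intermediate, c)   # consumed directly, no bypass hop

# Example: the fused pair computing (r2 + r3) << 1, with r2=5, r3=3
result = collapsed_alu("add", "shl", 5, 3, 1)   # (5 + 3) << 1 = 16
```

In hardware terms, the second operation consumes the first result combinationally within the collapsed unit, which is what permits the simplified forwarding network described above.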
In short, the co-designed VM paradigm provides more versatility and flexibility for processor designers. This flexibility advantage can be converted into capability, performance, and complexity-effectiveness advantages for future processor designs. This thesis research demonstrates that the key enabling technology, efficient dynamic binary translation, can be achieved via sound engineering. Efficient DBT incurs acceptable overhead and enables the translated code to run efficiently. The combination of efficient DBT and a co-designed processor enables new, efficient microarchitecture designs that are beyond conventional microprocessor technology.