The previous section concentrates on the kernel half of the macro-op fusing algorithm – the macro-op discovery algorithm that identifies appropriate dependent instruction pairs, and marks them to be fused into macro-ops. However, to complete the macro-op fusing algorithm, there is another half of the fusing algorithm: macro-op code scheduling that reorders instructions into dependent pairs.
Code scheduling algorithms in modern compilers typically re-order instructions to group independent instructions close together to enable better ILP. This kind of code scheduling proceeds by scheduling the exposed-set of an instruction dependency graph. However, the macro-op fusing algorithm needs to group dependent instruction pairs together. This cannot be achieved via conventional code scheduling manipulations.
After the macro-op discovery algorithm (the macro-op fusing algorithm discussed in the previous section – Section 4.4), data dependences have been considered and fusible pairs have been marked. Therefore the code scheduling algorithm considers the following. For a pair of identified fusible instructions, are there any other constraints or concerns that prevent moving the middle instructions (instructions between the head and the tail) to either before the head or after the tail? Such constraints could be memory ordering, interactions between different pairs, and any other special-case ordering concerns.
The code scheduling algorithm is listed in Figure 4.4. The major method, DepCodeScheduler, takes the original instruction sequence marked with macro-op fusing information as the input. It then processes the original sequence as an instruction (uops in the figure) stack and processes the uop(s) on the stack top. The output of the algorithm is a scheduled new sequence with most of the fusible pairs arranged consecutively and marked as macro-ops.
Algorithm: DepCodeScheduler (In: uops-seq, Out: macro-op-seq)
1. foreach (uop in uops-seq, from the last to the first) {
2. uops-stack.push(uop);
3. }
4. while (uops-stack is not empty) {
5. uop = uops-stack.pop();
6. if (uop is not fused) {
7. Gen_mops (uop, mops_single macro-op-seq);
8. }
9. else if (uop is head and the tail is consecutive){
10. head = uop; tail = uops-stack.pop();
11. Gen_mops (head, mops_head macro-op-seq);
12. Gen_mops (tail, mops_tail macro-op-seq);
13. }
14. else {
15. MidSet = uops-stack.pop(the set of uops between the head and tail);
16. head = uop; tail = uops-stack.pop();
17. can_fuse = partition(MidSet PreSet, PostSet);
18. if (can_fuse == true) {
19. DepCodeScheduler (PreSet macro-op-seq);
20. Gen_mops (head, mops_head macro-op-seq);
21. Gen_mops (tail, mops_tail macro-op-seq);
22. uops-stack.push(PostSet);
23. }
24. else {
25. Gen_mops (uop, mops_single macro-op-seq);
26. uops-stack.push(MidSet + tail);
27. }
28. }
29.}
Figure 4.15 Code scheduling algorithm for grouping dependent instruction pairs
The algorithm operates in the following way.
A non-fused uop at the top of the stack is popped off the stack and generates a single instruction in the fusible ISA. The instruction is placed in the output buffer. This is shown from Line 6 to Line 8 in the algorithm.
A pair of consecutive and dependent instructions at the top of the stack that are marked for fusing are popped off the stack and generate two consecutive instructions in the fusible ISA. The head instruction has its fusible bit set to mark the two instructions as a fused macro-op (L9~L13).
If the uop at the top of the stack is part of a marked fusible pair, but the two instructions are separated by a set of middle instructions (called MidSet), the algorithm first pops off all the head, tail and the MidSet instructions. Then the algorithm invokes a sub-algorithm to partition the MidSet into two sets. One is called PreSet that includes instructions that must be re-ordered before the marked macro-op head; the other is called the PostSet and it holds those instructions that can be re-ordered after the tail of the marked macro-op tail. If this partition fails due to memory ordering or other conditions, the head instruction is generated as a single instruction the fusing opportunity identified by the macro-op discovery algorithm is abandoned (Note: this re-ordering is done only for fusing dependent instruction pairs; the ordering between memory operations is strictly maintained as in the original x86 instruction sequence to maintain memory consistency model). Otherwise, if the partition succeeds, the scheduling algorithm recursively schedules the PreSet first, and then generates the macro-op pair being processed. Finally it pushes all the PostSet instructions back to the stack for further processing (L14 ~ L28).
The sub-algorithm, partition, is designed to honor all the original dependences in the original x86 code sequences, including memory operation ordering. The partition sub-algorithm prefers to assign instructions to PostSet for future processing. PreSet only holds instructions that must be scheduled before the head to maintain the correctness of the re-ordering.
4.6Simple Emulation: Basic Block Translation
As illustrated in Chapter 3, for most workloads, hotspots constitute only a small fraction of all executed static instructions. If the full-blown SBT procedure described above is applied to all the static instructions executed, the runtime overhead would be very significant. Therefore, like many other adaptive translation systems, the x86vm DBT software employs the staged translation strategy as discussed in Chapter 2 and 3. When instructions are first encountered, the VM uses fast and simple basic block translation (BBT). BBT improves performance over interpretation by (a) reducing the x86 instruction fetch/decode overhead to only one time per basic block; (b) exploiting native register mapping of the x86 architected state as well as specializing the native emulation code for each x86 instruction; (c) facilitating hardware accelerators to speed up the BBT translation. With these objectives, BBT is more efficient than interpretation as later results demonstrate. Because no optimization is performed, a basic block is selected as the translation unit to reduce code management complexity and avoid tail replication for translation units beyond a basic block.
The basic block translator fetches and decodes x86 instructions one-by-one. After each x86 instruction is discovered, BBT cracks it into multiple fusible ISA instruction(s) and stores the translation into the BBT code cache. The register mapping convention in Section 4.3 is enforced to communicate with other translations without overhead. Conceptually, the functionality of this BBT translation is very similar to the hardware decoders at the pipeline front-end of contemporary x86 processors [37, 51, 53, 58, 74].
However, there are several DBT specific concerns. The first DBT-specific issue regards branches. Each x86 branch instruction is translated into a corresponding fusible ISA branch instruction. If the branch target is known and the target translation is available, the fusible ISA branch transfers control directly to the target translation. Otherwise, the fusible ISA branch transfers control back to the VMM runtime, which will patch the branch and link directly to the target translation that will be available at the VMM link time. This translation control transfer mechanism is the same as that of the superblock translation.
The second DBT-specific regards hotspot profiling. In many VM systems, including the IBM Daisy and Transmeta co-designed VM systems, the initial emulation system also performs profiling and this adds extra overhead. The x86vm framework features special hardware support to detect hotspots. The goal is to reduce runtime overhead and the BBT translation complexity. As will be seen in Chapter 5, the removed BBT profiling will also facilitate hardware acceleration of the BBT translation process.
The third DBT-specific issue is code caching for BBT translations. Like optimized hotspot translations, BBT translations are cached in a BBT code cache for reuse. This cache space may require extra memory space. Typically, BBT translations are not repeatedly reused over a long period, however, so flushing a limited BBT code cache is acceptable, especially if BBT can re-translate very quickly.
Share with your friends: |