10.6 Kernels
A stream processor's DPU contains multiple arithmetic-logical units (ALUs), and its VLIW design allows it to issue operations to several ALUs in each clock cycle. The VLIW scheduler in the Stream compiler spc maps kernel source code to ALU instructions, scheduling on a basic-block basis. Increasing the size of a basic block often provides the scheduler with opportunities to schedule code more efficiently.
A programmer should think carefully about algorithms when designing kernels. For example, different data layouts can lead to dramatically different performance results. A kernel that computes a digital filter, where each output is the sum of n filter coefficients times n inputs, might place successive input elements in successive lanes or alternatively might place groups of successive elements in each lane, with very different performance impact.
The performance of a kernel’s inner loops typically determines its overall performance. This section presents some techniques to improve the performance of kernel inner loops.
10.6.1 Tune
A kernel performs with maximum efficiency if it uses all of the available ALUs to do useful work in all lanes. A kernel’s VLIW utilization is the number of operations it executed divided by the maximum number of operations that could have been executed in the same number of cycles. SIMD utilization is the average percentage of lanes doing useful work.
Unlike sequential programming, where increasing the number of operations required by a computation always degrades performance, in VLIW programming excess ALU slots are sometimes available, so additional operations can execute at no performance cost. To tune a kernel, improve both its VLIW utilization and its SIMD utilization.
Double-clicking on a kernel execution rectangle in a profile visualization brings up a kernel visualization. For gsr_compute_average:
The vertical axis represents time in DPU cycles, and the horizontal axis represents DPU resources (functional units, such as the five arithmetic-logical units ALU0 through ALU4). Clicking on any operation displays its properties in the Properties view. The Scheduled Time in the Properties view is basic-block-relative: for example, an ADDU32V operation at cycle 66 is reported as scheduled at cycle 13 of a basic block that starts at cycle 53. Solid black horizontal lines separate basic blocks in the visualization. Hovering over any operation displays its dependencies on preceding operations, and Show Edges and Hide Edges allow you to show or hide all dependency information. In a pipelined kernel, an operation may depend on an operation that appears below it.
The VLIW scheduler in the compiler spc controls VLIW code generation, so the user does not have direct control over the generated kernel code, but the kernel visualization can provide a general picture of how efficiently the kernel uses DPU resources. Block unrolling or increasing the size of a basic block can provide the VLIW scheduler with more flexibility, leading to more efficient schedules.
10.6.2 Reduce critical path
The critical path of the inner block of a kernel loop, reported in the performance analysis Tables section and shown as a set of red arcs in a kernel visualization, gives a lower bound on how long the block must take to execute. Program performance can be dramatically improved by shortening the critical path, allowing better VLIW scheduling. Suppose a block computes x, y, and z that depend on a, b, and c:
x = (flag) ? a : b;
y = x + c;
z = 7 * y;
As coded above, z depends on y, which in turn depends on x; the program cannot compute z until the computation of y is complete. Rewrite the code as follows:
x = (flag) ? a : b;
y = ((flag) ? a : b) + c;
z = ((flag) ? 7 * a : 7 * b) + 7 * c;
Here y and z depend only on a, b, and c, shortening the critical path for this code.
Rewriting simple conditionals with the kernel intrinsic operations spi_vselect* generates simpler code that uses predicated execution (no break in control flow):
Conditional:

    for (i = 0; i < MAX; i++) {
        ...
        if (x < y) {
            min = x;
        } else {
            min = y;
        }
        ...
        if (x < y) {
            a = b;
        }
        ...
    }

Predicated:

    for (i = 0; i < MAX; i++) {
        ...
        min = spi_vselect32(x < y, x, y);
        ...
        a = spi_vselect32(x < y, b, a);
        ...
    }
10.6.4 Software pipeline
Modulo software pipelining (SWP) rearranges the operations of a loop to construct a semantically equivalent loop with the shortest possible schedule length (in cycles), called the minimum iteration interval (MinII) of the loop. A single iteration of a pipelined loop may execute operations from several different iterations of the original loop. Execution of a pipelined loop suppresses some operations on some loop iterations to preserve the original loop semantics. Although the actual schedule length achieved by the scheduler is usually higher than MinII, reducing MinII usually reduces the achieved schedule length.
Software pipelining is the single most effective optimization for many inner loops. However, it can greatly increase compile time, and it does not always work on large loops. To software pipeline a loop, just add #pragma pipeline before the loop’s opening brace.
10.6.5Unroll
Loop unrolling makes a loop larger by replicating the loop body, thus providing the VLIW scheduler with opportunities to schedule the loop more efficiently. The __repeat__ keyword, described earlier, performs loop unrolling.
Kernel gsr_compute_average in spm_demo/gsr_pipeline.sc has a small loop that demonstrates the benefits of loop unrolling. It reads BLOCK_WIDTH (16 on SP16) RGB pixel color values from an input stream and accumulates sums of the RGB components:
for (i = 0; i < BLOCK_WIDTH; i += UNROLL) {
__repeat__(; UNROLL) {
spi_read(in_str, color);
r += spi_vshuffleu(0x0A0A0A02, color, 0);
g += spi_vshuffleu(0x09090901, color, 0);
b += spi_vshuffleu(0x08080800, color, 0);
}
}
If compiled without unrolling (i.e., with UNROLL=1), this loop contains only 13 operations scheduled into 25 cycles, resulting in a dismal 5.4% ALU utilization; the function requires over 1.2 milliseconds to process file sample.bmp. If the same source is compiled with the loop unrolled eight times using spc -D UNROLL=8, the loop instead contains 62 operations scheduled into 32 cycles, for 16.8% ALU utilization; the function requires only about 0.37 milliseconds to process sample.bmp, a dramatic 3x performance improvement.
Loop unrolling has some of the benefits of software pipelining, and the two optimizations can be combined. Loop unrolling can increase VLIW instruction memory usage, so the programmer should pay attention to total VLIW instruction memory usage when unrolling. In general, the programmer can realize most of the benefits of unrolling by unrolling a loop two or four times. The programmer should try different unroll values and compare the resulting performance.