10.5.1 Dependency delays
Packing stream operations efficiently by removing dependency delays is an essential part of obtaining maximum performance from a stream processor. This section describes how to evaluate stream operation packing. It shows how a simple technique called double buffering can improve performance.
The following simple loop runs kernel k on successive strips of a buffer. spi_load_block loads input stream s1 and spi_store_block stores output stream s2; here both transfers use the same buffer, buf1:
/* Case 1: load and store to same buffer. */
for (i = 0; i < NSTRIPS; i++) {
    spi_load_block(s1, buf1, i * NBYTES, NRECS);
    k(s1, s2);
    spi_store_block(s2, buf1, i * NBYTES);
}
The execution trace looks like this:
The loop operations have the following dependencies:
- Kernel k depends on the preceding load: k uses s1 as an input stream, so k cannot begin until the s1 load completes.
- Kernel k also depends on the preceding store: k uses s2 as an output stream, so k cannot begin until the s2 store completes.
- Each store depends on the preceding kernel: it stores stream s2, an output stream of k, so it cannot begin until k completes.
- The load in the next loop iteration depends on the preceding store: because the store and the following load both use buffer buf1, the load cannot begin until the store completes.
Each operation depends on its predecessor, so the loop operations are fully serialized; none of them may run in parallel.
Each load, kernel, and store here is serialized, not overlapping with other operations. As a result, the stream operations are not densely packed; each resource goes unused at some times. To improve performance, the loop must be rewritten to remove dependencies.
A simple modification to the source to use separate buffers for input and output improves the loop’s performance somewhat:
/* Case 2: load and store to separate buffers. */
for (i = 0; i < NSTRIPS; i++) {
    spi_load_block(s1, buf1, i * NBYTES, NRECS);
    k(s1, s2);
    spi_store_block(s2, buf2, i * NBYTES);
}
Execution now looks like this:
Each store still depends on the preceding kernel, but the load in the next loop iteration no longer depends on the preceding store, so a store and a subsequent load can now occur simultaneously. However, the load in the next iteration depends on kernel k: it cannot overwrite the kernel's input stream s1 while s1 is still in use by k. The packing is better than in the preceding case, but there is still room for improvement; each resource still goes unused at some times.
Note also that the loads here take somewhat longer than in the previous case: a load and a store now run simultaneously, so they contend for available DMA bandwidth, and the loads run slower as a result.
To further improve performance, use double buffering to avoid the delay caused by each load's dependency on the preceding kernel execution. Double buffering essentially unrolls the loop once and performs successive loads and stores to separate streams rather than to the same stream, with the loop operations rearranged. For clarity, the example below assumes NSTRIPS is even; alternatively, the second load, second kernel call, and second store could each be guarded with if (i + 1 < NSTRIPS), as sketched after the loop.
/* Case 3: Double buffer. DPU-limited. */
for (i = 0; i < NSTRIPS; i += 2) {
    offset1 = i * NBYTES;
    offset2 = offset1 + NBYTES;
    spi_load_block(s1a, buf1, offset1, NRECS);
    spi_load_block(s1b, buf1, offset2, NRECS);
    k(s1a, s2a);
    k(s1b, s2b);
    spi_store_block(s2a, buf2, offset1);
    spi_store_block(s2b, buf2, offset2);
}
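If NSTRIPS may be odd, a minimal sketch of the guarded variant mentioned above looks like this (same streams and buffers as Case 3; only the guards are new):
/* Sketch: double buffering with guards for an odd NSTRIPS. */
for (i = 0; i < NSTRIPS; i += 2) {
    offset1 = i * NBYTES;
    offset2 = offset1 + NBYTES;
    spi_load_block(s1a, buf1, offset1, NRECS);
    if (i + 1 < NSTRIPS)
        spi_load_block(s1b, buf1, offset2, NRECS);
    k(s1a, s2a);
    if (i + 1 < NSTRIPS)
        k(s1b, s2b);
    spi_store_block(s2a, buf2, offset1);
    if (i + 1 < NSTRIPS)
        spi_store_block(s2b, buf2, offset2);
}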
Execution of this loop looks like this after its early stages:
The kernel execution resource is fully packed here. The loop is now DPU-limited: it keeps the DPU fully occupied while the required loads and stores occur in parallel with kernel execution. Further optimization requires improving the performance of the kernel itself.
If kernel execution is faster than the required loads and stores, performance is DMA-limited rather than DPU-limited. Use the same code as above, but with larger loads and stores and a faster kernel:
/* Case 4: Double buffer. Same code but faster kernel; DMA-limited. */
for (i = 0; i < NSTRIPS; i += 2) {
    offset1 = i * NBYTES2;
    offset2 = offset1 + NBYTES2;
    spi_load_block(s1a, buf1, offset1, NRECS2);
    spi_load_block(s1b, buf1, offset2, NRECS2);
    k2(s1a, s2a);
    k2(s1b, s2b);
    spi_store_block(s2a, buf2, offset1);
    spi_store_block(s2b, buf2, offset2);
}
Execution now performs continuous loads, so the loop is DMA-limited rather than DPU-limited. Further performance improvement requires tuning the loads and stores, for example by combining pipelines or by utilizing the full DRAM burst width.
The stream controller can perform only a single indexed load or store at a time. In this case, double buffering is of limited benefit; it allows kernel execution and loads or stores to be parallelized, but the indexed loads and indexed stores cannot be parallelized. Consider code similar to the above example, but with spi_load_index and spi_store_index rather than spi_load_block and spi_store_block:
/* Case 5: Indexed loads and stores. */
for (i = 0; i < NSTRIPS; i += 2) {
    offset1 = i * NBYTES2;
    offset2 = offset1 + NBYTES2;
    spi_load_index(s1a, buf1, offset1, idx, 1, 1, NRECS2);
    spi_load_index(s1b, buf1, offset2, idx, 1, 1, NRECS2);
    k2(s1a, s2a);
    k2(s1b, s2b);
    spi_store_index(s2a, buf2, offset1, idx, 1, 1);
    spi_store_index(s2b, buf2, offset2, idx, 1, 1);
}
Because the stream controller cannot perform an indexed load and an indexed store simultaneously, the loads and stores are serialized in spite of the double buffering:
If the index stream is fairly simple (for example, selecting every other line of a rectangular array), it might be worth using block loads and stores and performing the work of the index stream in the kernel instead: block loads and stores can be parallelized, while indexed loads and stores cannot. A sketch of this alternative follows.
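The sketch below assumes a hypothetical kernel k3 that performs the record selection internally; k3 is an illustrative name, not part of the API shown above:
/* Sketch: block transfers with the selection moved into the kernel.
 * k3 is hypothetical: it reads the full block from its input stream
 * and writes only the selected records to its output stream. */
for (i = 0; i < NSTRIPS; i += 2) {
    offset1 = i * NBYTES2;
    offset2 = offset1 + NBYTES2;
    spi_load_block(s1a, buf1, offset1, NRECS2);
    spi_load_block(s1b, buf1, offset2, NRECS2);
    k3(s1a, s2a);
    k3(s1b, s2b);
    spi_store_block(s2a, buf2, offset1);
    spi_store_block(s2b, buf2, offset2);
}
Because these are block transfers, the load of one strip can again overlap the store of another, as in Case 4.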
10.5.2 Dispatch delays
The time required for DSP MIPS to send an operation to the stream controller is typically about 500 nanoseconds for a load or store or about 1,000 nanoseconds for a kernel. Only one DSP MIPS sends stream operations but multiple resources can execute stream operations in parallel, so execution of each stream operation should take substantially longer than the time required to send it; if it takes longer to send each stream operation than to execute it, a pipeline will necessarily be dispatch limited. As a rule of thumb, stream operations should on average take at least twice as long to execute as to send: loads and stores should take at least 500 * 2 = 1,000 nanoseconds to execute, and kernels should take at least 1,000 * 2 = 2,000 nanoseconds. DSP MIPS can send stream operations ahead of execution, so a few stream operations can take less time to send than to execute, but then other stream operations should take longer to execute.
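For example, one iteration of the double-buffered loop shown earlier (Case 3) sends two loads, two kernel invocations, and two stores, costing roughly 2 × 500 + 2 × 1,000 + 2 × 500 = 4,000 nanoseconds of send time; by the rule of thumb above, those operations should together take at least 8,000 nanoseconds to execute, or DSP MIPS dispatch becomes the bottleneck.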
A stream command may have dependencies on other stream commands, as well as resource requirements. If these are all satisfied before the command is written, the time between when they are satisfied and when the command is written is called the command's dispatch delay. Dispatch delay has several components:
- user code time required by DSP MIPS code before the stream operation,
- result wait time required to wait for a kernel result,
- dispatch wait time for an available stream controller dispatch slot, and
- send time to send the operation.
spi_count and spi_out operations can introduce result wait time, because DSP MIPS must wait for a stream count or a kernel scalar output. The wait can sometimes be avoided by rearranging the code: defer the spi_count or spi_out call until after subsequent kernel calls or stream operations have been sent, so that DPU execution continues while DSP MIPS waits for the result.
Before:
    kernel1(..., x_out);
    x = spi_out(x_out);
    kernel2(...);
After:
    kernel1(..., x_out);
    kernel2(...);
    x = spi_out(x_out);
Dispatch waits are caused by stream controller hardware limits: the stream controller has a queue of twelve load/store operation slots and a queue of four kernel slots, so DSP MIPS must wait to write a stream controller command if the required slot is unavailable. Reordering loads, stores, and kernel invocations can eliminate dispatch slot waits in some cases, as sketched below.
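As an illustrative sketch (the stream names and the off_f offset are hypothetical), a burst of kernel invocations can fill the four kernel slots and stall DSP MIPS on the fifth, leaving an independent load stuck behind the wait; moving the load ahead of the final kernel call lets DSP MIPS send it while the kernel queue drains:
/* Before: the fifth kernel command may wait for a free kernel slot,
 * and the independent load is stuck behind that wait. */
k(s1a, s2a); k(s1b, s2b); k(s1c, s2c); k(s1d, s2d);
k(s1e, s2e);
spi_load_block(s1f, buf1, off_f, NRECS);

/* After: the load is sent while the kernel queue drains, so DSP MIPS
 * does useful dispatch work instead of waiting. */
k(s1a, s2a); k(s1b, s2b); k(s1c, s2c); k(s1d, s2d);
spi_load_block(s1f, buf1, off_f, NRECS);
k(s1e, s2e);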