Stream User’s Guide




10.1 Pipelines


The Stream programming model frees the programmer from specifying the details of synchronization between the parts of a stream processor. In many cases, however, the programmer does need to understand the dependencies and resource requirements in Stream code in order to write an efficient program. The System MIPS program, the DSP MIPS program, the stream controller, and a kernel on the DPU may all run simultaneously, and fully utilizing the stream processor's power requires careful programming. This section describes some common pipeline optimization issues.


Stream processor hardware includes a stream controller that loads kernels to the DPU, runs kernels, and performs direct memory access (DMA) data transfers between memory and the LRF. When a pipeline function in a Stream program running on DSP MIPS executes a spi_load_* function, for example, it writes a stream command to the stream controller to initiate the transfer, and then continues to execute subsequent code from the Stream program while the stream controller performs the data transfer. Stream commands have implicit dependencies: a spi_load_* command must wait for the completion of a previous spi_store_* command to the same buffer, a kernel may not begin execution until its argument streams are loaded, and so on. Stream commands also have resource requirements: the stream controller can only execute a single kernel at a time, for example. A stream controller command issues (begins execution) once all its dependencies and resource requirements are satisfied, and at some later time the command completes (finishes execution).
The point in time when DSP MIPS dispatches a stream operation to the stream controller is the operation’s dispatch point, the point when the operation begins execution is its issue point, and the point when the operation completes is its completion point. The interval between its issue point and its completion point is its execution time.

Here and in the spide visualizations below, the vertical axis represents time, while the horizontal axis represents resources; for example, the diagram above might represent a stream load operation. Since the vertical axis represents time, the height of the rectangle indicates its execution time.

The interval between the dispatch point of a stream operation and the dispatch point of the previous stream operation is its dispatch time (see the diagram below). The interval between when its resources become available and when its dependencies are satisfied is its dependency delay. The interval from the time when both its resources are available and its dependencies are satisfied until its dispatch point is its dispatch delay.


For each type of resource (e.g., the resource on the left in the diagram), the sum of execution times, dependency delays, and dispatch delays over the entire program equals the total program execution time. To achieve optimal performance, a program should try to fully utilize the performance-limiting resource of the processor; in other words, the performance-limiting resource should be kept busy all the time. If it is not busy, either it must be waiting for a command to be dispatched to it (dispatch delay) or it must be waiting for a dependency to be satisfied so that a command may begin execution (dependency delay). To improve performance, the programmer should pack operations to reduce dispatch delays and dependency delays, and then tune operations to reduce execution time.


The total dispatch delay time of a pipeline divided by its total execution time is its dispatch-limited time. Section Dispatch delays below describes how to reduce dispatch delays. The total dependency delay time of a pipeline divided by its total execution time is its dependence-limited time. Section Dependency delays below describes how to reduce dependency delays.
Simulation of a profile mode program generates a profile that contains performance information. The remainder of this chapter describes the use of Stream tools to evaluate performance and suggests how to use performance data to optimize performance.


10.2 Visualization

In spide, clicking on a profile file generated by profile mode simulation of a program opens a visualization of the program’s execution. Build spm_demo in sp16_profile mode, then run the testbench version. After it terminates, click on profile file testbench under build/sp16_profile/profile; the IDE opens testbench (Analysis) and testbench (Visual) views. Hit the Zoom to Fit button to the right of the visual view to see the entire profile:



In spide visualizations, the vertical axis represents time, with a time ruler along the left edge. The horizontal axis represents resources: DSP MIPS execution, stream loads, stream stores, kernel executions, and miscellaneous operations (kernel microcode VLIW loads and loads/stores for array, scalar, and conditional stream kernel arguments). Since the vertical axis represents time, the height of a rectangle represents its duration. After spm_demo starts DSP MIPS execution, it executes kernel gsr_compute_average repeatedly, shown by the very tightly-packed rectangles near the top of the visualization. (What appear to be single rectangles above are actually stacks of many very thin rectangles, as zooming in shows.) Then the program takes a relatively long time to sort the block averages and find the mode (background color); this code runs only on DSP MIPS, with no stream or kernel operations. Finally, it repeatedly calls gsr_remove_background, shown by the tightly-packed rectangles at the bottom of the visualization. Hovering over any item in the visualization brings up a pop-up description.


Zoom buttons to the right of the visual view let you zoom in or out. Hit the ‘+’ zoom button several times to zoom in, then scroll to the group of loads, stores and kernel calls near the top of the visualization.

Clicking on any item produces information in the Properties view. The example above gives properties of one call of kernel gsr_compute_average: its total duration, when it was written, issued and completed, and the stream controller slot used by the operation.
Hovering over an item produces red lines that show its dependencies on other items. In the example above, the highlighted gsr_compute_average kernel execution depends on a stream load and a stream store, as well as on additional items. The top of the green ‘T’-shaped line above the highlighted gsr_compute_average kernel execution rectangle indicates when the DSP MIPS program wrote the kernel execution request to the stream controller, the top of the highlighted rectangle indicates when the stream controller issued the operation, and the bottom of the highlighted rectangle indicates when the operation completed.
The testbench (Analysis) view gives tables with information about program performance, identical to the tables produced by spperf. The next section describes the tables. The remainder of this chapter shows how to use the information from spide visualizations and tables to improve Stream program performance.


