An emerging standard uses pragmas to move parallel computations in C/C++ and Fortran to the GPU






Figure 2: Automated analysis performed by the NVIDIA Visual Profiler.

Matrix Multiply Is an Ideal Case

Most computationally oriented scientists and programmers are familiar with the BLAS (Basic Linear Algebra Subprograms) library, the de facto programming interface for basic linear algebra.

BLAS is organized into three levels with increasing data and runtime requirements.


  1. Level-1: Vector-vector operations that require O(N) data and O(N) work. Examples include taking the inner product of two vectors or scaling a vector by a constant multiplier.

  2. Level-2: Matrix-vector operations that require O(N²) data and O(N²) work. Examples include matrix-vector multiplication or a single right-hand-side triangular solve.

  3. Level-3: Matrix-matrix operations that require O(N²) data and O(N³) work. Examples include dense matrix-matrix multiplication. (A sketch with one call from each level follows this list.)
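To make the three levels concrete, here is a minimal sketch (not from the original article) that issues one call from each level through the standard CBLAS interface; cblas_sdot, cblas_sgemv, and cblas_sgemm are the standard single-precision routines, and the program assumes a CBLAS implementation such as OpenBLAS is linked in.

/* blas-levels.c: one call from each BLAS level via CBLAS. */
#include <stdio.h>
#include <cblas.h>

#define N 4

int main()
{
  int i;
  float x[N] = {1, 2, 3, 4};
  float y[N] = {4, 3, 2, 1};
  float A[N*N], B[N*N], C[N*N];
  for (i = 0; i < N*N; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

  /* Level 1: inner product -- O(N) data, O(N) work. */
  float dot = cblas_sdot(N, x, 1, y, 1);

  /* Level 2: y = 1.0*A*x + 0.0*y -- O(N²) data, O(N²) work. */
  cblas_sgemv(CblasRowMajor, CblasNoTrans, N, N, 1.0f, A, N, x, 1, 0.0f, y, 1);

  /* Level 3: C = 1.0*A*B + 0.0*C -- O(N²) data, O(N³) work. */
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, N,
              1.0f, A, N, B, N, 0.0f, C, N);

  printf("dot=%f  y[0]=%f  C[0]=%f\n", dot, y[0], C[0]);
  return 0;
}

Compile with, for example, cc blas-levels.c -lopenblas.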

The following table describes the amount of work that is performed by each BLAS level assuming that N floating-point values are transferred from the host to the device. This table does not take into account the time required to transfer the data back to the host.

BLAS level    Data       Work       Work per Datum
1             O(N)       O(N)       O(1)
2             O(N²)      O(N²)      O(1)
3             O(N²)      O(N³)      O(N)

Table 2: Work per datum for the three BLAS levels.
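As a quick sanity check of the level-3 row, the following back-of-the-envelope sketch (my numbers, not the article's) computes the flop-to-byte ratio of a 1k by 1k single-precision matrix multiply, counting one multiply and one add per inner-loop iteration and 4 bytes per float for the three matrices:

/* intensity.c: rough arithmetic intensity of an n x n matrix multiply. */
#include <stdio.h>

int main()
{
  double n = 1000.0;                /* 1k x 1k matrices */
  double flops = 2.0 * n * n * n;   /* one multiply + one add per c[i][j] term */
  double bytes = 4.0 * 3.0 * n * n; /* a and b in, c out, 4-byte floats */
  /* Prints about 166.7; the ratio is n/6, so it grows linearly with n. */
  printf("flops per byte: %.1f\n", flops / bytes);
  return 0;
}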

Matrix multiply is an ideal example for OpenACC acceleration because the data transfers become less important as the size of the matrices increases. Matrix multiply is a level-3 BLAS operation that performs O(N) work for every floating-point value transferred to the device. The effect of this high computational density can be seen in the following plot of wall-clock time on a dedicated system as the problem size increases. Multiplying 1k by 1k square matrices results in a 1.7x speedup of matrix-acc.c over matrix-omp.c when running on an NVIDIA C2050 GPU compared with a 2.65 GHz quad-core Intel Xeon E5630 processor. Increasing the matrix sizes to 11k by 11k shows a 6.4x speedup over OpenMP. This empirically demonstrates the high work-per-datum runtime behavior of matrix multiply; similar speedups can be expected for other high work-per-datum computations as well.





Figure 3: Runtime behavior by matrix size of OpenACC and OpenMP implementations (lower is better).
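For readers who want to reproduce this kind of comparison, here is a minimal wall-clock harness (a sketch; the article's own timing code is not shown) that can wrap either the OpenMP or the OpenACC multiply:

/* timing.c: wall-clock an arbitrary code section with clock_gettime(). */
#include <stdio.h>
#include <time.h>

static double wtime()
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
}

int main()
{
  double t0 = wtime();
  /* ... place the OpenMP or OpenACC matrix multiply here ... */
  printf("elapsed: %.3f seconds\n", wtime() - t0);
  return 0;
}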

The Three Rules of Coprocessor Programming

Matrix multiply is an excellent teaching tool, but most real-world calculations do not exhibit such ideal behavior. Instead, the programmer must be creative and pay close attention to data transfers and computational density on the OpenACC device(s). High performance can be achieved when the compute-intensive portions of the application conform to the following three rules of high-performance coprocessor programming (a sketch illustrating the first rule follows the list). If not, expect application performance to be either PCIe or device memory bandwidth limited.



  1. Transfer the data across the PCIe bus onto the device and keep it there.

  2. Give the device enough work to do.

  3. Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks.
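As an illustration of the first rule, the following sketch (a simplified example, not from the article) uses a structured data region so that the array stays resident on the device across two separate kernels regions; only the pragmas differ from ordinary C:

/* data-region.c: keep data on the device across two kernels regions. */
#define SIZE 1000
float a[SIZE];

int main()
{
  int i;
#pragma acc data create(a)    /* allocate a[] on the device; no PCIe transfer */
  {
#pragma acc kernels           /* first pass runs on the device */
    for (i = 0; i < SIZE; ++i)
      a[i] = (float)i;

#pragma acc kernels           /* second pass reuses a[] without a re-transfer */
    for (i = 0; i < SIZE; ++i)
      a[i] *= 2.0f;
  }  /* a[] is freed here; create means nothing is copied back to the host */
  return 0;
}

The next section applies the same idea by placing the create clause directly on a kernels region.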

Using the Create Clause to Allocate on the OpenACC Device

It is easy to create the matrices on the OpenACC device and initialize them there. Creating and initializing data on the OpenACC device conforms to the first rule because it avoids data transfers. The following example, matrix-acc-create.c, demonstrates the use of the create clause in a kernels region.




/* matrix-acc-create.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

#pragma acc kernels create(a,b) copyout(c)
  { // start of kernels
    // Initialize matrices.
    for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
        a[i][j] = (float)i + j;
        b[i][j] = (float)i - j;
        c[i][j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
    for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
        for (k = 0; k < SIZE; ++k) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  } // end of kernels

  return 0;
}

Example 6: matrix-acc-create.c

Following is the nvvp timeline showing that two kernels are now running on the GPU.



