An emerging standard uses pragmas to move parallel computations in C/C++ and Fortran to the GPU


Figure 4: High resolution nvvp timeline showing two kernels.

The Visual Profiler shows that only one data transfer occurs at the end of the data region as required by the copyout() clause.






Figure 5: Lower-resolution nvvp timeline showing two kernels and the copy at the end of the data region.

Removing the unnecessary data transfers increases the OpenACC speedup over the OpenMP version:

Speedup with copyin() and copy() clauses over OpenMP: 6.4x

Speedup with create() and copyout() clauses over OpenMP: 6.9x
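
The two timings differ only in the data clauses that wrap the computation. As a minimal sketch of the faster pattern (reconstructed from the description above; the full listing appears on an earlier page of this article), the matrices are created directly on the device and only the result is copied back:

#pragma acc data create(a[0:SIZE][0:SIZE], b[0:SIZE][0:SIZE]) \
                 copyout(c[0:SIZE][0:SIZE])
{
  // The initialization and matrix-multiply kernels run here. No
  // host-to-device copy is needed because a and b are initialized
  // on the device; copyout() performs the single device-to-host
  // transfer seen in Figure 5 when the data region ends.
}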




Multidimensional Dynamic Arrays

The previous examples used static, globally accessible 2D C arrays. Most applications dynamically allocate all data structures, including multidimensional arrays that are frequently passed to functions and subroutines. A particular challenge for C/C++ programmers is that OpenACC transfers occur between contiguous regions of host and device memory. Non-contiguous multidimensional arrays (such as float ** arrays) are not recommended because they require a separate transfer for each contiguous memory region.
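
OpenACC's preference for contiguous data is easy to honor with a flat allocation. The following sketch (illustrative only, not taken from the article's listings) contrasts the two allocation styles:

/* Contiguous: a single malloc, so one transfer can move the whole
   matrix. Element (i,j) is addressed as a[i*size + j]. */
float *a = (float *)malloc(sizeof(float) * size * size);

/* Non-contiguous float **: the row-pointer array and each row are
   separate allocations, so moving the matrix to the device would
   require a separate transfer per row. Not recommended. */
float **b = (float **)malloc(sizeof(float *) * size);
for (int i = 0; i < size; ++i)
  b[i] = (float *)malloc(sizeof(float) * size);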

The following example, matrix-acc-func.c, dynamically allocates the 2D matrices for the test and passes them to doTest(), which performs the matrix initializations and multiplication. This test uses the C99 restrict keyword to indicate that the matrices do not overlap. For convenience, the 2D shape of the arrays is declared in the function parameters to make array accesses straightforward:

int doTest(restrict float a[][SIZE], restrict float b[][SIZE],
           restrict float c[][SIZE], int size)
{
  …
    c[i][j] = 0.0f;
}


Example 7: Code snippet for straightforward 2D array indexing.

Of course, the programmer can instead pass a pointer to the contiguous region of memory and manually calculate the offsets into the multidimensional array, as demonstrated in the next example:

int doTest(restrict float *a, restrict float *b, restrict float *c, int size)
{
  …
    c[i*size+j] = 0.0f;
}


Example 8: Code snippet demonstrating manual calculation of the 2D array offset.

The matrix-acc-func.c example also demonstrates the use of the OpenACC pragma "#pragma acc loop independent". The independent clause tells the compiler to ignore its own dependency analysis and trust that the programmer knows the loops have no dependencies. If the programmer is mistaken, incorrect and non-deterministic program behavior can result. Conversely, the OpenACC pragma "#pragma acc loop seq" tells the compiler to generate code that executes sequentially on the device.
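
To see why independent is a promise rather than a hint, consider a loop that genuinely carries a dependency. This sketch (illustrative only; the array names are hypothetical and do not appear in the article's examples) must use seq because each iteration reads the result of the previous one:

// A running sum: iteration k depends on iteration k-1, so the
// loop cannot be parallelized. Marking it "independent" would
// silently produce wrong answers; "seq" keeps it correct.
#pragma acc loop seq
for (k = 1; k < size; ++k)
  sum[k] = sum[k-1] + x[k];

The full matrix-acc-func.c listing, which uses both clauses, follows.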




/* matrix-acc-func.c */
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int doTest(restrict float a[][SIZE],
           restrict float b[][SIZE],
           restrict float c[][SIZE], int size)
{
  int i,j,k;

#pragma acc kernels create(a[0:size][0:size], b[0:size][0:size]) \
  copyout(c[0:size][0:size])
  {
    // Initialize matrices.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
        a[i][j] = (float)i + j;
        b[i][j] = (float)i - j;
        c[i][j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
#pragma acc loop seq
        for (k = 0; k < size; ++k) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  }
  return 0;
}

int main()
{
  int i,j,k;
  int size=SIZE;
  float *a= (float*)malloc(sizeof(float)*size*size);
  float *b= (float*)malloc(sizeof(float)*size*size);
  float *c= (float*)malloc(sizeof(float)*size*size);

  // Cast the flat allocations to 2D array pointers for doTest().
  doTest((float (*)[SIZE])a, (float (*)[SIZE])b, (float (*)[SIZE])c, size);

  free(a);
  free(b);

  // ****************
  // Double-check the OpenACC result sequentially on the host.
  // ****************
  float *seq= (float*)malloc(sizeof(float)*size*size);

  // Initialize the seq matrix.
  for(i = 0; i < size; ++i)
    for(j = 0; j < size; ++j)
      seq[i*size+j] = 0.f;

  // Perform the multiplication; note a[i][k] == i+k and b[k][j] == k-j.
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      for (k = 0; k < size; ++k)
        seq[i*size+j] += (i+k) * (k-j);

  // Compare the OpenACC result against the host result.
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      if(c[i*size+j] != seq[i*size+j]) {
        printf("Error (%d %d) (%g, %g)\n", i,j, c[i*size+j], seq[i*size+j]);
        exit(1);
      }
  free(c);
  free(seq);

  printf("OpenACC matrix multiplication test was successful!\n");

  return 0;
}


Example 9: matrix-acc-func.c source code.
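
For reference, the example can be built with the PGI compiler used earlier in this article. The exact command depends on your installation, but a typical invocation (using PGI's -acc and -Minfo=accel options) looks like this:

pgcc -acc -Minfo=accel matrix-acc-func.c -o matrix-acc-func

The -Minfo=accel flag prints the compiler's informational messages describing how each loop was scheduled on the accelerator.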

Using Data Allocated on the Device

OpenACC also provides the ability to use previously allocated device memory via the deviceptr() clause. The following example, matrix-acc-alloc.c, demonstrates how to allocate memory in main() with the OpenACC runtime routine acc_malloc(). The pointer is then passed to doTest(), where it is accessed via deviceptr(). Note that the copyout() clause specifies the size of the contiguous region of memory. For timing purposes, this code uses a matrix size specified by the user on the command line.




/* matrix-acc-alloc.c */
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

int doTest(restrict float *a, restrict float *b,
           restrict float *c, int size)
{
  int i,j,k;

  // copyout() transfers the full size*size result back to the host.
#pragma acc kernels deviceptr(a, b) copyout(c[0:size*size])
  {
    // Initialize matrices.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
        a[i*size+j] = (float)i + j;
        b[i*size+j] = (float)i - j;
        c[i*size+j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
#pragma acc loop seq
        for (k = 0; k < size; ++k) {
          c[i*size+j] += a[i*size+k] * b[k*size+j];
        }
      }
    }
  }
  return 0;
}

int main(int argc, char *argv[])
{
  if(argc < 2) {
    fprintf(stderr,"Use: size (for size x size) matrices\n");
    return -1;
  }
  int size=atoi(argv[1]);

  // a and b live only on the device; c is a host allocation.
  float *a = (float *)acc_malloc(sizeof(float)*size*size);
  float *b = (float *)acc_malloc(sizeof(float)*size*size);
  float *c = (float *)malloc(sizeof(float)*size*size);

  printf("size = %d\n",size);

  doTest(a,b,c, size);

  acc_free(a);
  acc_free(b);
  free(c);

  printf("OpenACC matrix multiplication test was successful!\n");

  return 0;
}


Example 10: Source code for matrix-acc-alloc.c.
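
As before, a typical build and run (assuming the PGI compiler and the flags used throughout this article) might look like:

pgcc -acc -Minfo=accel matrix-acc-alloc.c -o matrix-acc-alloc
./matrix-acc-alloc 1000

The single command-line argument specifies the matrix dimension, so the run above multiplies two 1000 x 1000 matrices.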

Conclusion

OpenACC has been designed to give OpenMP-style programmers an easy transition to GPU programming. Following the common-sense adage, "Make your life easy and use the highest-level API first," OpenACC provides a natural starting point for moving any C or Fortran code to massive parallelism. For legacy code, OpenACC can be the only viable route to massively parallel coprocessors because it eliminates the need for a total rewrite of the software and it supports Fortran. As a result, OpenACC opens the door to scalable, massively parallel GPU (or, more generically, coprocessor) acceleration of millions of lines of legacy application code. Currently, OpenACC is supported only by commercial compilers from PGI and CAPS Enterprise. The PGI compiler used in this article is free for evaluation, but continued use after the trial period expires requires purchasing a license. As with OpenMP, it is reasonable to expect that open-source compilers will eventually provide free OpenACC support.

Profiling and informational compiler messages play a key role in achieving high performance with pragma-based programming. Instead of blindly adding pragmas and guessing at the impact each might have on an application, free tools like the NVIDIA Visual Profiler let developers see what is actually happening at runtime on Windows, Linux, and Mac computers. Being able to see the effect OpenACC pragmas have on runtime behavior greatly speeds both the OpenACC learning process and application acceleration.

My next article in this series will discuss the OpenACC memory and execution model including the gang and worker clauses plus more sophisticated ways to handle data.



Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.
