Easy GPU Parallelism with OpenACC

By Rob Farber, June 11, 2012




An emerging standard uses pragmas to move parallel computations in C/C++ and Fortran to the GPU

This is the first in a series of articles by Rob Farber on OpenACC directives, which enable existing C/C++ and Fortran code to run with high performance on massively parallel devices such as GPUs. The magic in OpenACC lies in how it extends the familiar face of OpenMP pragma programming to encompass coprocessors. As a result, OpenACC opens the door to scalable, massively parallel GPU acceleration of millions of lines of legacy application code without requiring a new language such as CUDA or OpenCL, or forking the application source tree to support multiple languages.

OpenACC is a set of standardized, high-level pragmas that enables C/C++ and Fortran programmers to utilize massively parallel coprocessors with much of the convenience of OpenMP. A pragma is a form of code annotation that informs the compiler of something about the code. In this case, it identifies the succeeding block of code or structured loop as a good candidate for parallelization. OpenMP is a well-known and widely supported standard that defines pragmas programmers have used since 1997 to parallelize applications on shared memory multicore processors. The OpenACC standard has generated excitement because it preserves the familiarity of OpenMP code annotation while extending the execution model to encompass devices that reside in separate memory spaces. To support coprocessors, OpenACC pragmas annotate data placement and transfer as well as loop and block parallelism.

The success of GPU computing in recent years has motivated compiler vendors to extend the OpenMP shared memory pragma programming approach to coprocessors. Approved by the OpenACC standards committee in November 2011, the OpenACC version 1.0 standard creates a unified syntax and prevents a "Tower of Babel" proliferation of incompatible pragmas. It has been rapidly adopted by companies such as NVIDIA, PGI (The Portland Group), CAPS Enterprise, and Cray.



Make Your Life Simple

Pragmas and high-level APIs are designed to provide software functionality. They hide many details of the underlying implementation to free a programmer's attention for other tasks. A colleague humorously refers to pragma-based programming as a negotiation between the developer and the compiler. Note that pragmas are informational statements provided by the programmer to assist the compiler. This means that pragmas are not subject to the same level of syntax, type, and sanity checking as the rest of the source code. The compiler is free to ignore any pragma for many reasons, including: it does not support the pragma, syntax errors, code complexity, unresolved (or potentially unresolved) dependencies, edge cases where the compiler cannot guarantee that vectors or matrices do not overlap, use of pointers, and many others. Profiling tools and informational messages from the compiler about parallelization, or an inability to parallelize, are essential to achieving high performance.

An OpenACC pragma for C/C++ can be identified from the string "#pragma acc", just as an OpenMP pragma can be identified from "#pragma omp". Similarly, Fortran OpenACC pragmas can be identified by "!$acc". Always ensure that these strings begin all OpenACC (or OpenMP) pragmas. Moreover, it is legal to mix OpenMP, OpenACC, and other pragmas in a single source file.

OpenACC Syntax

OpenACC provides a fairly rich pragma language to annotate data location, data transfer, and loop or code block parallelism. The syntax of OpenACC pragmas (sometimes referred to as OpenACC directives) is:



  • C/C++: "#pragma acc directive-name [clause [[,] clause]…] new-line"

  • Fortran: "!$acc directive-name [clause [[,] clause]…] new-line"

OpenACC pragmas in C/C++ are somewhat more concise than their Fortran counterparts as the compiler can determine a code block from the curly bracket "{}" notation. The OpenACC specification also requires that the _OPENACC preprocessor macro be defined when compiling OpenACC applications. This macro can be used for the conditional compilation of OpenACC code. The _OPENACC macro name will have a value yyyymm where yyyy is the year and mm is the month designation of the version of the OpenACC directives supported by the implementation.
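The conditional-compilation idiom for the _OPENACC macro can be sketched as follows. This is a minimal example, not from the article: the helper name openacc_device_count() is hypothetical, and acc_device_nvidia is the device-type constant used by the PGI implementation. A compiler without OpenACC support (plain gcc, for instance) leaves _OPENACC undefined and takes the host-only branch.

```c
#ifdef _OPENACC
#include <openacc.h>   /* only available when an OpenACC compiler is used */
#endif

/* Report how many accelerator devices the runtime sees.  With an
   OpenACC compiler (e.g., pgcc -acc), _OPENACC is defined and the
   runtime is queried; otherwise the host-only fallback returns 0. */
int openacc_device_count(void)
{
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_nvidia);
#else
    return 0;   /* compiled without OpenACC support */
#endif
}
```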

Table 1 shows a list of OpenACC version 1.0 pragmas and clauses.



  • #pragma acc kernels (Fortran: !$acc kernels). Clauses: if(), async(),
    copy(), copyin(), copyout(), create(), present(), present_or_copy(),
    present_or_copyin(), present_or_copyout(), present_or_create(),
    deviceptr().

  • #pragma acc parallel (Fortran: !$acc parallel). Clauses: if(), async(),
    num_gangs(), num_workers(), vector_length(), reduction(), copyin(),
    copyout(), create(), present(), present_or_copy(), present_or_copyin(),
    present_or_copyout(), present_or_create(), deviceptr(), private(),
    firstprivate().

  • #pragma acc data (Fortran: !$acc data). Clauses: if(), copyin(),
    copyout(), create(), present(), present_or_copy(), present_or_copyin(),
    present_or_copyout(), present_or_create(), deviceptr().

  • #pragma acc loop (Fortran: !$acc loop). Clauses: collapse(), gang(),
    worker(), vector(), seq, private(), reduction(). Within a kernels
    region, gang(), worker(), and vector() may take an argument.

  • #pragma acc wait (Fortran: !$acc wait). No clauses.

Table 1. Currently supported OpenACC pragmas.

Two OpenACC environment variables, ACC_DEVICE_TYPE and ACC_DEVICE_NUM can be set by the user:



  • ACC_DEVICE_TYPE: Controls the default device type to use when executing accelerator parallel and kernels regions, when the program has been compiled to use more than one different type of device.
    The allowed values of this environment variable are implementation-defined.
    Examples include ACC_DEVICE_TYPE=NVIDIA.

  • ACC_DEVICE_NUM: Specifies the default device number to use when executing accelerator regions. The value of this environment variable must be a nonnegative integer between zero and the number of devices of the desired type attached to the host.
    If the value is zero, the implementation-defined default is used.
    If the value is greater than the number of devices attached, the behavior is implementation-defined.
    On multi-GPU systems, this variable can avoid the TDR (Timeout Detection and Recovery) watchdog reset for long-running GPU applications by running on a GPU that is not used for the display. (Consult the vendor driver information to see how to modify the TDR time for your operating system.)
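For example, a long-running job on a dual-GPU workstation might be steered away from the display GPU as sketched below. The allowed device-type strings are implementation-defined (NVIDIA is the value shown in the standard's examples), and the device numbering here is an assumption about the machine; the application (such as matrix-acc-gpu, built later in this article) is then run as usual in the same shell.

```shell
# Select the NVIDIA device type and the second GPU (device 1),
# leaving device 0 free to drive the display.
export ACC_DEVICE_TYPE=NVIDIA
export ACC_DEVICE_NUM=1
echo "OpenACC will target $ACC_DEVICE_TYPE device $ACC_DEVICE_NUM"
```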

In addition, OpenACC provides several runtime routines: acc_get_num_devices(), acc_set_device_type(), acc_get_device_type(), acc_set_device_num(), acc_get_device_num(), acc_async_test(), acc_async_test_all(), acc_async_wait(), acc_async_wait_all(), acc_init(), acc_shutdown(), acc_on_device(), acc_malloc(), acc_free().

Vendor-specific information can be found on the NVIDIA, PGI, CAPS, and Cray websites.



Building, Running and Profiling a First Program

This tutorial uses The Portland Group (PGI) Accelerator C and Fortran compilers, release 12.5, with OpenACC support. PGI has been deeply involved in developing pragma-based programming for coprocessors since 2008, and it is a founding member of the OpenACC standards body. The PGI OpenACC compilers currently target NVIDIA GPUs, but it is important to note that OpenACC can support other coprocessors (such as AMD GPUs and Intel MIC) as well. More information about the PGI compilers is available on the company's website.




How To Try Out OpenACC

An extended 30-day trial license for the PGI software can be obtained by registering with NVIDIA. The Portland Group also provides a free 15-day OpenACC trial license, which can be obtained in three steps:

1. Download any of the available software packages for your operating system.
2. Review the PGI Installation Guide [PDF] or the PGI Visual Fortran Installation Guide [PDF] and configure your environment.
3. Generate the trial license keys. Note that the trial keys, and all executable files compiled using them, will cease operating at the end of the trial period.

The following set of examples multiplies two matrices a and b and stores the result in matrix c. They exercise a useful set of basic OpenACC data transfer, parallelization, and memory creation/access clauses. A C-language OpenMP matrix multiply is also provided to show the similarity between OpenACC and OpenMP and to provide CPU and GPU performance comparisons. While the PGI matrix multiplication performance is good, look to highly optimized BLAS (Basic Linear Algebra Subprograms) packages such as CUBLAS and phiGEMM for production GPU and hybrid CPU + GPU implementations.

Following is our first OpenACC program, matrix-acc-check.c. This simple code creates a static set of square matrices (a, b, c, seq), initializes them, and then performs a matrix multiplication on the OpenACC device. The test code then performs the matrix multiplication sequentially on the host processor and double-checks the OpenACC result.

/* matrix-acc-check.c */

#include <stdio.h>

#include <stdlib.h>

#define SIZE 1000

float a[SIZE][SIZE];

float b[SIZE][SIZE];

float c[SIZE][SIZE];

float seq[SIZE][SIZE];

  

int main()



{

  int i,j,k;

    

  // Initialize matrices.



  for (i = 0; i < SIZE; ++i) {

    for (j = 0; j < SIZE; ++j) {

      a[i][j] = (float)i + j;

      b[i][j] = (float)i - j;

      c[i][j] = 0.0f;

    }


  }

    


  // Compute matrix multiplication.

#pragma acc kernels copyin(a,b) copy(c)

  for (i = 0; i < SIZE; ++i) {

    for (j = 0; j < SIZE; ++j) {

      for (k = 0; k < SIZE; ++k) {

    c[i][j] += a[i][k] * b[k][j];

      }

    }


  }

  

  // ****************



  // double-check the OpenACC result sequentially on the host

  // ****************

  // Initialize the seq matrix

  for(i = 0; i < SIZE; ++i) 

    for(j = 0; j < SIZE; ++j) 

      seq[i][j] = 0.f;

    

  // Perform the multiplication



  for (i = 0; i < SIZE; ++i) 

    for (j = 0; j < SIZE; ++j) 

      for (k = 0; k < SIZE; ++k) 

    seq[i][j] += a[i][k] * b[k][j];

    

  // check all the OpenACC matrices



  for (i = 0; i < SIZE; ++i)

    for (j = 0; j < SIZE; ++j)

      if(c[i][j] != seq[i][j]) {

    printf("Error %d %d\n", i,j);

    exit(1);

      }


  printf("OpenACC matrix multiplication test was successful!\n");

    


  return 0;

}


Example 1: matrix-acc-check.c source code.

The OpenACC pragma tells the compiler the following:



  • #pragma acc: This is an OpenACC pragma.

  • kernels: A kernels region.
    No jumps are allowed into/out of the kernels region.
    Loops will be sent to the OpenACC device.
    The scope of the kernels region code block is denoted by the curly brackets in a C program.

  • copyin(): copy the contiguous region of memory from the host to the device.
    The variables, arrays or subarrays in the list have values in the host memory that need to be copied to the device memory.
    If a subarray is specified, then only that subarray of the array needs to be copied.

  • copy(): copy the contiguous memory region from the host to the device and back again.
    The variables, arrays or subarrays in the list have values in the host memory that need to be copied to the device memory.
    If a subarray is specified, then only that subarray of the array needs to be copied.
    The data is copied to the device memory on entry to the kernels region, and copied back to the host memory when the code block is complete.
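The subarray form mentioned in the clause descriptions above can be sketched with a simple vector scaling. This is a hypothetical example, not from the article: only the [0:n] subarrays are named in the data clauses, and a compiler without OpenACC support simply ignores the pragma, so the loop runs sequentially on the host.

```c
/* Double each element of x into y.  copyin() transfers the x subarray
   to the device; copy() transfers y in and back out again. */
void scale_by_two(const float *x, float *y, int n)
{
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i];
}
```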

The source code is compiled with the pgcc compiler and a successful test is indicated after the application runs as shown below:

pgcc -acc -fast -Minfo matrix-acc-check.c -o matrix-acc-check

./matrix-acc-check

OpenACC matrix multiplication test was successful!


The source code for matrix-acc.c was created by removing the host-side verification code from matrix-acc-check.c to simplify the following discussion.

/* matrix-acc.c */

#define SIZE 1000

float a[SIZE][SIZE];

float b[SIZE][SIZE];

float c[SIZE][SIZE];

  

int main()



{

  int i,j,k;

    

  // Initialize matrices.



  for (i = 0; i < SIZE; ++i) {

    for (j = 0; j < SIZE; ++j) {

      a[i][j] = (float)i + j;

      b[i][j] = (float)i - j;

      c[i][j] = 0.0f;

    }


  }

    


  // Compute matrix multiplication.

#pragma acc kernels copyin(a,b) copy(c)

  for (i = 0; i < SIZE; ++i) {

    for (j = 0; j < SIZE; ++j) {

      for (k = 0; k < SIZE; ++k) {

    c[i][j] += a[i][k] * b[k][j];

      }

    }


  }

  return 0;

}


Example 2: matrix-acc.c source code.

Note the similarity between matrix-acc.c and the following OpenMP implementation, matrix-omp.c. Only the pragmas are different as the OpenACC pragma includes copy operations that are not required in the OpenMP implementation.



/* matrix-omp.c */

#define SIZE 1000

float a[SIZE][SIZE];

float b[SIZE][SIZE];

float c[SIZE][SIZE];

  

int main()



{

  int i,j,k;

    

  // Initialize matrices.



  for (i = 0; i < SIZE; ++i) {

    for (j = 0; j < SIZE; ++j) {

      a[i][j] = (float)i + j;

      b[i][j] = (float)i - j;

      c[i][j] = 0.0f;

    }


  }

    


  // Compute matrix multiplication.

#pragma omp parallel for default(none) shared(a,b,c) private(i,j,k)

  for (i = 0; i < SIZE; ++i) {

    for (j = 0; j < SIZE; ++j) {

      for (k = 0; k < SIZE; ++k) {

    c[i][j] += a[i][k] * b[k][j];

      }

    }


  }

  return 0;

}


Example 3: matrix-omp.c source code.

Fortran programmers will find the corresponding source code in Example 4. Again, the OpenACC pragmas annotate data movement with the copy() and copyin() clauses. Note that the C-based pragmas know the extent of the code block due to the use of curly brackets while the Fortran version must explicitly specify the end of the scope of the pragma with "!$acc end …".




!     matrix-acc.f

      program example1 

      parameter ( n_size=1000 )

      real*4, dimension(:,:) :: a(n_size,n_size) 

      real*4, dimension(:,:) :: b(n_size,n_size) 

      real*4, dimension(:,:) :: c(n_size,n_size) 

  

!     Initialize matrices (values differ from C version)



      do i=1, n_size 

         do j=1, n_size 

            a(i,j) = i + j;

            b(i,j) = i - j;

            c(i,j) = 0.;

         enddo 

      enddo 

  

!$acc data copyin(a,b) copy(c) 



!$acc kernels loop 

!     Compute matrix multiplication.

      do i=1, n_size 

         do j=1, n_size 

            do k = 1, n_size

               c(i,j) = c(i,j) + a(i,k) * b(k,j)

            enddo 

         enddo 

      enddo 

!$acc end data

      end program example1


Example 4: matrix-acc.f source code.


The following commands compile the source code for each application with the PGI C and Fortran compilers. These commands assume the source code has been saved to the file name provided in the comment at the beginning of each example.

pgcc -fast -mp -Minfo -Mconcur=allcores matrix-omp.c -o matrix-omp
pgcc -fast -acc -Minfo matrix-acc.c -o matrix-acc-gpu
pgfortran -fast -acc -Minfo matrix-acc.f -o matrix-acc-gpuf

The command line arguments to the PGI C compiler (pgcc) and Fortran compiler (pgfortran) are:



  • -fast: Chooses generally optimal flags for the target platform.

  • -mp: Interpret OpenMP pragmas to explicitly parallelize regions of code for execution by multiple threads on a multi-processor system.

  • -acc: Interpret OpenACC pragmas.

  • -Minfo: Emit useful information to stderr.

  • -Mconcur: Instructs the compiler to enable auto-concurrentization of loops.

The Portland Group also provides a profiling capability that can be enabled via the PGI_ACC_TIME environment variable. By default, profiling is not enabled. Setting PGI_ACC_TIME to a positive integer value enables profiling while a negative value will disable it. The profiling overhead is minimal because the runtime only reports information collected by the GPU hardware performance counters. The wealth of information gathered by the runtime profiler can be seen in the output generated by matrix-acc-gpu after setting PGI_ACC_TIME=1:

rmfarber@bd:~/PGI/example1$ ./matrix-acc-gpu

  

Accelerator Kernel Timing data



/home/rmfarber/PGI/example1/matrix-acc.c

  main


    21: region entered 1 time

        time(us): total=139658 init=88171 region=51487

                  kernels=43848 data=7049

        w/o init: total=51487 max=51487 min=51487 avg=51487

        25: kernel launched 1 times

            grid: [63x63]  block: [16x16]

            time(us): total=43848 max=43848 min=43848 avg=43848


Example 5: Runtime profile output when PGI_ACC_TIME=1 for matrix-acc-gpu.

This output from the PGI runtime profiling tells us that the application spent 7 milliseconds transferring data and 43 milliseconds computing the matrix multiply kernel.



It is possible to create a timeline plot using the NVIDIA Visual Profiler (nvvp), which runs on Windows, Linux and Mac computers. (The nvvp application was previously known as computeprof.) The timeline is a new feature in the CUDA 4.2 release and is extremely useful!




Figure 1: nvvp timeline for matrix-acc-gpu.

Notice that there are:



  • Three host to device data transfers at the start of the computation. These transfers correspond to the copyin() clauses for matrices a and b plus the copy() clause for matrix c.

  • A GPU computation that requires 39.1% of the time for kernel main_24_gpu. A helpful feature of the PGI OpenACC compiler is that it intelligently labels the kernel with the routine name and line number to make these timelines intelligible.

  • A single data transfer back from the device to the host, which was required by the copy clause for matrix c at the end of the kernel.

The visual profiler provides an interactive display of the timeline. A larger screenshot would show the calls to the driver API for the CUDA context setup and the data transfers along with a host of other information. In addition, the nvvp profiler will analyze the application and provide automated suggestions. This requires running the application many times. It is recommended to look at the timeline first as this only requires running the application once.

For example, the following screenshot shows the initial analysis of the timeline shown in Figure 1:



