Introduction

Overview

Teaching: 20 min
Exercises: 0 min

Questions

How shared memory parallel programs work?

What is OpenMP?

How to write and compile parallel programs in C?

Objectives

Understand the shared memory programming environment provided by OpenMP

Learn how to write and compile simple parallel programs in C

Shared Memory OpenMP Programming

As we learned in the General Parallel Computing lesson, parallel programs come in two broad flavors: shared-memory and distributed memory or message-passing. In this lesson, we will be looking at shared-memory programming, with a focus on Open Multi-Processing (OpenMP) programming.

In any parallel program, the general idea is to have multiple threads of execution so that you can break up your problem and have each thread handle one part. These multiple threads need to be able to communicate with each other as your program runs. In a shared-memory program, this communication happens through the use of global variables stored in the global memory of the computer running the code. This means that communication between the various threads is extremely fast, as it happens at the speed of local memory (RAM) access. The drawback is that your program will be limited to a single physical machine (compute node of HPC network), since all threads need to be able to see the same RAM.

OpenMP is one way of writing shared-memory parallel programs. OpenMP is a specification, which has been implemented by many vendors.

The OpenMP effort began in 1996 when Accelerated Strategic Computing Initiative of the DOE brought together a handful of computer vendors including HP, IBM, Intel, SGI and DEC to create a portable API for shared memory computers. Vendors do not typically work well together unless an outside force encourages cooperation. So this committee communicated that DOE would only purchase systems with a portable API for shared memory programming.

The current OpenMP v.5.1 specification is 600+ pages long, but you need to know only a very small fraction of it to be able to use it in your code.

OpenMP standard describes extensions to a C/C++ or FORTRAN compiler. OpenMP libraries are built into a compiler, and this means that you need use a compiler that supports OpenMP. There are different OpenMP implementations (compilers) and not all sections of OpenMP specifications are equally supported by all compilers. It is up to programmers to investigate the compiler they want to use and see if it supports the parts of OpenMP specification that they wish to use. Luckily, the vast majority of OpenMP behaves the way you expect it to with most modern compilers. When possible, we will try and highlight any odd behaviors. All commonly used compilers such as gcc, clang, Intel, Nvidia HPC, and Absoft support OpenMP.

View OpenMP specifications
View Full List of Compilers supporting OpenMP
For an overview of the past, present and future of the OpenMP read the paper “The Ongoing Evolution of OpenMP”.

OpenMP Execution Model

The philosophy of OpenMP is to not sacrifice ease of coding and maintenance in the name of performance.

OpenMP programs realizes parallelism through the use of threads. Recollect that a thread is the smallest unit of computing that can be scheduled by an operating system. In other words a thread is a subroutine that can be scheduled to run autonomously. Threads exist within the resources of a single process. Without the process, they cease to exist.
OpenMP offers the programmer full control over parallelization. It is an explicit, not automatic programming model.
OpenMP uses the so-called fork-join model of parallel execution:
- OpenMP programs start as a single process, the master thread. The master thread executes sequentially. When the first parallel region is encountered. The master thread creates a team of parallel threads.We call this “forking”.
- The statements in the parallel region of the program are executed in parallel by various team threads.
- When the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread. We call this “joining”.
OpenMP divides the memory into two types: Global (or shared) memory, and thread-local memory. Every thread can read and write the global memory, but each thread also gets a little slice of memory that can only be read or written by that thread.

Compiling OpenMP programs

Since OpenMP is meant to be used with either C/C++ or FORTRAN, you will need to know how to work with at least one of these languages. This workshop will use C as the language for the examples. Before we start using a compiler let’s load the most recent environment module:

module load StdEnv/2020

You don’t need to run this command on the real CC clusters because StdEnv/2020 is the default, but we need to run it on the training cluster.

As a reminder, a simple hello world program in C would look like the following.

Compiling C code
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  printf("Hello World\n");
}
In order to compile this code, you would need to use the following command:
gcc -o hello hello_world.c
This gives you an executable file “hello” that will print out the text “Hello World”. You run it with the command:
./hello
Hello World
If you don’t specify the output filename with the -o option compiler will use the default output name “a.out” (assembler output).

GCC on Compute Canada

Currently the default environment on the general purpose clusters (Beluga, Cedar, Graham) is StdEnv/2020. The default compilers available in this environment on Graham and Beluga are Intel/2020.1.217 and gcc/9.3.0. On Cedar the default compiler is gcc/8.4.0.

To load another compiler you can use the command module load. For example, the command to load gcc version 10.2.0 is:
module load gcc/10.2.0

A Very Quick Introduction to C

Preprocessor directives

The #include directive tells the preprocessor to insert the contents of another file into the source code.
Include directives are typically used to include header files for C functions that are defined outside of the current source file:
```
#include <header_file> // Search in a predefined folders
#include "header_file" // Search in the current directory
```
Header files contain declarations of functions, variables and macros.
The #define directive declares constant values to be used throughout your code. For example it is a convenient way to change some parameters used throughout the code:
```
   # define SIZE 1000
```

The conditional group #ifdef.

  #ifdef MACRO
  controlled code
  #endif /* MACRO */

The block inside #ifdef ... #endif statements is called a conditional group. Controlled code will be included in the output of the preprocessor only if MACRO (a block of C statements) is defined.

Basic Syntax

Curly braces { ... } are used to group statements into a block of statements.
Statement terminator is ;

For Loop syntax:

for (initialization; condition test; increment or decrement)
{
       //Statements to be executed repeatedly
}

You can skip any statements from for loop, but semicolons must be retained. For example, you can do initialization before loop:
```
i=10;
for(;i<100; i++)
```
Skipping all loop statements will result in an infinite loop: for(;;)

Defining Functions

Function definitions have the following format:

  return_type function_name( parameter list ) {
     body of the function
  }

Example:

int max(int num1, int num2)
  {
     int result;
     if (num1 > num2)
        result = num1;
     else
        result = num2;
     return result;
  }

Using Memory

Static arrays are allocated on stack.
Size of stack-allocated arrays is limited by system-dependent threshold. Programs that attempt to stack-allocate arrays requiring more than this threshold will therefore crash as soon as they try to use the array.

Dynamic memory allocation is when an executing program requests that the operating system give it a block of main memory. The program then uses this memory for some purpose.

Key Points

Shared-memory parallel programs break up large problems into a number of smaller ones and execute them simultaneously

OpenMP programs are limited to a single physical machine

OpenMP libraries are built into all commonly used C, C++, or Fortran compilers

Hello World

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How do you compile and run an OpenMP program?

What are OpenMP pragmas?

How to identify threads?

Objectives

Write, compile and run a multi-threaded program where each thread prints “hello world”.

Adding parallelism to a program

Since OpenMP is an extension to the compiler, you need to be able to tell the compiler when and where to add the code necessary to create and use threads for the parallel sections. This is handled through special statements called pragmas. To a compiler that doesn’t understand OpenMP, pragmas look like comments. The basic forms for C/C++ and Fortran are:

#pragma omp < OpenMP directive >

!$OMP < OpenMP directive >

In C all OpenMP - specific directives start with #pragma omp.

How do we add in parallelism to the basic hello world program?

OpenMP is a library of functions and macros, so we need to include a header file omp.h with prototypes and macro definitions.

The very first directive that we will look at is the parallel directive. The parallel directive forks threads to carry out the work given in the parallel block of code.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {

#pragma omp parallel
   printf("Hello World\n");
}

To compile it, you’ll need to add an extra flag to tell the compiler to treat the source code as an OpenMP program.

gcc -fopenmp -o hello hello.c

If you prefer Intel compilers to GCC, use:

icc -qopenmp -o hello hello.c

NOTE: The default compilers in StdEnv/2020 are gcc/9.3.0 and Intel/2020.1.217. Other compilers can be used by loading their respective modules. For example, to load gcc/10:

module load gcc/10.2.0

When you run this program, you should see the output “Hello World” multiple times. But how many?

The OpenMP standard says this is implementation dependent. But the usual default is that OpenMP will look at the machine it is running on and see how many cores there are. It will then launch a thread for each core.

You can control the number of threads with environment variable OMP_NUM_THREADS. For example, if you want only 3 threads, do the following:

export OMP_NUM_THREADS=3
./hello

Execution steps of the parallel “Hello, world” program

The parallel pragma directs compiler to make a code that will start a number of threads equal to what was passed to the program via the variable OMP_NUM_THREADS
Each thread then executes the function printf(“Hello World\n”)
The threads rejoin the main thread when they return from the printf() function, at which point they are terminated
The main thread is then terminated itself

Using multiple cores

Try running the “hello” program with different numbers of threads.

Can you use more threads than the cores on the machine?

You can use nproc command to find out how many cores are on the machine.

Solution

Threads are an OS abstraction and have no direct relationship to cores. You can launch as many threads as you want (the maximum number of threads can be limited by OS and/or OpenMP implementation), however the performance may degrade if you use more threads than physical cores.

OpenMP with SLURM

When you wish to submit an OpenMP job to the job scheduler SLURM, you can use the following boilerplate.
#!/bin/bash
#SBATCH --account=sponsor0
#SBATCH --time=0:01:0
#SBATCH --cpus-per-task=3
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./hello
You could also ask for an interactive session with multiple cores like so:
[user45@login1 ~]$ salloc --mem-per-cpu=1000 --cpus-per-task=3 --time=1:0:0
salloc: Granted job allocation 179
salloc: Waiting for resource configuration
salloc: Nodes node1 are ready for job
[user45@node1 ~]$ 
The most practical way to run our short parallel program on our test cluster is using srun command. Instead of submitting the job to the queue srun will run the program from the interactive shell as soon as requested resources will become available. After the job is finished slurm will release the allocated resources and exit. Srun understands the same keywords as sbatch and salloc.

In SLURM environment operating system will see as many CPUs as you requested, so strictly there is no need to set OMP_NUM_THREADS variable to $SLURM_CPUS_PER_TASK.
srun --cpus-per-task=4 hello
# or even shorter:
srun -c4 hello

Identifying threads

Download and Unpack the Code.

If you have not yet done so, download and unpack the code:
cd scratch
wget https://github.com/ssvassiliev/Summer_School_OpenMP/raw/master/code/omp.tar.gz
tar -xf omp.tar.gz
cd code

How can we tell which thread is doing what?
The OpenMP specification includes a number of functions that are made available through the included header file “omp.h”. One of them is the function “omp_get_thread_num( )”, used to get an ID of the thread running the code.

/* --- File hello_world_omp.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   int id;

#pragma omp parallel
   {
     id = omp_get_thread_num();
     printf("Hello World from thread %d\n", id);
   }
}

Here, you will get each thread tagging their output with their unique ID, a number between 0 and (number of threads - 1).

Pragmas and code blocks in FORTRAN

An OpenMP directive applies to the code block following it in C or C++. Code blocks are either a single line, or a series of lines wrapped by curly brackets.

Because Fortran doesn’t have an analogous construction, many OpenMP directives in Fortran are paired with the matching “end” directive, such as !$omp parallel end.

Thread ordering

Try running the program a few times.

What order do the threads write out their messages in?

What’s going on?

Solution

The messages are emitted in random order. This is an important rule of not only OpenMP programming, but parallel programming in general: parallel elements are scheduled to run by the operating system and order of their execution is not guaranteed.

Conditional Compilation

We said earlier that you should be able to use the same code for both OpenMP and serial work. Try compiling the code without the -fopenmp flag.

What happens?

Can you figure out how to fix it?

Hint: If compiler is called with the OpenMP option it defines preprocessor macro _OPENMP, so you can use #ifdef _OPENMP and #endif preprocessor directives to tell compiler to process the line calling the omp_get_thread_num() function only if this macro is defined.
Solution
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   int id = 0;
   #pragma omp parallel
   {
#ifdef _OPENMP
   id = omp_get_thread_num();
#endif
   printf("Hello World from thread %d\n", id);
   }
}

A work-sharing construct divides the execution of the enclosed code region among the members of the thread team that encounter it.

Work-sharing constructs do not launch new threads
A program will wait for all threads to finish at the end of a work sharing construct. This behaviour is called “implied barrier”.

For

For construct divides iterations of a loop across the team of threads.
Each thread executes the same instructions. This assumes a parallel region has already been initiated, otherwise it executes in serial on a single processor.
For represents a type of data parallelism.

...
#pragma omp parallel for
    for (i=0; i < N; i++)
        c[i] = a[i] + b[i];
...

Stack Overflow

The easiest way to declare arrays as static:
 A[2048][2048];
Globally defined static arrays are allocated when a program starts, and they occupy memory until a program ends. If you declare large arrays as static, your program may crash with a “Segmentation fault” error. Static arrays are allocated on the stack, and the OS limits the size of the stack memory available to a user. You can check your stack memory limit using the command:
ulimit -s

Sections

Sections construct breaks work into separate, discrete sections.
It is is a non-iterative work-sharing construct.
It specifies that the enclosed section(s) of code are to be divided among the threads in the team. Independent section directives are nested within a sections directive. Each section contains different instructions and is executed once by a thread.
It is possible for a thread to execute more than one section.
Sections can be used to implement a functional parallelism.

#pragma omp parallel shared(a,b,c,d) private(i)
  {
#pragma omp sections nowait
    {

#pragma omp section
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];

#pragma omp section
    for (i=0; i < N; i++)
      d[i] = a[i] * b[i];

    }  /* end of sections */
  }  /* end of parallel region */

Here nowait keyword (clause) means that the program will not wait at the end of sections block for all threads to finish.

Exercise

Compile the file sections.c and run it on a different number of CPUs. This example has two sections and the program prints out which threads are doing them.

What happens if the number of threads and the number of sections are different?

More threads than sections?

Less threads than sections?

Solution

If there are more threads than sections, only some threads will execute a section. If there are more sections than threads, the implementation defines how the extra sections are executed.

Single

The single directive specifies that the enclosed code is to be executed by only one thread in the team.
May be useful when dealing with sections of code that are not thread safe (such as I/O).

Key Points

Pragmas are directives to the compiler to parallelize something

Thread number is typically controlled with an environment variable, OMP_NUM_THREADS

Order of execution of parallel elements is not guaranteed.

If the compiler doesn’t recognize OpenMP pragmas, it will compile a single-threaded program. But you may need to escape OpenMP function calls.

Parallel Operations with Arrays

Overview

Teaching: 20 min
Exercises: 20 min

Questions

How do I parallelize a loop?

Objectives

Use the PARALLEL FOR pragma (or PARALLEL DO)

Use the PRIVATE clause

Be able to identify data dependencies

One of the classic applications of programming is linear algebra, in all of its beauty and complexity. We will use a few simple problems to see how to execute loops in parallel using OpenMP.

Multiplying an array by a constant

The simplest problem is applying some function to an array of numbers. An example is multiplying each value by some constant number. In serial code you would loop over all array elements to do this multiplication:

/*  -- File array_multiply.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	struct timespec ts_start, ts_end;
	int size = 1e8;
	int multiplier = 2;
	int *a, *c;
	int i;
	float time_total;

	/* Allocate memory for arrays */
	a = malloc(size * sizeof(int));
	c = malloc(size * sizeof(int));

	/* Get start time */
	clock_gettime(CLOCK_MONOTONIC, &ts_start);

	/* Multiply array a by multiplier */
	for (i = 0; i < size; i++)
	{
		c[i] = multiplier * a[i];
	}

	/* Get end time */
	clock_gettime(CLOCK_MONOTONIC, &ts_end);

	time_total = (ts_end.tv_sec - ts_start.tv_sec) * 1e9 +
				 (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("Total time is %f ms\n", time_total / 1e6);
}

We added calls to clock_gettime( ) function declared in the time.h header file to get the start and end times of the heavy work being done by the for loop.
We used counts of how many seconds and how many nanoseconds elapsed, that are returned in the structures ts_start and ts_end. Then we did some math to convert the elapsed time to milliseconds.

Compiling and running a serial version

Compile the program.

gcc array_multiply.c -o array_multiply

Run it on the cluster.

srun --mem=2000 array_multiply

Time and Size

Run the program several times and observe execution time.

What happens to the run time of your program through multiple runs?

Change the size of the array, recompile and rerun the program. Observe execution time.

Did execution time change proportionally to the size of the array?

Creating a Parallel Version

Let’s parallelize this program using a parallel directive. We need to make two changes:

Include OpenMP header file omp.h.
Tell compiler that we want to parallelize the main for loop.

/*  -- File array_multiply_omp.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h> /* <--- OpenMP header --- */

int main(int argc, char **argv)
{
	struct timespec ts_start, ts_end;
	int size = 1e8;
	int multiplier = 2;
	int *a, *c;
	int i;
	float time_total;

	/* Allocate memory for arrays */
	a = malloc(size * sizeof(int));
	c = malloc(size * sizeof(int));

	/* Get start time */
	clock_gettime(CLOCK_MONOTONIC, &ts_start);

/* Multiply array a by multiplier */
#pragma omp parallel for /* <--- OpenMP parallel for loop --- */
	for (i = 0; i < size; i++)
	{
		c[i] = multiplier * a[i];
	}

	/* Get end time */
	clock_gettime(CLOCK_MONOTONIC, &ts_end);

	time_total = (ts_end.tv_sec - ts_start.tv_sec) * 1e9 +
				 (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("Total time is %f ms\n", time_total / 1e6);
}

Compiling and running a parallel version

Compile the program.

gcc array_multiply_omp.c -o array_multiply_omp -fopenmp

Run it on the cluster with 4 threads.

srun -c4 --mem-per-cpu=1000 array_multiply_omp 

Using threads

Run the program with different number of threads and observe how the runtime changes.

What happens to the runtime when you change the number of threads?

In this example, the number of iterations around the for loop gets divided across the number of available threads. In order to do this, OpenMP needs to know how many iterations there are in the for loop. This leads to another requirement: the number of iterations can not change part way through the loop. For the same reasons while loops can not be parallelized.

To ensure the parallel for loop works correctly

you must not change the value of size within one of the iterations.
you must not use a call to break() or exit() within the for loop, either. These functions pop you out of the for loop before it is done.

Summing the values in a matrix

Now let’s try adding up the elements of a matrix. Moving up to two dimensions adds a new layer of looping. The basic code looks like the following.

/* --- File matrix_multiply_omp.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(int argc, char **argv)
{
	struct timespec ts_start, ts_end;
	int size = 1e4;
	int **a, *c;
	int i, j;
	float time_total;

	/* Allocate memory */
	c = malloc(size * sizeof(int));
	a = (int **)malloc(size * sizeof(int *));
	for (i = 0; i < size; i++)
		a[i] = malloc(size * sizeof(int));

	/* Set all matrix elements to 1 */
	for (i = 0; i < size; i++)
	{
		for (j = 0; j < size; j++)
		{
			a[i][j] = 1;
		}
	}

	/* Zero the accumulator */
	for (i = 0; i < size; i++)
	{
		c[i] = 0;
	}

	clock_gettime(CLOCK_MONOTONIC, &ts_start);

#pragma omp parallel for
	/* Each thread sums one column */
	for (i = 0; i < size; i++)
	{
		for (j = 0; j < size; j++)
		{
			c[i] += a[i][j];
		}
	}

	int total = 0;
	/* Add sums of all columns together */
	for (i = 0; i < size; i++)
	{
		total += c[i];
	}

	clock_gettime(CLOCK_MONOTONIC, &ts_end);
	time_total = (ts_end.tv_sec - ts_start.tv_sec) * 1e9 +
				 (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("Total is %d, time is %f ms\n", total, time_total / 1e6);
}

Is the result correct?

What should be the result of this code?

Is that what it does? If not, what might be wrong?

Solution

The elements all have value 1, and there are 1e4*1e4 of them, so the total should be 1e8. Why isn’t it?

OpenMP threads share memory. This means that every thread can see and access all of memory for the process. In this case, multiple threads are all accessing the loop variable j. Threads that started last will reset j and as a result threads that started before will count elements more than once.

OpenMP private clause

To run the above task correctly in parallel each thread must have its own copy of the variable j. This can be achieved by declaring variables as private:

#pragma omp parallel for private(j)

This directive tells compiler that every thread must use its own private copy of j.

When a variable is declared private, OpenMP replicates this variable and assigns its local copy to each thread. Private copies of the variable are initialized from the original object when entering the region.

variables declared outside of a parallel region are implicitly shared
variables declared inside are implicitly private.

Thus another way to fix this problem is to define j inside the parallel for loop. That’s perfectly valid, but it will require changing code. One of the goals of OpenMP is to allow parallelization by only adding pragma statements that can be enabled or disabled at compile time.

Be Careful with Data Dependencies

In turning a serial program into a parallel program, it’s vital to maintain the correctness of the serial program. Data dependency is an important concept we use to guarantee that we’re transforming a serial program into an equivalent (in terms of its output) parallel program.

What is data dependency? If two statements read or write the same memory location, and at least one of the statements writes that memory location, then there is a data dependency on that memory location between the two statements. Loops with data dependencies can not be executed in parallel because the results may not be correct.

If there is a data dependency between two statements, then the order in which those statements are executed may affect the output of the program, and hence its correctness.

Consider this loop that computes a cumulative sum:

for ( i = 1; i < N - 1; i = i + 1 ) {
    a[i] = a[i] + a[i+1];
}

If we run to run this loop in parallel different iterations would be carried out by different threads of execution. If any two iterations didn’t happen in the order dictated by the serial code, as shown in the figure below, the results would be wrong.

Types of data dependencies:

Since either of the two statements can read or write a variable, there are four types of data dependencies.

FLOW (READ after WRITE), like the last example, when one statement uses the results of another.
ANTI (WRITE after READ), when one statement should write to a location only ‘‘after’’ another has read what’s there.
OUTPUT (WRITE after WRITE), when two statements write to a location, so the result will depend on which one wrote to it last.
INPUT (READ after READ). Both statements read a variable. Since neither tasks writes, they can run in either order.

View wikipedia page on data dependencies

Is There a Dependency?

Which of the following loops have data dependencies?
/* loop #1 */
for ( i=2; i<N; i=i+2 ) {
    a[i] = a[i] + a[i-1]; }

/* loop #2 */
for ( i=1; i<N/2; i=i+1 ) {
    a[i] = a[i] + a[i+N/2]; }

/* loop #3 */
for ( i=0; i<N/2+1; i=i+1 ) {
    a[i] = a[i] + a[i+N/2]; }

/* loop #4 */
for ( i=1; i<N; i=i+1 ) {
    a[idx[i]] = a[idx[i]] + b[idx[i]]; } 
Solution

Loop #1 does not. The increment of this loop is 2, so in the step 2 we compute a[2]=a[2]+a[1]. In the next step we compute a[4]=a[4]+a[3] … etc.
Loop #2 does not. In this range of i values each thread modifies only one element of the array a.
Loop #3 does. Here the last iteration creates data dependency writing to a[N/2]:
i=N/2; a[N/2] = a[N/2] + a[N]  
i=0;   a[0] = a[0] + a[N/2]
Loop #4 might or might not, depending on the contents of the array idx. If any two entries of idx are the same, then there’s a dependency.

Not everything can be parallelized. If there is data dependency you may need to modify your algorithm to be more parallel-friendly. Spend some time (drawing diagrams may help) to see what happens when two (or more) threads work simultaneously on a loop. It’s not pretty. One of the strategies for solving some data dependency problems may be to write into a new copy of an array make and at the end of each iteration update old from new.

Thread-safe functions

Another important concept is that of a thread-safe function.

Consider this code;

#pragma omp parallel for
for(i = 0, i < N, i++)
	y[i] = func(x[i])

Will this give the same results as the serial version?
Maybe … It depends on what function does. If the result does not depend on the order threads use different values of i, then the function is thread safe.

A thread-safe function is one which can be called simultaneously by multiple threads of execution.

Consider the function computing distance between two points in 2D:

float funct(float *p1, float *p2)
{
	float dx, dy;
	dx = p2[0]-p1[0];
	dy = p2[1]-p1[1];
	return(sqrt(dx*dx + dy*dy));
}

Execution of this function for different values of p1 and p2 is completely independent. Note that each call creates a new local values dx and dy on the call stack, so they are private to the thread executing the function. This function is thread safe.

Consider a modified version of the code where dx and dy are defined outside of the function:

float dx, dy;
float funct(float *p1, float *p2)
{
	dx = p2[0]-p1[0];
	dy = p2[1]-p1[1];
	return(sqrt(dx*dx + dy*dy));
}

The variables dx and dy may be modified by another thread after they are computed and before they are used. This function is not thread-safe.

You need to be aware when writing multi-threaded programs whether functions you are using are thread safe or not. For example random number generator rand() is not thread safe. For threaded applications thread safe version of random number generator rand_r() is available.

For more, see Thread safety.

Optimizing performance

CPU cache and data locality

Let’s do a quick experiment. Run our array_multiply_omp.c code on 4 CPUs and record timing. On our training cluster you will get something like:

Total is 100000000, time is 521.661536 ms

Then swap i and j indexes in line 38 of the main loop:

#pragma omp parallel for
        for (i = 0; i<size; i++) {
                for (int j=0; j<size; j++) {
                        c[i] += a[i][j]; /* <-- Swap indexes --- */
                }
        }

Recompile the program and run it again.

Total is 100000000, time is 17.243648 ms

What is going on?
Modern compilers should be smart enough to figure out how to maximize the performance, right?

Modern CPUs are fast, and main memory access is relatively slow. This can be a huge processing bottleneck because CPU will idle for many cycles waiting for the data. As a compromise, CPU manufacturers add some fast memory on the CPU chip. This memory is called CPU cache. Cache organization is fairly complex, with several cache levels differing in the access time.

When we want to loop through an array, the first access would go to memory to fetch data, and this is a very slow operation. To deal with this bottleneck upon the first access to the array instead of grabbing just one element CPU grabs a large chunk of data and puts it into the cache. The next time the CPU needs some data it first looks in the cache.

if we organize our computations such that once a chunk of data is fetched into the cache all that data is used in the computation before a different line is fetched, our program will run a lot faster than if CPU will keep fetching data from different parts of memory. This is called data locality.

Taking advantage from cache is possible only if an array we are looping through is stored sequentially as contiguous memory block.

C/C++ compiler stores static matrices in row-major order. Although we did not use static array, we stored our matrix in the same way. So if our inner loop iterates through elements of a row as shown in the diagram below, then a chunk of a row will be loaded into cache upon the first access.

  2 ... 1000          j=1               j=2   
1002 ... 2000  -->  1 2 ... 1000   1001 1002 ... 2000 
2002 ... 3000
...

This will not happen if in out inner loop we iterate through elements of a column (loop variable i) because columns are not contiguous memory blocks:

  2 ... 1000          i=1                          j=2   
1002 ... 2000  -->  1 1001 2001 ... 1000000   1002 1202 ... 1000000  
2002 ... 3000
...

In this case, CPU will need to access main memory to load each matrix element.

Avoid parallel overhead at low iteration counts

Creating a thread is expensive, it may cost thousands of CPU cycles. f you have a function that requires only hundreds of cycles, it is wasteful to parallelize it. The overhead alone will set you back. This can be avoided using conditional parallelization:

#pragma omp parallel for if(N > 1000) 
for(i = 0; i < N; i++)
	{
		a[i] = k*b[i] + c[i];
	}

Minimize parallelization overhead

If the inner loop is parallelized, in each iteration step of the outer loop, a parallel region is created. This causes parallelization overhead.

Key Points

The PARALLEL FOR (or PARALLEL DO) pragma makes a loop execute in parallel

A single variable accessed by different threads can cause wrong results

The PRIVATE clause makes a copy of a variable for each thread

Race Conditions with OpenMP

Overview

Teaching: 20 min
Exercises: 5 min

Questions

How can we calculate integrals in parallel?

How to compile a program using math functions with gcc?

Objectives

Understand what is a race condition

Learn how to avoid race conditions

Learn how to use reduction variables

Introduction

A data race occurs when two threads access the same memory without proper synchronization. This can cause the program to produce incorrect results in parallel mode. As we have learned, loops are parallelized by assigning different loop iterations to different threads. Because loop iterations can run in any order when two threads write to a shared variable in a parallel region the final value of the variable depends on which iteration writes last. In sequential mode, this is always the last iteration, but in parallel mode, this is not guaranteed.

In this section, we will use two example problems: parallel numerical integration and finding maximum in an array to look at how to control access to global variables.

Parallel numerical integration

As our example, let’s integrate the sine function from 0 to $\pi$. This is the same as the area under the first half of a sine curve. To compute approximation of an integral we will use the simplest Rectangle Method. We will partition area under the curve into a number of very narrow rectangles and add areas of these small shapes together. The single-threaded version is:

/* --- File integrate_sin.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv) {
	int nRect = 1e7;             /* Number of rectangles */ 
	double delta = M_PI/nRect;   /* Width of each rectangle */
	double total = 0.0;          /* Accumulator */
	int i;

	printf("Approximating integral with %.0e rectangles\n", (float)nRect);

	for (i=0; i<steps; i++) 
	{
		total += sin(delta*i) * delta;
	}

	printf("The integral of sine from 0 to Pi is %.12f\n", total);
}

Compiling code with mathematical functions

To compile a C program that uses math library with GCC we need to explicitly link to it:

 gcc -o integrate integrate_sin_omp.c -lm

Let’s run the program.

./integrate

Using 1e+07 steps
The integral of sine from 0 to Pi is 2.000000000000

The result in this case should be 2.0, and with 1e7 steps our program computed it accurately. To see what happens to the time this program takes, we’ll use a new tool. Since we just want to see the total time, we can use the command time:

srun time -p ./integrate

Using 1e+07 steps
The integral of sine from 0 to Pi is 2.000000000000
real 0.94
user 0.94
sys 0.00

The output real is the useful one; this example took 0.941 seconds to run. The user and sys lines describe how much time was spent in the “user” code and how much in the “system” code, a distinction that doesn’t interest us today.

Parallelizing Numerical Integration

The program spends most of its time computing areas of small rectangles and adding them together. Let’s parallelize the main loop and execute the code.

The program works, but the result is incorrect when we use more than one thread. What is the problem?

The data dependency on total leads to a race condition. Since we are updating a global variable, there is a race between the various threads as to who can read and then write the value of total. Multiple threads could read the current value, before a working thread can write the result of its addition. So these reading threads essentially miss out on some additions to the total.

How to avoid data race conditions?

One strategy to avoid data race is to synchronize threads to ensure that the variable is accessed in the right order. OpenMP provides a number of ways to ensure that threads are executed in the right order.

The omp critical directive

Race conditions can be avoided by adding a critical directive. A critical directive only allows one thread at a time to run some block of code.

/* --- File integrate_sin_omp.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(int argc, char **argv) {
	int steps = 1e4;
	double delta = M_PI/steps;
	double total = 0.0;
	int i;

	printf("Using %.0e steps\n", (float)steps);

#pragma omp parallel for
	for (i=0; i<steps; i++) {
// #pragma omp critical
// #pragma omp atomic
		total += sin(delta*i) * delta;
	}

	printf("The integral of sine from 0 to Pi is %.12f\n", total);
}

The critical directive is a very general construct; it can be applied to any arbitrary code block. It lets you ensure it is executed exclusively by one thread. It does it by locking this code block for other threads when it is executed. The drawback of this directive is poor performance. The first thread that acquires the lock makes progress, while others sit and wait in line until it has finished. In addition, significant overhead is added every time a thread enters and exits the critical section. And this overhead is on top of the inherent cost of serialization.

Let’s add the omp critical directive to the statement in the main loop and rerun the code. The addition of the critical directive slows down the program relative to the serial version. And if we run it with more threads, it slows down even more. Using critical, in this case, is a wrong decision because critical serializes the whole loop.

The omp atomic directive

Another way to avoid race conditions is to use omp atomic directive. The omp atomic directive is similar to critical but one thread being in an atomic operation doesn’t block any other atomic operations about to happen. Where available, atomic takes advantage on the CPU instructions providing atomic operations. Depending on the CPU architecture, some CPU instructions such as such as read-modify-write, fetch-and-add, compare-and-swap, ..etc) are atomic. These instructions perform multiple things in memory in a single, atomic step which can not be interrupted. In that case there’s no lock/unlock needed on entering/exiting the line of code, it just does the atomic operation, and hardware (or OS) ensures that it is not interfered with. Another advantage of the omp atomic directive is much lower overhead.

The downsides are that it can be used only to control a single statement and the set of operations that atomic supports is restricted. Of course, with both omp critical and omp atomic, you incur the cost of serialization.

Examples demonstrating how to use atomic:

#pragma omp atomic update
     k += n*mass;  /* update k atomically */

#pragma omp atomic read
     tmp = c;		/* read c atomically */

#pragma omp atomic write
     count = n*m;		/* write count atomically */

#pragma omp atomic capture
{ /* capture original value of v in d, then atomically update v */
		 d = v; 
		 v += n; 
} 

The atomic clauses: update (default), write, read, capture

Parallel Performance

Insert the omp critical directive, recompile the code and run it on more that one thread. Try different number of threads and record execution time.

Repeat with the omp atomic directive.

How execution time with one thread compares to 2 and 4 threads?

Solution

Version Threads Time

serial 0.94

parallelized using critical 1 1.11

parallelized using critical 2 1.32

parallelized using critical 4 1.71

parallelized using atomic 1 1.03

parallelized using atomic 2 0.69

parallelized using atomic 4 0.72

Version	Threads	Time
serial		0.94
parallelized using critical	1	1.11
parallelized using critical	2	1.32
parallelized using critical	4	1.71
parallelized using atomic	1	1.03
parallelized using atomic	2	0.69
parallelized using atomic	4	0.72

The Optimal Way to Parallelize Integration with OpenMP

The best option to parallelize summation is to let each thread to operate on its own chunk of data and when all threads finish add up their sums together. OpenMP provides a specific thread-safe mechanism to do this: the reduction clause. The reduction clause lets you specify thread-private variables that are subject to a reduction operation at the end of the parallel region.

As we are doing summation, the reduction operation we are looking for is “+”. At the end of the reduction, the values of all private copies of the shared variable will be added together, and the final result will be written to the global shared variable.

Let’s comment out both the critical and the atomic directives and add the reduction variable total subjected to the reductions operator “+” to the parallel for loop:

#pragma omp parallel for reduction(+: total)

Recompile the code and execute. Now we got the right answer, and x3.7 speedup with 4 threads!

In addition to summation OpenMP supports several other reduction operations, such as multiplication, minimum, maximum, logical operators. Next, we will look at other uses of reduction variables.

Finding the maximum value in an array

Let’s say that we need to search through an array to find the largest value. How could we do this type of search in parallel? Let’s begin with the serial version.

/* --- File array_max.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

int main(int argc, char **argv) {
	int size = 1e8;
	int *rand_nums;
	int i;
	int curr_max;
	time_t t;

	rand_nums=malloc(size*sizeof(int)); 

	/* Intialize random number generator */
	srand((unsigned) time(&t));

	/* Initialize array with random values */
	for (i=0; i<size; i++) {
		rand_nums[i] = rand();
	}

    /* Find maximum */
	curr_max = 0.0;
	for (i=0; i<size; i++) {
		curr_max = fmax(curr_max, rand_nums[i]);
	}

	printf("Max value is %d\n", curr_max);
}

This problem is analogous to the summation. You would want to make sure that each thread has a private copy of the curr_max variable, since it will be written to. When all threads have found the maximum value in their share of data you would want to find out which thread has the largest value.

Key Points

Race conditions can be avoided by using the omp critical or the omp atomic directives

The best option to parallelize summation is to use the reduction directive

Searching through data

Overview

Teaching: 20 min
Exercises: 15 min

Questions

How to search in parallel

Objectives

Use general parallel sections

Have a single thread execute code

Use a reduction operator

So far, we have looked at parallelizing loops.

OpenMP also allows you to use general parallel sections (parallel Construct).

When a thread encounters #pragma omp parallel directive OpenMP creates a team of threads. The thread that encountered the parallel directive first becomes the master thread of the new team, with a thread number of zero. Parallel region is executed by all of the available threads.

In these cases, it is up to you as a programmer to manage what work gets done by each thread. A basic example would look like the following.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   int id, size;

   #pragma omp parallel private(id, size)
   {
      size = omp_get_num_threads();
      id = omp_get_thread_num();
      printf("This is thread %d of %d\n", id, size);
   }
}

Private variables

What happens if you forget the private keyword?

Using this as a starting point, we could use this code to have each available thread do something interesting. For example, we could write the text out to a series of individual files.

Single threaded function

There are times when you may need to drop out of a parallel section in order to have a single one of the threads executing some code.

The #pragma omp single directive allows us to do this.

A code block associated with the single directive will be executed by only the first thread to see it.

More information: single Construct

Which thread runs first?

Modify the following code to print out only the thread that gets to run first in the parallel section. You should be able to do it by adding only one line. Here’s a (rather dense) reference guide in which you can look for ideas: Directives and Constructs for C/C++.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   int id, size;

   #pragma omp parallel private(id,size)
   {
      size = omp_get_num_threads();
      id = omp_get_thread_num();
      printf("This thread %d of %d is first\n", id, size);
   }
}

Solution

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   int id, size;

   #pragma omp parallel private(id,size)
   {
      size = omp_get_num_threads();
      id = omp_get_thread_num();
      #pragma omp single
      printf("This thread %d of %d is first\n", id, size);
   }
}

Finding the maximum value in an array

Let’s say that we need to search through an array to find the largest value. How could we do this type of search in parallel? Let’s begin with the serial version.

/* --- File array_max.c --- */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
   int size = 10000;
   float rand_nums[size];
   int i;
   float curr_max;

   /* Initialize with random values */
   for (i=0; i<size; i++) {
      rand_nums[i] = rand();
   }

   /* Find maximum */
   curr_max = 0.0;
   for (i=0; i<size; i++) {
      if (curr_max < rand_nums[i]) {
         curr_max = rand_nums[i];
      }
   }

   printf("Max value is %f\n", curr_max);
}

The first stab would be to make the for loop a parallel for loop. You would want to make sure that each thread had a private copy of the curr_max variable, since it will be written to. But, how do you find out which thread has the largest value?

Reduction Operators

You could create an array of curr_max variables, but getting that to work right would be messy. How do you adapt to different NUM_THREADS?

The keys here are

To recognize the analogy with the problem of total from the last episode, and
To know about reduction variables.

A reduction variable is used to accumulate some value over parallel threads, like a sum, a global maximum, a global minimum, etc. The reduction operators that you can use are: +, *, -, &, |, ^, &&, ||, max, min.

 /* --- File array_max_omp.c --- */
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 int main(int argc, char **argv) {
	struct timespec ts_start, ts_end;
	float time_total;
	int size = 1e7;
	int *rand_nums;
	int i;
	int curr_max;
	time_t t;

	rand_nums=malloc(size*sizeof(int));
	/* Intialize random number generator */
	srand((unsigned) time(&t));
	/* Initialize array with random values */
	for (i=0; i<size; i++)
		rand_nums[i] = rand();

	curr_max = 0.0;
	/* Get start time */
	clock_gettime(CLOCK_MONOTONIC, &ts_start);

 #pragma omp parallel for reduction(max:curr_max)
	for (i=0; i<size; i++)
		if (curr_max < rand_nums[i]) {
			curr_max = rand_nums[i];
		}

	/* Get end time */
	clock_gettime(CLOCK_MONOTONIC, &ts_end);
	time_total = (ts_end.tv_sec - ts_start.tv_sec)*1e9 + \
		     (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("Total time is %f ms\n", time_total/1e6);
	printf("Max value is %d\n", curr_max);
 }

Key Points

Reduction operators handle the common case of summation, and analogous operations

OpenMP can manage general parallel sections

You can use ‘pragma omp single’ to have a single thread execute something

OpenMP Tasks

Overview

Teaching: 20 min
Exercises: 5 min

Questions

How to recurse in parallel

Objectives

Use general parallel tasks

Introduction

The last construct we will look at is the omp task. It is the most recent addition to OpenMP specification introduced in version 3.0 in 2008. Older constructs worked well for many cases but there were limitations hampering development of rapidly developing advanced applications.

Only loops with a known length at run time were allowed, and only finite number of parallel sections.

This didn’t work well with certain common problems such as recursive or pointer-chasing algorithms.

Pointer chasing algorithms employ a series of irregular memory access patterns where the data of the previous memory access is required in order to determine the memory address of the next pointer.

Another flow control statement that was hard to parallelize was the while loop with it’s unknown number of iterations.

The omp task construct addresses these issues by generating tasks that are then executed by all available threads.

What is a task?

A task is composed of the code to be executed and the data environment (inputs to be used and outputs to be generated).

General concept of tasks is not new to OpenMP, but before introduction of the omp task construct tasks were implicit. For example an omp parallel construct creates implicit tasks, one task per thread and tasks execution begins.

The omp task specification introduced a different model. When a thread encounters the task construct it will create an explicit task. The execution of an explicit task may be immediate or delayed. If delayed, the task in placed in a pool of tasks associated with the current parallel region. All available threads can take tasks out of the pool and execute them until the pool is empty. The OpenMP specification further permits an implementation to suspend the generation of tasks, and to suspend the execution of specific tasks, in order to maintain system efficiency.

Due to its dynamic nature, a tasking implementation is more complicated than its static counterparts. In addition to the basic functions of creating and controlling dependencies between tasks, the run-time tasking implementation need to decide how and when to schedule tasks. The function of an ideal scheduler is very complex: it should be able to dynamically balance the workload on the threads, ensuring that all the threads do the same amount of work. At the same time it should consider data locality. To address the latter tasks operating on the same data should be scheduled for execution on the same thread to improve data reuse and maximize performance.

Another important problem affecting performance is task generation. A task-generating program can overload a system’s resources, like memory usage, placing a strain on the system. Recursive algorithms, in particular, can quickly generate millions of tasks.

As the OpenMP specification does not prescribe the algorithm for scheduling tasks for execution, the implementation of the scheduling is compiler-specific.

A typical program using tasks would follow the code shown below:

/* Create threads */
#pragma omp parallel 
{
    #pragma omp single 
    { 
        for(i<1000) /* Generate tasks */
        #pragma omp task
           work_on(i) 
    }
}

A parallel region creates a team of threads
A single thread then creates the tasks, adding them to a pool that belongs to the team
All the threads in that team execute tasks from this queue. Tasks could run in any order.

If you need to have results from the tasks before you can continue, you can use the taskwait directive to pause your program until the tasks are done.

Example - finding smallest factor

Let’s use tasks to find the smallest factor of a large number (using 4993*5393 as test case). Start with the serial code:

/* File: find_factor.c */
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int N = 26927249;
    int f;
    for (f = 2; f <= N; f++)
    {
        if (f % 200 == 0) /* Print progress */
        {
            fprintf(stdout, "checking: %li\n", f);
            fflush(stdout);
        }
        /* Check if f is a factor */
        if (N % f == 0)
        { // The remainder is 0, found factor!
            fprintf(stdout, "Factor: %li\n", f);
            exit(0); /* Terminate the program */
        }
        for (int i = 1; i < 4e6; i++)
            ; /* Burn some CPU cycles */
        /* End of factor finding block */    
    }
}

Turn the factor finding block into a task.
Generate a task for each trial factor.
Once a factor has been found, you should stop generating tasks.

/* File find_factor_omp.c */
#include <stdio.h>
#include <stdlib.h>

int main()
{
  long N = 4993 * 5393;
  long f;
#pragma omp parallel
#pragma omp single
  for (f = 2; f <= N; f++) /* Loop generating tasks */
  {
    if (f % 200 == 0) /* Print progress */
    {
      fprintf(stdout, "%li tasks generated\n", f);
      fflush(stdout);
    }
#pragma omp task
    { /* Check if f is a factor */
      if (f % 200 == 0) /* Print progress */
        fprintf(stdout, "    %li tasks done\n", f);
      if (N % f == 0)
      { // The remainder is 0, found factor!
        fprintf(stdout, "Factor: %li\n", f);
        exit(0); /* Terminate the program */
      }
      else
        for (int i = 1; i < 4e6; i++)
          ; /* Burn some CPU cycles */
    }
  } 
}

Run the program in parallel and compare execution time with the serial version.

time srun ./a.out
time srun -c4 ./a.out

How many tasks are in the pool? Not all tasks are generated right away. A scheduler controls the size of the task pool to prevent the overloading of the system. At some point, when the list of deferred tasks is too long, the implementation is allowed to stop generating new tasks, and switches every thread in the team on executing already generated tasks.
Try submitting a job with more threads and look at how many tasks will be generated.

time srun -c4 --export OMP_NUM_THREADS=20 ./a.out

Task Synchronization

Some applications require tasks to be executed in a certain order, but it is impossible to know when a task will be executed because a scheduler decides when and where to run a task. Several OpenMP directives are available for task synchronization.

For example, consider the code:

x = f();
#pragma omp task
{ y1 = g1(x); }
#pragma omp task
{ y2 = g2(x); }
z = h(y1) + h(y2);

When the statement computing z is executed, the tasks computing y1 and y2 have only been scheduled; they have not necessarily been executed yet. To fix this we need to explicitly wait until tasks computing y1 and y2 are finished before computing z. For this we need to use the omp taskwait directive:

x = f();
#pragma omp task
{ y1 = g1(x); }
#pragma omp task
{ y2 = g2(x); }
#pragma omp taskwait
z = h(y1) + h(y2);

This corrected code will generate two tasks and wait for both of them to be finished before computing z.

Controlling Task Execution Order

In some cases a task may depend on another task, but not on all generated tasks. The taskwait directive in this case is too restrictive.

Consider the following code:

#pragma omp task /* task A */
save_data(A);
#pragma omp task /* task B */
save_data(B);

#pragma omp taskwait

#pragma omp task /* task C */
compute_stats(A);
#pragma omp task /* task D */
compute_stats(B);

In this case task C should be executed after task A, but it won’t run until the execution of task B is finished.

The task depend clause allows you to provide information on the order in which tasks should be executed. However, the best way to understand task dependencies concept is to think of them as of the way to describe how different tasks access data and not in what order they should be executed.

The depend clause is followed by an access mode that can be in, out or inout.
depend(in:x) - task will read variable x
depend(out:y) - task will write variable *y, previous value of y will be ignored.
depend(inout:z) - task will both read and write variable *z
The OpenMP scheduler considers task dependencies and automatically decides whether a task is ready for execution.

The previous example rewritten with task dependencies will be:

#pragma omp task depend(out:A) /* task A */
save_data(A);
#pragma omp task depend(out:B)/* task B */
save_data(B);

#pragma omp taskwait

#pragma omp task depend(in:A)/* task C */
compute_stats(A);
#pragma omp task depend(in:B)/* task D */
compute_stats(B);

This example illustrates READ after WRITE dependency: task A writes A which is then read by the task C.

The depend clause enforces additional constraints on the scheduling of tasks or loop iterations. These constraints establish dependencies only between sibling tasks or between loop iterations. The task dependencies are fulfilled when the predecessor task has completed.

Controlling Order of Data Access

Consider the following loop:

for(i=1;i<N;i++)
  for(j=1;j<N;j++)
    x[i][j] = x[i-1][j] + x[i][j-1];

The loop cannot be parallelized with parallel for construct due to data dependency. The next task should be executed after the previous task updates the variable x.

Use tasks with dependencies to parallelize this code.
Start with the incorrectly parallelized code:

/* --- File task_depend.c --- */
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {

  int N = 8;
  int x[N][N];
  int i,j;

  /* Initialize x */
  for(i=0;i<N;i++)
    for(j=0;j<N;j++)
      x[i][j]=i+j;

  /* Serial computation */
  for(i=1;i<N;i++){
    for(j=1;j<N;j++)
      x[i][j] = x[i-1][j] + x[i][j-1];
  }

  printf("Serial result:\n");
  for(i=1;i<N;i++){
    for(j=1;j<N;j++)
      printf("%8d ",x[i][j]);
    printf("\n");
  }

  /* Reset x */
  for(i=0;i<N;i++)
    for(j=0;j<N;j++)
      x[i][j]=i+j;

  /* Parallel computation */
#pragma omp parallel 
#pragma omp single
  for(i=1;i<N;i++){
    for(j=1;j<N;j++)
    #pragma omp task
      x[i][j] = x[i-1][j] + x[i][j-1];
  }

  printf("Parallel result:\n");
  for(i=1;i<N;i++){
    for(j=1;j<N;j++)
      printf("%8d ",x[i][j]);
    printf("\n");
  }
}

You should be able to do this by only adding OpenMP directives.

solution

/* --- File task_depend_omp.c --- */
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {

  int N = 8;
  int x[N][N];
  int i,j;

  /* Initialize x */
  for(i=0;i<N;i++)
    for(j=0;j<N;j++)
      x[i][j]=i+j;

  /* Serial computation */
  for(i=1;i<N;i++){
    for(j=1;j<N;j++)
      x[i][j] = x[i-1][j] + x[i][j-1];
  }

  printf("Serial result:\n");
  for(i=1;i<N;i++){
    for(j=1;j<N;j++)
      printf("%8d ",x[i][j]);
    printf("\n");
  }

  /* Reset x */
  for(i=0;i<N;i++)
    for(j=0;j<N;j++)
      x[i][j]=i+j;

  /* Parallel computation */
#pragma omp parallel
#pragma omp single
  /* Generate parallel tasks */
  for(i=1;i<N;i++){
    for(j=1;j<N;j++)
#pragma omp task depend(out:x)
      x[i][j] = x[i-1][j] + x[i][j-1];
  }

  printf("Parallel result:\n");
  for(i=1;i<N;i++){
    for(j=1;j<N;j++)
      printf("%8d ",x[i][j]);
    printf("\n");
  }
}

Computing Fibonacci Numbers

The next example shows how task and taskwait directives can be used to compute Fibonacci numbers recursively.

In the parallel construct, the single directive calls fib(n).

The call to fib(n) generates two tasks. One of the tasks computes fib(n-1) and the other computes fib(n-2).

The two return values are added together to produce the value returned by fib(n).

Each of the calls to fib(n-1) and fib(n-2) will in turn generate two tasks. Tasks will be recursively generated until the argument passed to fib() is less than 2.

The taskwait directive ensures that both tasks generated in fib( ) compute i and j before return.
/* --- File fib.c ---*/
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Usage: ./fib [n] */
/* Default n=10 */

int fib(int n) {
  int i, j;
  if (n<2)
     return n;
  else {
     #pragma omp task shared(i)
     i=fib(n-1);
     #pragma omp task shared(j)
     j=fib(n-2);
     #pragma omp taskwait
     return i+j;
  }
}

int main(int argc, char **argv){
  int n=10, result;
  if(argc==2) {
      char *a = argv[1];
      n = atoi(a);
  }

  #pragma omp parallel
  {
     #pragma omp single
     result = fib(n);
  }
  printf("Fibonacci number %d is %d\n", n, result);
}

Tasks?

How do tasks work with different number of threads?

Solution

Each task region will be executed by one thread. OpenMP will not use more threads to execute a single task.

Task gotchas

There are a few gotchas to be aware of. While the intention is that tasks will run in parallel, there is nothing in the specification that guarantees this behavior. You may need to check how your particular environment works.

Key Points

OpenMP can manage general parallel tasks

tasks now allow the parallelization of applications exhibiting irregular parallelism such as recursive algorithms, and pointer-based data structures.

Calculating Electrostatic Energy

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How do we handle irregular tasks?

Objectives

Learn about the schedule() clause

Calculating total electrostatic potential energy of a set of charges is a common problem in molecular simulations. To compute it we need to sum interactions between all pairs of charges:

[E = \frac{charge_i*charge_j}{distance_{i,j}}]

Start with the following boilerplate:

/* --- File elect_energy.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

int main(int argc, char **argv) {
	struct timespec ts_start, ts_end;
	int size = 20;
	int n_charges = size*size*size;
	float scale=0.5;
	float *charge, *x, *y, *z;
	float **M;
	int i,j,k;
	float time_total;

	charge=malloc(n_charges*sizeof(float));
	x=malloc(n_charges*sizeof(float));
	y=malloc(n_charges*sizeof(float));
	z=malloc(n_charges*sizeof(float));
	M=(float**)malloc(n_charges*sizeof(float*));
	for (i=0;i<n_charges;i++)
		M[i]=malloc(n_charges*sizeof(float));

/* initialize x,y,z coordinates and charges */
	int n=0;
	for (i=0; i<size; i++)
		for (j=0; j<size; j++)
			for (k=0; k<size; k++) {
				x[n]=i*scale;
				y[n]=j*scale;
				z[n]=k*scale;
				charge[n]=0.33;
				n++;
			}
	clock_gettime(CLOCK_MONOTONIC, &ts_start);

	// Calculate electrostatic energy: sum of charge[i]*charge[j]/dist[i,j] */
	float dx, dy, dz, dist;
	double Energy=0.0f;

/* Loop calculating  Energy = charge[i]*charge[j]/dist[i,j] */

	clock_gettime(CLOCK_MONOTONIC, &ts_end);
	time_total = (ts_end.tv_sec - ts_start.tv_sec)*1e9 + (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("\nTotal time is %f ms, Energy is %f\n", time_total/1e6, Energy);
}

This code generates a set of charges placed at the nodes of the cubic grid. The grid spacing is defined by the scale variable.

Write the loop computing sum of all pairwise electrostatic interactions. Recollect that because $E_{ij}=E_{ji}$ we need to compute only half of all interactions.
First, compile and run the program without OpenMP. Note how long it took to run. A millisecond is not enough to get good performance measurements.
Next, increase the size to 30 and recompile. Check the run time. Try with the size = 40.

Now comes the parallelization.

Parallelize the energy code

Decide what variable or variables should be made private, and then compile and test the code.
Run on few different numbers of CPUs. How does the performance scale?

Solution

/* --- File elect_energy_omp.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

int main(int argc, char **argv) {
	struct timespec ts_start, ts_end;
	int size = 30;
	int n_charges = size*size*size;
	float scale=0.5;
	float *charge, *x, *y, *z;
	float **M;
	int i,j,k;
	float time_total;

	charge=malloc(n_charges*sizeof(float));
	x=malloc(n_charges*sizeof(float));
	y=malloc(n_charges*sizeof(float));
	z=malloc(n_charges*sizeof(float));
	M=(float**)malloc(n_charges*sizeof(float*));
	for (i=0;i<n_charges;i++)
		M[i]=malloc(n_charges*sizeof(float));

/* initialize x,y,z coordinates and charges */
	int n=0;
	for (i=0; i<size; i++)
		for (j=0; j<size; j++)
			for (k=0; k<size; k++) {
				x[n]=i*scale;
				y[n]=j*scale;
				z[n]=k*scale;
				charge[n]=0.33;
				n++;
			}
	clock_gettime(CLOCK_MONOTONIC, &ts_start);

	// Calculate electrostatic energy: sum of charge[i]*charge[j]/dist[i,j] */
	float dx, dy, dz, dist;
	double Energy=0.0f;
#pragma omp parallel for private(j,dx,dy,dz,dist) reduction(+:Energy) schedule(static,50)
	for (i = 0; i < n_charges; i++) {
		for (j = i+1; j < n_charges; j++) {
			dx = x[i]-x[j];
			dy = y[i]-y[j];
			dz = z[i]-z[j];
			dist=sqrt(dx*dx + dy*dy + dz*dz);
			Energy += charge[i]*charge[j]/dist;
		}
	}

	clock_gettime(CLOCK_MONOTONIC, &ts_end);
	time_total = (ts_end.tv_sec - ts_start.tv_sec)*1e9 + (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("\nTotal time is %f ms, Energy is %f\n", time_total/1e6, Energy);
}

The schedule() clause

OpenMP loop directives (parallel for, parallel do) can take several other clauses besides the private() clause we’ve already seen. One is schedule(), which allows us to specify how loop iterations are divided up among the threads.

The default is static scheduling, in which all iterations are allocated to threads before they execute any loop iterations.

In dynamic scheduling, only some of the iterations are allocated to threads at the beginning of the loop’s execution. Threads that complete their iterations are then eligible to get additional work. The allocation process continues until all the iterations have been distributed to threads.

Both scheduling types also take a chunk size; larger chunks mean less overhead and greater memory locality, smaller chunks may mean finer load balancing. You can omit the chunk size, it defaults to 1 for dynamic schedule and to $N_{iterrations}/{N_{threads}}$ for static schedule.

Bad load balancing might be what’s causing this code not to parallelize very well. As we are computing triangular part of the interaction matrix static scheduling with the default chunk size will lead to uneven load.

Let’s add a schedule(dynamic) or schedule(static,100) clause and see what happens.

Play with the schedule() clause

Try different schedule() clauses and tabulate the run times with different thread numbers. What seems to work best for this problem?

Does it change much if you grow the problem size? That is, if you make size bigger?

There’s a third option, guided, which starts with large chunks and gradually decreases the chunk size as it works through the iterations. Try it out too, if you like. With schedule(guided,<chunk>), the chunk parameter is the smallest chunk size it will try.

Key Points

Different loop scheduling may compensate for unbalanced loop iterations

Drawing the Mandelbrot set

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How do we handle irregular tasks?

Objectives

Learn about the schedule() clause

The Mandelbrot set was a hot subject of computer art in the 1980s. The algorithm is quite simple:

For each point on the screen, do an iterative calculation and decide whether the calculation diverges or not. Color that spot on the screen according to how many iterations it took to diverge or black if it didn’t diverge in 1000 iterations.

# include <stdlib.h>
# include <stdio.h>
# include <math.h>
# include <time.h>

int main ( )
{
int m = 1000;
int n = 1000;

struct timespec ts_start, ts_end;
float time_total;
int c;
int count_max = 1000;
int i, j, k;
int jhi, jlo;
char *output_filename = "mandelbrot.ppm";
FILE *output_unit;
int **r, **g, **b, **count;
double x_max =   1.25;
double x_min = - 2.25;
double x, x1, x2;
double y_max =   1.75;
double y_min = - 1.75;
double y, y1, y2;
int i4_min ( int i1, int i2 );

/* Variables:
i, j - loop variables
k - pixel iteration variable
n, m - width and height of the image
x, x1, x2, y, y1, y2 - pixel coordinates mapped to the image range [n,m]
r, g, b - red, green, blue pixel color values [0-255] */

/* Allocate memory for pixels */
r=(int**)malloc(m*sizeof(int*));
for(i=0;i<m;i++)
  r[i]=malloc(n*sizeof(int));
g=(int**)malloc(m*sizeof(int*));
for(i=0;i<m;i++)
  g[i]=malloc(n*sizeof(int));
b=(int**)malloc(m*sizeof(int*));
for(i=0;i<m;i++)
  b[i]=malloc(n*sizeof(int));
count=(int**)malloc(m*sizeof(int*));
for(i=0;i<m;i++)
  count[i]=malloc(n*sizeof(int));

/* Record start time */ 
clock_gettime(CLOCK_MONOTONIC, &ts_start);

/* Carry out the iteration for each pixel, determining COUNT */
for ( i = 0; i < m; i++ )
  {
    y = ((i-1)*y_max + (m-i)*y_min)/(m-1);
    for ( j = 0; j < n; j++ )
      {
      x = ((j-1)*x_max + (n-j)*x_min)/(n-1);
      count[i][j] = 0;
      x1 = x;
      y1 = y;

      for ( k = 1; k <= count_max; k++ )
        {
          x2 = x1*x1 - y1*y1 + x;
          y2 = 2*x1*y1 + y;

          if (x2 < -2.0 || 2.0 < x2 || y2 < -2.0 || 2.0 < y2 )
          {
          count[i][j] = k;
          break;
        }
        x1 = x2;
        y1 = y2;
      }
      /* If count is 0 the point is not in set */
      if ( ( count[i][j] % 2 ) == 1 )
      {
        r[i][j] = 255;
        g[i][j] = 255;
        b[i][j] = 255;
      }
      else
      {
        c = (int) (255.0*(1-log(count[i][j])/log(count_max)));
        r[i][j] = c;
        g[i][j] = c;
        b[i][j] = c;
      }
    }
  }

clock_gettime(CLOCK_MONOTONIC, &ts_end);
time_total = (ts_end.tv_sec - ts_start.tv_sec)*1e9 + (ts_end.tv_nsec - ts_start.tv_nsec);
printf("\nTotal time is %f ms", time_total/1e6);

/* Write data to an ASCII PPM file. */
output_unit = fopen ( output_filename, "wt" );

fprintf ( output_unit, "P3\n" );
fprintf ( output_unit, "%d  %d\n", n, m );
fprintf ( output_unit, "%d\n", 255 );

for ( i = 0; i < m; i++ )
{
for ( jlo = 0; jlo < n; jlo = jlo + 4 )
  {
    jhi = i4_min ( jlo + 4, n );
    for ( j = jlo; j < jhi; j++ )
      {
        fprintf ( output_unit, "  %d  %d  %d", r[i][j], g[i][j], b[i][j] );
      }
      fprintf ( output_unit, "\n" );
  }
}

fclose ( output_unit );
printf ( "\n" );
printf ( "  Graphics data written to \"%s\".\n", output_filename );
printf ( "\n" );
return 0;
}

int i4_min ( int i1, int i2 ) {
int value;
if ( i1 < i2 )
  value = i1;
else
  value = i2;
return value;
}

First, compile and run the program without OpenMP.
Note how long it took to run. A millisecond is not enough to get good performance measurements on.
Next, increase the dimensions m,n to 3000,3000 and recompile. Check the run time.

Now comes the parallelization.

Parallelize the Mandelbrot Code

Decide what variable or variables should be made private, and then compile and test the code.

Run on few different numbers of CPUs. How does the performance scale?

The schedule() clause

The default is static scheduling, in which all iterations are allocated to threads before they execute any loop iterations. In dynamic scheduling, only some of the iterations are allocated to threads at the beginning of the loop’s execution. Threads that complete their iterations are then eligible to get additional work. The allocation process continues until all the iterations have been distributed to threads.

There’s a tradeoff between overhead (i.e., how much time is spent setting up the schedule) and load balancing (i.e., how much time is spent waiting for the most heavily-worked thread to catch up). Static scheduling has low overhead but may be badly balanced; dynamic scheduling has higher overhead. Both can also take a chunk size; larger chunks mean less overhead and greater memory locality, smaller chunks may mean finer load balancing. You can omit the chunk size, it defaults to 1.

Bad load balancing might be what’s causing this Mandelbrot code not to parallelize very well. Let’s add a schedule(dynamic) clause and see what happens.

Play with the schedule() clause

Try different schedule() clauses and tabulate the run times with different thread numbers. What seems to work best for this problem?

Does it change much if you grow the problem size? That is, if you make m,n bigger?

There’s a third option, guided, which starts with large chunks and gradually decreases the chunk size as it works through the iterations. Try it out too, if you like. With schedule(guided,<chunk>), the chunk parameter is the smallest chunk size it will try.

Key Points

Different loop scheduling may compensate for unbalanced loop iterations

Introduction to GPU Programming with OpenACC

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How to program GPU?

Objectives

Learn about tools available for programming GPU

Learn about programming GPU with openACC

Introduction

As we learned in the General Parallel Computing lesson increasing performance is based on various strategies such as CPU frequency, multicore processing, vectorization, parallel distributed computing. At present performance is mostly limited by power consumption. Accelerators such as Nvidia Tesla GPUs are based on a very high level of parallelism and are capable to reach high performance at relatively low power consumption. GPUs can deliver very high performance per compute node and today, GPGPUs are the choice of hardware to accelerate computational workloads in HPC settings. Let’s look at the GPU architecture to understand why they are a good fit for various types of HPC jobs.

The GPU Architecture

CPUs are optimized to carry out tasks as quick as possible, while keeping the ability to quickly switch between operations. It’s nature is all about processing tasks in a serialized way

GPUs achieve high throughput by trading single-threaded performance in favor of several orders in magnitude more parallelism.

Tools for programming GPUs

There are several tools available for programming GPU.

CUDA. CUDA is NVIDIA-specific programming model and language. You can get the most out of your GPU with CUDA. CUDA-C and CUDA-Fortran compilers are available. Difficult to program, porting existing C/C++ or Fortran code onto the GPU with CUDA requires significant code refactoring.
OpenMP via the target construct. OpenMP on GPUs
OpenCL. Open Computing Language is a framework that allows to write programs executing across platforms consisting of CPU, GPU, FPGA, and other hardware accelerators. It is very complex and hard to program. Adoption of OpenCL is still low.
PyCuDA. Gives access to CUDA functionality from Python code.
OpenACC. OpenACC offers a quick path to accelerated computing with less programming effort.

Examples of OpenACC accelerated applications: VASP 6 (ab initio atomic scale modeling), TINKER 9 (molecular dynamics) NVIDIA GPU Accelerated VASP 6 uses OpenACC to Deliver 15X More Performance

OpenACC compilers

GCC
PGI
Nvidia HPC SDK

OpenACC Example: Solving the Discrete Poisson’s Equation using Jacobi’s Method.

Poisson’s equation arises in heat flow, diffusion, electrostatic and gravitational potential, fluid dynamics, .. etc. The governing equations of these processes are partial differential equations, which describe how each of the variables changes as a function of all the others.

[\nabla^2U(x,y)=f(x,y)]

Here $ \nabla $ is the discrete Laplace operator, $ f(x,y) $ is a known function, and $ U(x,y) $ is the function to be determined.

This equation describes, for example, steady-state temperature of a uniform square plate. To make a solution defined we need to specify boundary conditions, i.e the value of $ U $ on the boundary of our square plate. We will consider the simple case with the boundaries kept at temperature $ U = 0 $

Jacobi’s method in 2D.

Common useful iterative algorithm.
For simplicity we will consider the case where $ f(x,y)=0 $ everywhere except one hot spot in the middle of the plate.
For discretization on a uniform spatial grid $ n\times m $ the solution can be found by iteratively computing next value at the grid point $ i,j $ as an average of its four neighboring values and the function $ f(i,j) $.

[U(i,j,m+1) = ( U(i-1,j,m) + U(i+1,j,m) + U(i,j-1,m) + U(i,j+1,m) + f(i,j) )/4]

http://people.eecs.berkeley.edu/~demmel/cs267/lecture24/lecture24.html https://nielshagoort.com/2019/03/12/exploring-the-gpu-architecture/

View NVIDIA HPC-SDK documentation.

compiling: gcc laplace2d_omp.c -O3 -fopenmp -lm nvc laplace2d_omp.c -O3 -mp

running: OMP_NUM_THREADS=8 ./a.out

2048x2048, 10000 iter gcc 8 threads, 62.328627 sec nvc, v100 , 22.748066 sec

4096x4096, 10000 iter gcc, 8 threads, 122.991738 sec icc, 8 threads, 83.892027 sec nvc, v100 , 48.860725 sec

Open ACC

nvc laplace2d_acc.c -fast -acc -ta=nvidia -Minfo=accel

Let’s just drop in a single, simple OpenACC directive before each of our for loop nests in the previous code. Just after the #pragma omp lines, we add the following. (We leave the OpenMP directive in place so we retain portability to multicore CPUs.)

#pragma acc kernels This tells the compiler to generate parallel accelerator kernels (CUDA kernels in our case) for the loop nests following the directive. To compile this code, we tell the PGI compiler to compile OpenACC directives and target NVIDIA GPUs using the -acc -ta=nvidia command line options (-ta=nvidia means “target architecture = NVIDIA GPU”). We can also enable verbose information about the parallelization with the -Minfo=accel option. If you are interested in the details, check out the Appendix below.

Let’s run the program. Our test PC has an Intel Xeon X5550 CPU and an NVIDIA Tesla M2090 GPU. When I run it, I see that it takes about 75 seconds. Uh oh, that’s a bit of a slow-down compared to my parallel CPU code. What’s going on? We can modify the compilation command to enable built-in profiling by specifying -ta=nvidia,time. Now when I run it I get:

#Install numpy 1.17.5: wget https://github.com/numpy/numpy/releases/download/v1.17.5/numpy-1.17.5.tar.gz tar -xf numpy-1.17.5.tar.gz cd numpy-1.17.5 python setup.py install

Vectorization: pointer aliasing

Compiler did not identify loops as parallelizable.

MC: ml StdEnv/2020 nvhpc/20.7 cuda/11.0 nvc laplace2d_acc.c -O3 -acc -ta=tesla -Minfo=accel,time

module load StdEnv/2020 python/3.7.9 scipy-stack virtualenv env-phoebe source env-phoebe/bin/activate

git clone git://github.com/astropy/astropy.git cd astropy ASTROPY_USE_SYSTEM_EXPAT=1 pip install -e . pip install phoebe==2.3.40

working:

module load StdEnv/2016 python virtualenv env-phoebe source env-phoebe/bin/activate pip install phoebe==2.3.40

Key Points

Different loop scheduling may compensate for unbalanced loop iterations

ACENET Summer School - Directive-Based Parallel Programming with OpenMP and OpenACC

Introduction

Overview

Shared Memory OpenMP Programming

OpenMP Execution Model

Compiling OpenMP programs

Compiling C code

GCC on Compute Canada

A Very Quick Introduction to C

Preprocessor directives

Basic Syntax

Defining Functions

Using Memory

Key Points

Hello World

Overview

Adding parallelism to a program

Execution steps of the parallel “Hello, world” program

Using multiple cores

Solution

OpenMP with SLURM

Identifying threads

Download and Unpack the Code.

Pragmas and code blocks in FORTRAN

Thread ordering

Solution

Conditional Compilation

Solution

Work-Sharing Constructs

For

Stack Overflow

Sections

Exercise

Solution

Single

Key Points

Parallel Operations with Arrays

Overview

Multiplying an array by a constant

Compiling and running a serial version

Time and Size

Creating a Parallel Version

Compiling and running a parallel version

Using threads

Summing the values in a matrix

Is the result correct?

Solution

OpenMP private clause

Be Careful with Data Dependencies

Types of data dependencies:

Is There a Dependency?

Solution

Thread-safe functions

Optimizing performance

CPU cache and data locality

Avoid parallel overhead at low iteration counts

Minimize parallelization overhead

Key Points

Race Conditions with OpenMP

Overview

Introduction

Parallel numerical integration

Compiling code with mathematical functions

Parallelizing Numerical Integration

How to avoid data race conditions?

The omp critical directive

The omp atomic directive

Parallel Performance

Solution

The Optimal Way to Parallelize Integration with OpenMP

Finding the maximum value in an array

Key Points

Searching through data

Overview

Private variables

Single threaded function

Which thread runs first?

Solution

Finding the maximum value in an array

Reduction Operators