## Thursday, August 30, 2012

### Comparing dates with OpenCL - And a mktime function that scales

I recently came across a problem that is simple in C++ but surprisingly difficult in OpenCL: comparing dates. Since I needed this functionality in one of my algorithms, I implemented my own OpenCL diffDays function, shown below.

### OpenCL

```c
// Globals
__constant int DaysInMonths[12] = { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };

// Macros
#define isleap(y) ((((y) % 4) == 0 && ((y) % 100) != 0) || ((y) % 400) == 0)

typedef struct
{
    int tm_sec;
    int tm_min;
    int tm_hour;
    int tm_mday;
    int tm_mon;   // months since January [0-11], as in struct tm
    int tm_year;  // years since 1900, as in struct tm
    int tm_wday;
    int tm_yday;
    int tm_isdst;
} TimeStructure;

// Computes the day of the year (tm_yday) from the month and the day of the month
void computeYearDay( TimeStructure* t )
{
    int nbDays = 0;
    for( int i = 0; i < t->tm_mon; ++i )
    {
        nbDays += DaysInMonths[i];
    }
    nbDays += t->tm_mday;
    // Add the leap day only once February is over
    if( t->tm_mon > 1 && isleap( 1900 + t->tm_year ) )
    {
        ++nbDays;
    }
    t->tm_yday = nbDays;
}

int diffDays( TimeStructure fromDate, TimeStructure toDate )
{
    float fndaysFrom = 0.f;
    float fndaysTo   = 0.f;
    int ndays = 0;

    computeYearDay( &fromDate );
    computeYearDay( &toDate );

    int toYear   = 1900 + toDate.tm_year;
    int fromYear = 1900 + fromDate.tm_year;

    // Fractional day of the year for both dates
    fndaysFrom  = (float)(fromDate.tm_yday);
    fndaysFrom += (float)(fromDate.tm_hour)/24.f;
    fndaysFrom += (float)(fromDate.tm_min)/1440.f;
    fndaysFrom += (float)(fromDate.tm_sec)/86400.f;
    fndaysTo  = (float)(toDate.tm_yday);
    fndaysTo += (float)(toDate.tm_hour)/24.f;
    fndaysTo += (float)(toDate.tm_min)/1440.f;
    fndaysTo += (float)(toDate.tm_sec)/86400.f;

    // Whole days contributed by the full years in between
    for( int i = fromYear; i < toYear; i++ )
    {
        ndays += isleap(i) ? 366 : 365;
    }
    return ndays + (int)(fndaysTo - fndaysFrom);
}
```

The funny thing is that when I compared the results between the CPU execution (using OpenMP) and the OpenCL implementation, I realized that my C++ code did not scale properly. Even worse, it was slowing down as I increased the number of cores. After a good profiling session with Intel Parallel Studio, I found that the bottleneck was the mktime function shipped with Microsoft Visual C++ 2010, presumably because of internal locking around the time-zone state. So I ported the OpenCL function back to C++, which resulted in an impressive 20x performance increase on 8 cores.

And I now have a mktime that scales on multicore architectures!

### C/C++

```cpp
#include <ctime>

#define isleap(y) ((((y) % 4) == 0 && ((y) % 100) != 0) || ((y) % 400) == 0)
const int DaysInMonths[12] = { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };

// Computes the day of the year (tm_yday) from the month and the day of the month
void computeYearDay( tm& t )
{
    int nbDays = 0;
    for( int i = 0; i < t.tm_mon; ++i )
    {
        nbDays += DaysInMonths[i];
    }
    nbDays += t.tm_mday;
    // Add the leap day only once February is over
    if( t.tm_mon > 1 && isleap( 1900 + t.tm_year ) )
    {
        ++nbDays;
    }
    t.tm_yday = nbDays;
}

int diffDays( tm from, tm to )
{
    float fndaysFrom = 0.f;
    float fndaysTo   = 0.f;
    int ndays = 0;

    computeYearDay( from );
    computeYearDay( to );

    int toYear   = 1900 + to.tm_year;
    int fromYear = 1900 + from.tm_year;

    // Fractional day of the year for both dates
    fndaysFrom  = static_cast<float>(from.tm_yday);
    fndaysFrom += static_cast<float>(from.tm_hour)/24.f;
    fndaysFrom += static_cast<float>(from.tm_min)/1440.f;
    fndaysFrom += static_cast<float>(from.tm_sec)/86400.f;

    fndaysTo  = static_cast<float>(to.tm_yday);
    fndaysTo += static_cast<float>(to.tm_hour)/24.f;
    fndaysTo += static_cast<float>(to.tm_min)/1440.f;
    fndaysTo += static_cast<float>(to.tm_sec)/86400.f;

    // Whole days contributed by the full years in between
    for( int i = fromYear; i < toYear; i++ )
    {
        ndays += isleap(i) ? 366 : 365;
    }
    return ndays + static_cast<int>(fndaysTo - fndaysFrom);
}
```

## Wednesday, August 15, 2012

### Optimizing performance when using atomic operations

Here is the problem: we have a list of containers, each containing a variable number of floats. What we want to achieve is counting the number of containers for which the sum of the float values is greater than a given limit.
The parallelism is done at the container level, meaning that each thread processes one container. The per-thread algorithm is simple: walk through the list of floats and sum them up. Once this is done, there is a concurrency issue when it comes to counting the containers whose sum is greater than our given limit. The simple solution is to use an atomic add in each thread when the limit condition is true. But can't we be a bit more efficient? Of course we can!
On GPUs, threads are organized in blocks, and memory can be shared easily and efficiently between threads of the same block. So the idea is to use the atomic add at the block level rather than at the thread level. To do so, a shared 2D array is instantiated for each 2D block so that every thread of the block has its own slot in the array. By doing so, we avoid contention between threads of the same block.
```cuda
__shared__ int r[BLOCK_SIZE][BLOCK_SIZE];
```

Each thread from the block can now perform its algorithm and store its result in the shared array.

```cuda
// Compute the global index of the thread
int x = blockDim.x*blockIdx.x + threadIdx.x;
int y = blockDim.y*blockIdx.y + threadIdx.y;
int index = y*(blockDim.x*gridDim.x) + x; // row width is the full grid width

// Here comes the business stuff
Container p = containers[index];
float sum = 0.f;
for( int i = 0; i < p.nbElements; ++i )
{
    sum += p.elements[i];
}
r[threadIdx.x][threadIdx.y] = ( sum > limit ) ? 1 : 0;
```

Once the threads of the block are done and synchronized using the __syncthreads() CUDA function, the elements of the array are summed up and the result is added to the global counter with a single atomic add. Note that this final step should only be performed by one single thread of the block.

```cuda
__syncthreads(); // Make sure every thread of the block has written its result

if( threadIdx.x == 0 && threadIdx.y == 0 )  // Only one thread should do this job.
{
    int s = 0;
    for( int x = 0; x < BLOCK_SIZE; ++x )
    {
        for( int y = 0; y < BLOCK_SIZE; ++y )
        {
            s += r[x][y];
        }
    }
    atomicAdd( result, s ); // One atomic add per block instead of one per thread
}
```

The full code will look like this:

```cuda
struct Container
{
    int nbElements;
    float elements[NB_MAX_ELEMENTS_PER_CONTAINER];
};

__global__ void cuda_kernel( Container* containers, int* result, float limit )
{
    __shared__ int r[BLOCK_SIZE][BLOCK_SIZE];

    // Compute the global index of the thread
    int x = blockDim.x*blockIdx.x + threadIdx.x;
    int y = blockDim.y*blockIdx.y + threadIdx.y;
    int index = y*(blockDim.x*gridDim.x) + x;

    // Here comes the business stuff
    Container p = containers[index];
    float sum = 0.f;
    for( int i = 0; i < p.nbElements; ++i )
    {
        sum += p.elements[i];
    }

    r[threadIdx.x][threadIdx.y] = ( sum > limit ) ? 1 : 0;

    __syncthreads(); // Wait for every thread of the block

    if( threadIdx.x == 0 && threadIdx.y == 0 )
    {
        int s = 0;
        for( int x = 0; x < BLOCK_SIZE; ++x )
        {
            for( int y = 0; y < BLOCK_SIZE; ++y )
            {
                s += r[x][y];
            }
        }
        atomicAdd( result, s ); // One atomic add per block instead of one per thread
    }
}
```