Sunday, December 18, 2011

Superfast Bootstrapping

In statistics, bootstrapping is a computer-based method for assigning measures of accuracy to sample estimates (Efron and Tibshirani 1994). The technique allows the sampling distribution of almost any statistic to be estimated using only very simple methods. Generally, it falls into the broader class of resampling methods.
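As a concrete illustration, here is a minimal CPU sketch of the idea (the function name and parameters are mine, not part of the GPU code below): resample the data with replacement many times and collect the statistic of interest, here the mean, from each resample.

```cpp
#include <random>
#include <vector>

// Minimal bootstrap sketch: draw nbResamples resamples of the data with
// replacement and return the mean of each one; the spread of these means
// estimates the accuracy of the sample mean.
std::vector<double> bootstrapMeans(const std::vector<double>& data,
                                   int nbResamples, unsigned int seed)
{
    std::mt19937 gen(seed);
    std::uniform_int_distribution<size_t> pick(0, data.size() - 1);
    std::vector<double> means;
    means.reserve(nbResamples);
    for (int s = 0; s < nbResamples; ++s)
    {
        double sum = 0.0;
        for (size_t i = 0; i < data.size(); ++i)
            sum += data[pick(gen)];          // draw with replacement
        means.push_back(sum / data.size());
    }
    return means;
}
```

The GPU version below does exactly this, with one thread per resampled series.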

Introducing GPU acceleration into MS Excel user-defined functions makes this technique easily accessible to a wide audience.

CUDA Kernels


/*
* --------------------------------------------------------------------------------
* Bootstrap kernels
* --------------------------------------------------------------------------------
*/


__global__ void kernel_bs_normalizeRandoms(
  unsigned int* data,
  unsigned int* randoms,
  unsigned int nbElements )
{
  // Compute the global thread index (2D grid of 2D blocks)
  unsigned int x = blockIdx.x*blockDim.x+threadIdx.x;
  unsigned int y = blockIdx.y*blockDim.y+threadIdx.y;
  unsigned int index = y*(gridDim.x*blockDim.x) + x;

  unsigned int randIndex = index*nbElements;

  // Bring the data and random values into the [0..nbElements-1] range
  for( unsigned int i(0); i<nbElements; ++i )
  {
    if( index==0 )
    {
      data[i] = data[i]%nbElements;
    }
    randoms[randIndex+i] = randoms[randIndex+i]%nbElements;
  }
}

__global__ void kernel_bs_resample(
  unsigned int* data,
  unsigned int* randoms,
  float* series,
  unsigned int nbSeries,
  unsigned int nbElements )
{
  // Compute the global thread index (2D grid of 2D blocks)
  unsigned int x = blockIdx.x*blockDim.x+threadIdx.x;
  unsigned int y = blockIdx.y*blockDim.y+threadIdx.y;
  unsigned int index = y*(gridDim.x*blockDim.x) + x;
  if( index >= nbSeries ) return;

  unsigned int randIndex = index*nbElements;

  // Average the resampled values
  float sum(0.f);
  for( unsigned int i(0); i<nbElements; ++i )
  {
    unsigned int r = randoms[randIndex+i];
    if( r<nbElements )
    {
      sum += data[r];
    }
  }
  // Store the average (float division avoids integer truncation)
  series[index] = sum/nbElements;
}

__global__ void kernel_bs_distribute(
  float* series,
  float scale,
  unsigned int nbElements,
  unsigned int* distribution,
  unsigned int distributionSize )
{
  // Compute the global thread index (2D grid of 2D blocks)
  unsigned int x = blockIdx.x*blockDim.x+threadIdx.x;
  unsigned int y = blockIdx.y*blockDim.y+threadIdx.y;
  unsigned int index = y*(gridDim.x*blockDim.x) + x;

  float position = series[index] * scale;

  if( position>=0.f && position<distributionSize )
  {
    // TODO: Can be optimized with shared memory within each block
    atomicAdd(&distribution[(unsigned int)position],1);
  }
}
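For validating the GPU output on small inputs, the three kernels can be mirrored on the CPU in a few lines. This is a sketch; the function name is mine, and one loop iteration plays the role of one GPU thread.

```cpp
#include <vector>

// CPU mirror of the three bootstrap kernels above:
// normalize the random draws, average each resample, then bin the averages.
std::vector<unsigned int> bootstrapHistogram(
    const std::vector<unsigned int>& data,
    const std::vector<unsigned int>& randoms, // nbSeries*nbElements raw values
    unsigned int nbSeries, unsigned int nbElements,
    float scale, unsigned int distributionSize)
{
    std::vector<unsigned int> distribution(distributionSize, 0);
    for (unsigned int s = 0; s < nbSeries; ++s)
    {
        // kernel_bs_normalizeRandoms + kernel_bs_resample: average the resample
        float sum = 0.f;
        for (unsigned int i = 0; i < nbElements; ++i)
        {
            unsigned int r = randoms[s * nbElements + i] % nbElements;
            sum += (float)data[r];
        }
        float average = sum / nbElements;

        // kernel_bs_distribute: bin the average (atomicAdd becomes a plain ++)
        float position = average * scale;
        if (position >= 0.f && position < (float)distributionSize)
            ++distribution[(unsigned int)position];
    }
    return distribution;
}
```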

Tuesday, December 13, 2011

Half a million Monte Carlo simulations per second in Microsoft Excel 2010

Half a million Monte Carlo simulations per second! Release the power of your GPU from MS Excel 2010. Implementing user-defined functions with CUDA or OpenCL opens up a whole new dimension of supercomputing in Microsoft Office.


The algorithm is composed of three parts:
1. Generate random numbers for n series
2. Compute the performance of each of these series
3. Distribute the performances into the resulting set.

Step 1: This is done using the NVIDIA cuRAND library.
Step 2: This is performed by the performanceStorageKernel function described below.
Step 3: The trick here is to use an atomic operation (atomicAdd) to build the resulting Gauss-like curve. Note that atomic operations are only available on architectures sm_13 or higher. This is performed by the frequenciesKernel function.
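Put together, the three steps amount to the following CPU sketch (names are mine; std::normal_distribution stands in for cuRAND, and the binning mirrors frequenciesKernel):

```cpp
#include <random>
#include <vector>

// 1. draw normally distributed returns, 2. compound them per series,
// 3. bin the resulting performances into a histogram over [-range/2, range/2].
std::vector<int> monteCarloHistogram(int nbSeries, int nbObjects,
                                     float mu, float sigma, float range,
                                     int nbFrequencies, unsigned int seed)
{
    std::mt19937 gen(seed);
    std::normal_distribution<float> normal(mu, sigma);   // step 1
    std::vector<int> frequencies(nbFrequencies, 0);
    for (int s = 0; s < nbSeries; ++s)
    {
        float performance = 1.f;                         // step 2: compound returns
        for (int i = 0; i < nbObjects; ++i)
            performance *= (1.f + normal(gen));
        float v = (performance - 1.f) + range / 2.f;     // step 3: shift and bin
        int position = (int)(v / (range / nbFrequencies));
        if (position >= 0 && position < nbFrequencies)
            ++frequencies[position];
    }
    return frequencies;
}
```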


CUDA Kernels

#include <curand.h>
#include <cuda.h>
#include <math.h>
#include <windows.h>

cudaDeviceProp _dDeviceProp;

// Device buffers
float* _dObjects;
float* _dSeries;
float* _dPerformances;
int*   _dFrequencies;

// Variables
int _dNbObjects;
int _dNbFrequencies;
int _dNbSeries;

/**
* @brief Initializes the testcase by
* - allocating the GPU memory
* - storing the problem dimensions.
* This function is executed on the Host.
* @param NbSeries Number of Monte Carlo series
* @param NbObjects Number of objects (returns) per series
* @param NbFrequencies Number of bins in the output distribution
*/
extern "C"
void initialize_device( int NbSeries, int NbObjects, int NbFrequencies )
{
  // Testcase initialization
  _dNbObjects = NbObjects;
  _dNbFrequencies = NbFrequencies;
  _dNbSeries = NbSeries;

  // Device allocation
  cudaMalloc( (void**)&_dObjects, NbSeries*NbObjects*sizeof(float) );
  cudaMalloc( (void**)&_dSeries, NbSeries*sizeof(float) );
  cudaMalloc( (void**)&_dFrequencies, NbFrequencies*sizeof(int) );
  cudaMalloc( (void**)&_dPerformances, NbObjects*sizeof(float) );
}

/**
* @brief Compounds the returns of each series and stores the
* resulting performance in the series array.
*/
__global__ void performanceStorageKernel( float* series, float* objects, int nbObjects, float* performances )
{
  // Compute the global thread index (2D grid of 2D blocks)
  unsigned int x = blockIdx.x*blockDim.x+threadIdx.x;
  unsigned int y = blockIdx.y*blockDim.y+threadIdx.y;
  unsigned int index = y*(gridDim.x*blockDim.x) + x;

  int objectsIndex = index*nbObjects;
  // Compound the returns of the series. This must be a per-thread local
  // value: a __shared__ scratch array here would be racy across the block.
  float performance = 1.f;
  for( int i(0); i<nbObjects; ++i )
  {
    performance *= (1.f+objects[objectsIndex+i]);
    // Thread 0 also records the cumulative performance path
    if( index == 0 ) performances[i] = performance;
  }
  // Store the performance of the series
  series[index] = performance - 1.f;
}

/**
* @brief Kernel function to be executed on the GPU
* @param ptr Pointer to an array of floats stored in GPU memory
*/
__global__ void frequenciesKernel( float* series, float range, int* frequencies, int nbFrequencies )
{
  // Compute the global thread index (2D grid of 2D blocks)
  unsigned int x = blockIdx.x*blockDim.x+threadIdx.x;
  unsigned int y = blockIdx.y*blockDim.y+threadIdx.y;
  unsigned int index = y*(gridDim.x*blockDim.x) + x;

  // Shift so that [-range/2, range/2] maps onto [0, nbFrequencies)
  float v = series[index]+range/2;
  int position = (int)(v/(range/nbFrequencies));
  if( position>=0 && position<nbFrequencies )
  {
    atomicAdd(&frequencies[position],1);
  }
}

/**
* @brief Run the kernel on the GPU.
* This function is executed on the Host.
*/
extern "C"
void run_kernel( float range, float mu, float sigma, int random, int nbThreadsPerBlock )
{
  // Create the pseudo-random number generator and fill the returns buffer.
  // Note: curandGenerateNormal requires an even number of values.
  curandGenerator_t gen;
  curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
  curandSetPseudoRandomGeneratorSeed(gen, random);
  curandGenerateNormal(gen, _dObjects, _dNbSeries*_dNbObjects, mu, sigma );
  curandDestroyGenerator(gen);

  dim3 block(8, 8, 1);
  int gridDim = (int)sqrt(float(_dNbSeries));
  dim3 gridPerformances( gridDim/block.x, gridDim/block.y, 1);
  performanceStorageKernel<<<gridPerformances,block>>>(_dSeries, _dObjects, _dNbObjects, _dPerformances );

  // Reset memory
  cudaMemset( _dFrequencies, 0, _dNbFrequencies*sizeof(int) );

  // compute Frequencies
  gridDim = (int)sqrt(float(_dNbSeries));
  dim3 gridFrequencies( gridDim/block.x, gridDim/block.y, 1);
  frequenciesKernel<<<gridFrequencies,block>>>( _dSeries, range, _dFrequencies, _dNbFrequencies );
}
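The launch geometry deserves a note: with 8x8 blocks and a square grid of side sqrt(nbSeries)/8, every series gets exactly one thread only when nbSeries is a perfect square whose root is a multiple of 8 (e.g. 65536). A small host-side sketch of that arithmetic (names are mine):

```cpp
#include <cmath>

// Launch geometry used by run_kernel: an 8x8 block and a square grid
// sized so that (gridX*8) * (gridY*8) == nbSeries.
struct Launch { int gridX, gridY, blockX, blockY; };

Launch launchConfig(int nbSeries)
{
    const int bx = 8, by = 8;
    int side = (int)std::sqrt((float)nbSeries); // threads per grid side
    return Launch{ side / bx, side / by, bx, by };
}

int totalThreads(const Launch& l)
{
    return l.gridX * l.blockX * l.gridY * l.blockY;
}
```

For values of nbSeries that do not meet this constraint, some series would never be computed, so sizes should be chosen accordingly.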

/*
* @brief Copies the results back to the Host.
* This function is executed on the Host.
* @param hOutFrequencies Pointer to the output distribution (host memory)
* @param hOutPerformances Pointer to the cumulative performance path (host memory)
* @param hOutSeries Pointer to the series performances (host memory)
*/
extern "C"
void device_to_host( int* hOutFrequencies, float* hOutPerformances, float* hOutSeries )
{
  cudaMemcpy( hOutFrequencies, _dFrequencies, _dNbFrequencies*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy( hOutPerformances, _dPerformances, _dNbObjects*sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy( hOutSeries, _dSeries, _dNbSeries*sizeof(float), cudaMemcpyDeviceToHost);
}

/*
* @brief Releases the buffers on the GPU device.
* This function is executed on the Host.
*/
extern "C"
void destroy_device()
{
  cudaFree( _dObjects );
  cudaFree( _dFrequencies );
  cudaFree( _dSeries );
  cudaFree( _dPerformances );
}


Sunday, December 11, 2011

Hi all,

Based on the fluid demo provided by NVIDIA in the GPU Computing SDK, I added Kinect support in order to interact with the particles. This demo mixes OpenGL, CUDA and Kinect technologies.

The source code is a 32-bit Visual Studio 2010 project and can be downloaded here: KinectFluids.zip

Sunday, October 9, 2011

My tribute to Steve


I have to be honest: I've never been an Apple fan, not geeky enough for me I guess, but I'll definitely miss the man. I coded this short and simple tribute movie, not only for Steve, but also because this is the first time in the short history of computer science that we have lost such an emblematic icon. May Steve rest in peace (I'm pretty sure he'll get the people up there to buy an iPhone 4S!)

Thursday, June 30, 2011

Power Textures

Using textures to compute intersections allows the creation of complex objects with minimal computation.

Monday, June 20, 2011

When Cuda meets Kinect...

When CUDA meets Kinect, realtime raytracing takes on a whole new dimension. Thanks to Microsoft, the Kinect SDK allows the creation of human 3D structures that can be inserted into the real-time ray-tracing engine.

Friday, June 10, 2011

Realtime raytracing including Kinect video capture

Hi, I finally made it! I wanted to introduce real-time video capture into my raytraced world, and there it is. It still runs pretty smoothly on my NVIDIA GTX 260 (around 25fps at 1280x720). This was achieved using Kinect, mainly because it's much simpler to get an array of bytes from this device than through the default DirectShow stuff.




C++ code for Kinect initialization:

// Variables

HANDLE m_hNextDepthFrameEvent; 
HANDLE m_hNextVideoFrameEvent;
HANDLE m_hNextSkeletonEvent;
HANDLE m_pVideoStreamHandle;
HANDLE m_pDepthStreamHandle;
HANDLE m_skeletons;

char* m_hVideo;
char* m_hDepth;

// Initialize Kinect
NuiInitialize( NUI_INITIALIZE_FLAG_USES_DEPTH_AND_PLAYER_INDEX | NUI_INITIALIZE_FLAG_USES_SKELETON | NUI_INITIALIZE_FLAG_USES_COLOR);

m_hNextDepthFrameEvent = CreateEvent( NULL, TRUE, FALSE, NULL ); 
m_hNextVideoFrameEvent = CreateEvent( NULL, TRUE, FALSE, NULL ); 
m_hNextSkeletonEvent   = CreateEvent( NULL, TRUE, FALSE, NULL );

m_skeletons = CreateEvent( NULL, TRUE, FALSE, NULL );  
NuiSkeletonTrackingEnable( m_skeletons, NUI_SKELETON_TRACKING_FLAG_ENABLE_SEATED_SUPPORT );

NuiImageStreamOpen( NUI_IMAGE_TYPE_COLOR,                  NUI_IMAGE_RESOLUTION_640x480, 0, 2, m_hNextVideoFrameEvent, &m_pVideoStreamHandle );
NuiImageStreamOpen( NUI_IMAGE_TYPE_DEPTH_AND_PLAYER_INDEX, NUI_IMAGE_RESOLUTION_320x240, 0, 2, m_hNextDepthFrameEvent, &m_pDepthStreamHandle );

NuiCameraElevationSetAngle( 0 );

For each iteration, the Kinect frames need to be retrieved:



// Video
const NUI_IMAGE_FRAME* pImageFrame = 0;
WaitForSingleObject (m_hNextVideoFrameEvent,INFINITE); 
HRESULT status = NuiImageStreamGetNextFrame( m_pVideoStreamHandle, 0, &pImageFrame ); 
if(( status == S_OK) && pImageFrame ) 
{
  INuiFrameTexture* pTexture = pImageFrame->pFrameTexture;
  NUI_LOCKED_RECT LockedRect;
  pTexture->LockRect( 0, &LockedRect, NULL, 0 ) ; 
  if( LockedRect.Pitch != 0 ) 
  {
    m_hVideo = (char*) LockedRect.pBits;
  }
}

// Depth
const NUI_IMAGE_FRAME* pDepthFrame = 0;
WaitForSingleObject (m_hNextDepthFrameEvent,INFINITE); 
status = NuiImageStreamGetNextFrame( m_pDepthStreamHandle, 0, &pDepthFrame ); 
if(( status == S_OK) && pDepthFrame ) 
{
  INuiFrameTexture* pTexture = pDepthFrame->pFrameTexture;
  if( pTexture ) 
  {
    NUI_LOCKED_RECT LockedRectDepth;
    pTexture->LockRect( 0, &LockedRectDepth, NULL, 0 ) ; 
    if( LockedRectDepth.Pitch != 0 ) 
    {
      m_hDepth = (char*) LockedRectDepth.pBits;
    }
  }
}
NuiImageStreamReleaseFrame( m_pVideoStreamHandle, pImageFrame ); 
NuiImageStreamReleaseFrame( m_pDepthStreamHandle, pDepthFrame ); 

Finally, the resources should be released:

CloseHandle(m_hNextDepthFrameEvent); 
CloseHandle(m_hNextVideoFrameEvent); 
CloseHandle(m_hNextSkeletonEvent);
NuiShutdown();

Wednesday, May 4, 2011

Kinect playground

Here is a simple video demonstrating the use of Microsoft's Kinect for realtime 3D fun:

Real-time ray-tracing

I've always been fascinated by computer images, and my dream was to be able to create one of them with my own programs. After a few years of research, my goal is still the same. I started by drawing 3D math functions with hidden and lighted faces using basic algorithms, but that was not realistic enough for me. I've always thought that ray-tracing was the best technology ever, so I decided to write my first ray-tracer during the last year of my B.Sc. in computer studies. The first implementation was done in C++ and it used to take ages to compute a single image. But that was in the late 90's, and things have changed in recent years with the introduction of GPGPU programming. So yes, my old dream has come true: thanks to NVIDIA, realtime ray-tracing can be achieved on modern graphics cards, and here is a sample of my first results. This 1024x768 image takes 15 milliseconds to compute on an NVIDIA GTX 260 card... just unbelievable... and it's not even optimized.

So keep in touch for new exciting updates :)