c++ - cuda convolution mapping -
i'm trying copy each block of threads patch of image , relative apron shared memory.
after data copyied(i used matrix) shared memory, want relations map center of mask in shared memory consider convolution , center of mask in image buffer.
i want because if try convolution of image seems center of mask in shared memory doesn't correspond center in image buffer stored in global memory.
in code below write example of simple image black , white erosion algorithm , when put result of convolution output image seems center not corresponds.
i write sample using 512x512px image
my settings are:
//block , grid size dim3 block(16,16); dim3 grid(512/(block.x),512/(block.y),1);
this kernel:
#define strel_size 5 #define tile_w 16 #define tile_h 16 #define r (strel_size/2) //size of tile image + apron #define block_w (tile_w+(2*r)) #define block_h (tile_h+(2*r)) __global__ void erode_multiple_img_sm_v2(unsigned char * buffer_in, unsigned char * buffer_out, int w,int h ){ // data cache: threadidx.x , threadidx.y __shared__ unsigned char data[tile_w +strel_size ][tile_h +strel_size ]; int col = blockidx.x * blockdim.x + threadidx.x; int row = blockidx.y * blockdim.y + threadidx.y; // global mem address of thread int gloc = row*w +col; int x, y; // image based coordinate if((col<w)&&(row<h)) { data[threadidx.x][threadidx.y]=buffer_in[gloc]; if (threadidx.y > (h-strel_size)) data[threadidx.x][threadidx.y + strel_size]=buffer_in[gloc + strel_size]; if (threadidx.x >(w-strel_size)) data[threadidx.x + strel_size][threadidx.y]=buffer_in[gloc+strel_size]; if ((threadidx.x >(w-strel_size)) && (threadidx.y > (h-strel_size))) data[threadidx.x+strel_size][threadidx.y+strel_size] = buffer_in[gloc+2*strel_size]; //wait threads finish read __syncthreads(); unsigned char min_value = 255; for(x=0;x<strel_size;x++){ for(y=0;y<strel_size;y++){ min_value = min( (data[threadidx.x+x][threadidx.y+y]) , min_value); } } buffer_out[gloc]= min_value; } }
my input image:
my output of kernel is:
where w width of image,and equal 512, h height of image,and equal 512.
i call kernel with:
erode_multiple_img_sm<<<grid,block>>>(dimage_src,dimage_dst,512,512);
the dimage_src input image array buffer not matrix, , dimage_dst output image buffer.
each buffer have size of nelem * nimg * sizeof(unsigned char) nelem=512*512 size of buffer , nimg number of image want processing in case equal 1. i'm wrong?
code update:
__global__ void erode_multiple_img_sm_v2(unsigned char * buffer_in, unsigned char * buffer_out, int w,int h ){ // data cache: threadidx.x , threadidx.y __shared__ unsigned char data[tile_w + strel_size-1 ][tile_h + strel_size-1 ]; // global mem address of thread int col = blockidx.x * blockdim.x + threadidx.x; int row = blockidx.y * blockdim.y + threadidx.y; int gloc = row*w +col; // each threads loads 4 values global memory shared mem int x, y; // image based coordinate if((col<w)&&(row<h)) { data[threadidx.x][threadidx.y] = buffer_in[gloc]; if (threadidx.y > (tile_h-strel_size+1)) data[threadidx.x][threadidx.y + strel_size-1]=buffer_in[(row + strel_size-1)*w + col]; if (threadidx.x > (tile_w-strel_size+1)) data[threadidx.x + strel_size-1][threadidx.y] = buffer_in[row*w+col + strel_size-1]; if ((threadidx.x > (tile_w-strel_size+1)) && (threadidx.y > (tile_h-strel_size+1))) data[threadidx.x + strel_size-1][threadidx.y + strel_size-1] = buffer_in[(row + strel_size-1)*w + col + strel_size-1]; //wait threads finish read __syncthreads(); unsigned char min_value = 255; for(x=0;x<strel_size;x++){ for(y=0;y<strel_size;y++){ min_value = min( (data[threadidx.x+x][threadidx.y+y]) , min_value); } } buffer_out[gloc]= min_value; } }
my output is:
updated 2(version 2 -working-):
i have implemented version of algorithm.to follow slide found useful , explained,in particular part in wich author talk convolution slide 27.
i change block , grid settings :
dim3 block(20,20); dim3 grid(512/(block.x)+ block.x,512/(block.y)+block.y);
the kernel call instead ramain same:
erode_multiple_img_sm<<<grid,block>>>(dimage_src,dimage_dst,512,512);
where argument of kernel are:
- dimage_src: buffer of unsigned char size height x width contain input image.
- dimage_dst:**buffer of unsigned char size **height x width contain output image, kernel produced.
- 512: third argument width of image.
- 512: fourth argument height of image.
remember image sample black , white version of erosion can work grayscale too.
here working kernel:
#define strel_w 5 #define strel_h 5 #define strel_size 5 #define tile_w 16 #define tile_h 16 #define r (strel_size/2) #define block_w (tile_w+(2*r)) #define block_h (tile_h+(2*r)) __global__ void erode_multiple_img_working(unsigned char * buffer_in, unsigned char * buffer_out, int w,int h ){ __shared__ unsigned char fast_acc_mat[block_w][block_h]; int ty = threadidx.y; int tx = threadidx.x; int row_o = blockidx.y * tile_w + ty; int col_o = blockidx.x * tile_h + tx; int row_i = row_o - r; int col_i = col_o - r; //in of img size if((row_i >= 0) && (row_i < h) && (col_i >= 0) && (col_i < w) ){ fast_acc_mat[ty][tx] = buffer_in[ row_i * w + col_i]; } else{ fast_acc_mat[ty][tx] = 0; } __syncthreads(); if( ty < tile_h && tx < tile_w ){ unsigned char min_val=255; for(int = 0; < strel_size; i++) { for(int j = 0; j < strel_size; j++) { min_val = min( fast_acc_mat[i+ty][j+tx] , min_val ); } } if(row_o < h && col_o < w) buffer_out[row_o * w + col_o] = min_val; } }
and eroded image(output):
i realized scheme show how part of algorithm described eric load pixel of tile in shared memory :
you need [20][20] shared mem, rather [21][21]. should changed to
__shared__ unsigned char data[tile_w + strel_size-1][tile_h + strel_size-1];
another problem data loading. correct way read (16+4) x (16+4) pixels input share memory, using (16 x 16) threads collaboratively. can divided 4 parts:
1)first part: thread(0:15, 0:15) load pixels (0:15,0:15)
2)second part: thread(0:15,12:15) load pixels (0:15, 16:19)
3)third part: thread(12:15,0:15) load pixels (16:19,0:15)
4)fourth part: thread(12:15,12:15) load pixels (16:19,16:19)
but in code messing indexing. part 2~4, of threads in thread block working, , additional boundary checking required.
for 2nd part, should use thread(0:15, 12:15) read pixel(0:15, 16:19) as
if (threadidx.y > (tile_h-strel_size)) data[threadidx.x][threadidx.y + strel_size-1] = row + strel_size-1<h ? buffer_in[(row + strel_size-1)*w + col] : 0;
the 3rd , 4th part require similar modifications as
if (threadidx.x > (tile_w-strel_size)) data[threadidx.x + strel_size-1][threadidx.y] = col + strel_size-1<w ? buffer_in[row*w+col + strel_size-1] : 0; if ((threadidx.x > (tile_w-strel_size)) && (threadidx.y > (tile_h-strel_size))) data[threadidx.x + strel_size-1][threadidx.y + strel_size-1] = (row + strel_size-1<h && col + strel_size-1<w) ? buffer_in[(row + strel_size-1)*w + col + strel_size-1] : 0;
then should able correct result image, although there 2x2 pixel shift, because convolution on (0...4, 0...4) rather (-2. .2, -2...2).
for more details, read
http://igm.univ-mlv.fr/~biri/enseignement/mii2/donnees/convolutionseparable.pdf
Comments
Post a Comment