Commit c1463e24 authored by Frederic Bastien

Reenable the automatic detection of when not to load the full kernel for the…

Reenable the automatic detection of when not to load the full kernel for the GPU conv conv_patch_stack_reduce. This just re-enables a speed-up.
parent fcc96a54
@@ -363,7 +363,7 @@ class GpuConv(Op):
         return ['cuda_ndarray.cuh','<stdio.h>']
     def c_code_cache_version(self):
-        return (0,10) # raise this whenever modifying any of the support_code_files
+        return (0,11) # raise this whenever modifying any of the support_code_files
     def c_support_code_apply(self, node, nodename):
         # REMEMBER TO RAISE c_code_cache_version when changing any of these files
...
@@ -449,6 +449,11 @@ CudaNdarray_conv_valid(const CudaNdarray *img, const CudaNdarray * kern,
     if(version==8||version==13) nb_split++;//force the split.
     if(version==13)full_kern=false;
+    //check if we can fit the full kernel in the shared memory
+    if(sizeof(float)*std::max(img_size + kern_size, out_size*2) > shared_avail){
+        full_kern = false;
+    }
     //thread_z is going to be ceil_intdiv(kern_len, nb_split)
     // we need enough splits so that
     // a) thread_z fits in the 'z' threadIdx (i.e. is less than 64)
...