Commit 4c6bf137 authored by lamblin

Merge pull request #920 from nouiz/gpu_async

Use the async gpu kernel call by default.
...@@ -31,6 +31,17 @@ You can enable faster gcc optimization with the ``cxxflags``. This list of flags
Use it at your own risk. Some people have warned that the ``-ftree-loop-distribution`` optimization produced wrong results in the past.
Also, the ``-march=native`` flag must be used with care if you use NFS. In that case, you MUST set the compiledir to a local path on the computer.
Faster Theano function
----------------------
You can set the Theano flag `allow_gc` to `False` to get a speed-up at the
cost of using more memory. By default, Theano frees intermediate results as
soon as they are no longer needed, which prevents that memory from being
reused. Disabling the gc keeps the memory of all intermediate results alive,
so it can be reused during the next call to the same Theano function,
provided the shapes still match. The shapes can change if the shapes of the
inputs change.
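The reuse strategy that ``allow_gc=False`` enables can be sketched in plain
Python with NumPy. This is only an illustration of the idea, not Theano's
actual allocator; all names here (``get_buffer``, ``release_buffer``,
``_pool``) are hypothetical:

```python
import numpy as np

# A pool of intermediate buffers keyed by shape: instead of freeing a
# buffer, keep it so a later call with the same shape can reuse it.
_pool = {}

def get_buffer(shape):
    """Return a pooled buffer for `shape`, or allocate a fresh one."""
    bufs = _pool.get(shape)
    if bufs:
        return bufs.pop()       # reuse: no new allocation
    return np.empty(shape)      # first call or shape changed: allocate

def release_buffer(buf):
    """Instead of freeing, keep the buffer for the next call."""
    _pool.setdefault(buf.shape, []).append(buf)

# First call allocates; after release, a same-shaped request reuses
# the memory, while a different shape forces a fresh allocation.
a = get_buffer((3, 4))
release_buffer(a)
b = get_buffer((3, 4))
assert b is a                   # same buffer reused
c = get_buffer((5, 6))
assert c is not a               # different shape: new allocation
```

The trade-off is the same as for the flag: pooled buffers are never
returned to the system, so memory usage grows to the peak working set.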
Related Projects
----------------
...
...@@ -256,13 +256,13 @@ what to expect right now:
that data. Getting GPU performance largely hinges on making data transfer to
the device pay off.
Tips for Improving Performance on GPU
-------------------------------------
* Consider
  adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of
  GPU work.
* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`.
* Prefer
  constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and
  ``dscalar``, because the former will give you *float32* variables when
...@@ -285,6 +285,25 @@ Tips for Improving Performance on GPU
  This can tell you if not enough of your graph is on the GPU or if there
  is too much memory transfer.
.. _gpu_async:

GPU Async capabilities
----------------------
Since Theano 0.6, we have used the asynchronous capabilities of GPUs. This
makes execution faster, but some errors may be raised later than the point
where they actually occurred, which can make profiling of Theano apply
nodes difficult. An NVIDIA driver feature helps with these issues: if you
set the environment variable ``CUDA_LAUNCH_BLOCKING=1``, all kernel calls
are automatically synchronized. This reduces performance, but gives
accurate profiling and correctly placed error messages.
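Note that the driver reads ``CUDA_LAUNCH_BLOCKING`` when CUDA is
initialized, so it must be set before the first import that touches the
GPU. A minimal sketch:

```python
import os

# Must be set before CUDA is initialized, i.e. before importing theano
# (or any other module that initializes the CUDA driver).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import theano  # would now run with every kernel call synchronized

assert os.environ["CUDA_LAUNCH_BLOCKING"] == "1"
```

Alternatively, export the variable in the shell before launching Python,
which avoids any ordering concern inside the script.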
This feature interacts with Theano's garbage collection of intermediate
results. To get the most out of it, you need to disable the gc, as the gc
inserts synchronization points in the graph. Set the Theano flag
``allow_gc=False`` for even more speed, at the cost of increased memory
usage.
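For example, the two settings can be combined on the command line (a
sketch; ``script.py`` stands in for your own program):

```shell
# Fast runs: disable the gc; memory usage will be higher.
THEANO_FLAGS=allow_gc=False python script.py

# Profiling/debugging runs: synchronize every kernel call instead.
CUDA_LAUNCH_BLOCKING=1 python script.py
```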
Changing the Value of Shared Variables
--------------------------------------
...@@ -606,9 +625,3 @@ have to be jointly optimized explicitly in the code.)
Modify and execute to support *stride* (i.e. so as not to constrain the input to be *C-contiguous*).
...@@ -88,6 +88,11 @@ int device_free(void *ptr)
if(!g_gpu_context_active) {
    return 0;
}
// We need to sync because Theano's GC could free an intermediate variable
// that is still needed by gpu kernels that are running or queued.
cudaThreadSynchronize();
cudaError_t err = cudaFree(ptr);
if (cudaSuccess != err)
{
...@@ -1044,6 +1049,12 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
//-10 could be any value different from 0.
int cpu_err_var=-10;
// We are not 100% sure that cudaMemcpy waits until the async gpu kernels
// are finished before doing the transfer, so we add this explicit sync. It
// is pretty fast: in a Python loop, 1 000 000 calls ran in 1 second.
// It is better to be safe and not significantly slower than to be unsafe.
cudaThreadSynchronize();
err = cudaMemcpy(&cpu_err_var, err_var, sizeof(int),
                 cudaMemcpyDeviceToHost);
if (cudaSuccess != err) {
...@@ -2446,6 +2457,16 @@ GetDeviceMemInfo(PyObject* _unused, PyObject* dummy)
    return PyTuple_Pack(2, PyLong_FromLong(free), PyLong_FromLong(total));
}
/*
 * Synchronize with all the gpu device streams.
 */
PyObject *
CudaNdarray_synchronize(PyObject* _unused, PyObject* dummy)
{
cudaThreadSynchronize();
Py_INCREF(Py_None);
return Py_None;
}
#if COMPUTE_GPU_MEM_USED
/*
 * Return the size in bytes that Theano currently has allocated on the gpu.
...@@ -2953,6 +2974,7 @@ static PyMethodDef module_methods[] = {
{"filter", filter, METH_VARARGS, "filter(obj, broadcastable, strict, storage) returns a CudaNdarray initialized to obj if it matches the constraints of broadcastable. strict=True prevents any numeric casting. If storage is a CudaNdarray it may be overwritten and used as the return value."},
{"outstanding_mallocs", outstanding_mallocs, METH_VARARGS, "how many more mallocs have been called than free's"},
{"from_gpu_pointer", CudaNdarray_from_gpu_pointer, METH_VARARGS, "Used to create a CudaNdarray from already allocated memory on the gpu (for example by pycuda)."},
{"synchronize", CudaNdarray_synchronize, METH_NOARGS, "Used to synchronize the device"},
{NULL, NULL, NULL, NULL} /* Sentinel */
};
...
...@@ -27,7 +27,7 @@ typedef float real;
#define NUM_VECTOR_OP_THREADS_PER_BLOCK 256 //Should be read from device properties. (#10)
#endif
#if 1
// Do not wait after every kernel & transfer.
#define CNDA_THREAD_SYNC
#else
...