Commit 4c6bf137 authored by lamblin

Merge pull request #920 from nouiz/gpu_async

Use the async gpu kernel call by default.
...@@ -31,6 +31,17 @@ You can enable faster gcc optimization with the ``cxxflags``. This list of flags
Use it at your own risk. Some people have warned that the ``-ftree-loop-distribution`` optimization produced wrong results in the past.
Also, the ``-march=native`` flag must be used with care if you use NFS. In that case, you MUST set the compiledir to a local path on the computer.
Faster Theano function
----------------------
You can set the Theano flag `allow_gc` to `False` to get a speed-up at the
cost of using more memory. By default, Theano frees intermediate results as
soon as they are no longer needed, which prevents that memory from being
reused. Disabling the gc keeps the memory of all intermediate results alive,
so it can be reused during the next call to the same Theano function,
provided the shapes still match. The shapes can change if the shapes of the
inputs change.
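The reuse strategy that ``allow_gc=False`` enables can be sketched in plain
Python with NumPy. This is only an illustration of the idea, not Theano's
actual allocator; all names here (``get_buffer``, ``release_buffer``,
``_pool``) are hypothetical:

```python
import numpy as np

# A pool of intermediate buffers keyed by shape: instead of freeing a
# buffer, keep it so a later call with the same shape can reuse it.
_pool = {}

def get_buffer(shape):
    """Return a pooled buffer for `shape`, or allocate a fresh one."""
    bufs = _pool.get(shape)
    if bufs:
        return bufs.pop()       # reuse: no new allocation
    return np.empty(shape)      # first call or shape changed: allocate

def release_buffer(buf):
    """Instead of freeing, keep the buffer for the next call."""
    _pool.setdefault(buf.shape, []).append(buf)

# First call allocates; after release, a same-shaped request reuses
# the memory, while a different shape forces a fresh allocation.
a = get_buffer((3, 4))
release_buffer(a)
b = get_buffer((3, 4))
assert b is a                   # same buffer reused
c = get_buffer((5, 6))
assert c is not a               # different shape: new allocation
```

The trade-off is the same as for the flag: pooled buffers are never
returned to the system, so memory usage grows to the peak working set.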
Related Projects
----------------
...
...@@ -256,13 +256,13 @@ what to expect right now:
that data. Getting GPU performance largely hinges on making data transfer to
the device pay off.
Tips for Improving Performance on GPU
-------------------------------------
* Consider
  adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of
  GPU work.
* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`.
* Prefer
  constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and
  ``dscalar``, because the former will give you *float32* variables when
...@@ -285,6 +285,25 @@ Tips for Improving Performance on GPU
  This can tell you if not enough of your graph is on the GPU or if there
  is too much memory transfer.
.. _gpu_async:

GPU Async capabilities
----------------------
Since Theano 0.6, we have used the asynchronous capabilities of GPUs. This
makes execution faster, but some errors may be raised later than the point
where they actually occurred, which can make profiling of Theano apply
nodes difficult. An NVIDIA driver feature helps with these issues: if you
set the environment variable ``CUDA_LAUNCH_BLOCKING=1``, all kernel calls
are automatically synchronized. This reduces performance, but gives
accurate profiling and correctly placed error messages.
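Note that the driver reads ``CUDA_LAUNCH_BLOCKING`` when CUDA is
initialized, so it must be set before the first import that touches the
GPU. A minimal sketch:

```python
import os

# Must be set before CUDA is initialized, i.e. before importing theano
# (or any other module that initializes the CUDA driver).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import theano  # would now run with every kernel call synchronized

assert os.environ["CUDA_LAUNCH_BLOCKING"] == "1"
```

Alternatively, export the variable in the shell before launching Python,
which avoids any ordering concern inside the script.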
This feature interacts with Theano's garbage collection of intermediate
results. To get the most out of it, you need to disable the gc, as the gc
inserts synchronization points in the graph. Set the Theano flag
``allow_gc=False`` for even more speed, at the cost of increased memory
usage.
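For example, the two settings can be combined on the command line (a
sketch; ``script.py`` stands in for your own program):

```shell
# Fast runs: disable the gc; memory usage will be higher.
THEANO_FLAGS=allow_gc=False python script.py

# Profiling/debugging runs: synchronize every kernel call instead.
CUDA_LAUNCH_BLOCKING=1 python script.py
```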
Changing the Value of Shared Variables
--------------------------------------
...@@ -606,9 +625,3 @@ have to be jointly optimized explicitly in the code.)
Modify and execute to support *stride* (i.e. so as not to constrain the input to be *C-contiguous*).
...@@ -88,6 +88,11 @@ int device_free(void *ptr)
if(!g_gpu_context_active) {
    return 0;
}
// We need to sync because Theano's GC could free an intermediate variable
// that is still needed by gpu kernels that are running or queued.
cudaThreadSynchronize();
cudaError_t err = cudaFree(ptr);
if (cudaSuccess != err)
{
...@@ -1044,6 +1049,12 @@ CudaNdarray_TakeFrom(CudaNdarray * self, PyObject *args){
//-10 could be any value different from 0.
int cpu_err_var=-10;
// We are not 100% sure that cudaMemcpy waits until the async gpu kernels
// are finished before doing the transfer, so we add this explicit sync. It
// is pretty fast: in a Python loop, 1 000 000 calls ran in 1 second.
// It is better to be safe and not significantly slower than to be unsafe.
cudaThreadSynchronize();
err = cudaMemcpy(&cpu_err_var, err_var, sizeof(int),
                 cudaMemcpyDeviceToHost);
if (cudaSuccess != err) {
...@@ -2446,6 +2457,16 @@ GetDeviceMemInfo(PyObject* _unused, PyObject* dummy)
    return PyTuple_Pack(2, PyLong_FromLong(free), PyLong_FromLong(total));
}
/*
 * Synchronize with all the gpu device streams.
 */
PyObject *
CudaNdarray_synchronize(PyObject* _unused, PyObject* dummy)
{
cudaThreadSynchronize();
Py_INCREF(Py_None);
return Py_None;
}
#if COMPUTE_GPU_MEM_USED
/*
 * Return the size in bytes that Theano currently has allocated on the gpu.
...@@ -2953,6 +2974,7 @@ static PyMethodDef module_methods[] = {
{"filter", filter, METH_VARARGS, "filter(obj, broadcastable, strict, storage) returns a CudaNdarray initialized to obj if it matches the constraints of broadcastable. strict=True prevents any numeric casting. If storage is a CudaNdarray it may be overwritten and used as the return value."},
{"outstanding_mallocs", outstanding_mallocs, METH_VARARGS, "how many more mallocs have been called than free's"},
{"from_gpu_pointer", CudaNdarray_from_gpu_pointer, METH_VARARGS, "Used to create a CudaNdarray from already allocated memory on the gpu (for example by pycuda)."},
{"synchronize", CudaNdarray_synchronize, METH_NOARGS, "Used to synchronize the device"},
{NULL, NULL, NULL, NULL} /* Sentinel */
};
...
...@@ -27,7 +27,7 @@ typedef float real;
#define NUM_VECTOR_OP_THREADS_PER_BLOCK 256 //Should be read from device properties. (#10)
#endif
#if 1
// Do not wait after every kernel & transfer.
#define CNDA_THREAD_SYNC
#else
...