Commit 3e3ba8f8 authored by slefrancois

Correct using_gpu tutorial

Parent 21ae3bd0
@@ -19,6 +19,12 @@ There are two ways currently to use a gpu, one that should support any OpenCL
device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend that
only supports NVIDIA cards (:ref:`cuda`).
.. warning::

   If you want to use the new GpuArray backend, make sure to have the
   development version of Theano installed. The 0.8.X releases have not
   been optimized to work correctly with the new backend.
.. _gpuarray:

GpuArray Backend
----------------
@@ -73,7 +79,7 @@ Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
   else:
       print('Used the gpu')
The program just computes ``exp()`` of a bunch of random numbers. Note
that we use the :func:`theano.shared` function to make sure that the
input *x* is stored on the GPU.
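For concreteness, the workload being timed is just an element-wise exponential over a long vector. Stripped of Theano's graph compilation and device placement, a plain NumPy sketch of the same computation looks like this (the vector length and iteration count here are illustrative; the tutorial script defines its own):

```python
import time
import numpy

vlen = 10 * 30 * 768  # illustrative vector length
iters = 10            # illustrative; the tutorial script loops 1000 times

rng = numpy.random.RandomState(22)
# float32 mirrors what config.floatX gives when floatX=float32 is set
x = rng.rand(vlen).astype('float32')

t0 = time.time()
for _ in range(iters):
    r = numpy.exp(x)
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
```

Theano compiles the same expression once into a callable function, so on the GPU each call dispatches a single elementwise kernel rather than executing this Python-level loop body.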
@@ -88,21 +94,22 @@ input *x* is stored on the GPU.
.. code-block:: none
   $ THEANO_FLAGS=device=cpu python gpu_tutorial1.py
   [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
   Looping 1000 times took 2.271284 seconds
   Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
   Used the cpu

   $ THEANO_FLAGS=device=cuda0 python gpu_tutorial1.py
   Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
   [GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
   Looping 1000 times took 1.202734 seconds
   Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
   Used the gpu
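As an aside, the final ``Used the cpu`` / ``Used the gpu`` message is derived from the optimized graph printed on the first line of each run: the script reports the GPU only when no plain (non-``Gpu``) ``Elemwise`` node remains. That decision can be sketched over the printed op names as follows (``used_gpu`` is an illustrative helper, not a Theano function):

```python
def used_gpu(op_names):
    # GPU was used if no plain (non-Gpu) Elemwise op is left in the graph
    return not any(name.startswith('Elemwise{') for name in op_names)

# Op names as printed by f.maker.fgraph.toposort() in the two runs above
cpu_run = ['Elemwise{exp,no_inplace}']
gpu_run = ['GpuElemwise{exp,no_inplace}', 'HostFromGpu(gpuarray)']

print('Used the gpu' if used_gpu(gpu_run) else 'Used the cpu')  # Used the gpu
print('Used the gpu' if used_gpu(cpu_run) else 'Used the cpu')  # Used the cpu
```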
Returning a Handle to Device-Allocated Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -126,8 +133,7 @@ the GPU object directly. The following code is modified to do just that.
   rng = numpy.random.RandomState(22)
   x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
   f = function([], tensor.exp(x).transfer('dev0'))
   print(f.maker.fgraph.toposort())
   t0 = time.time()
   for i in range(iters):
@@ -142,9 +148,9 @@ the GPU object directly. The following code is modified to do just that.
   else:
       print('Used the gpu')
Here ``tensor.exp(x).transfer('dev0')`` means "copy the value of ``exp(x)``
to the GPU" (here, to the context named ``dev0``); passing ``None`` instead
selects the default GPU context when one is not explicitly given.
For information on how to set GPU contexts, see :ref:`tut_using_multi_gpu`.
The output is
@@ -160,15 +166,14 @@ The output is
.. code-block:: none
   $ THEANO_FLAGS=device=cuda0 python gpu_tutorial2.py
   Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
   [GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
   Looping 1000 times took 0.089194 seconds
   Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
     1.62323285]
   Used the gpu
While the time per call appears to be much lower than the two previous
invocations (and should indeed be lower, since we avoid a transfer),
the massive speedup we obtained is in part due to the asynchronous nature
@@ -217,33 +222,24 @@ Tips for Improving Performance on GPU
* Consider adding ``floatX=float32`` (or the type you are using) to your
  ``.theanorc`` file if you plan to do a lot of GPU work.
* The GPU backend supports *float64* variables, but they are still slower
  to compute than *float32*. The more *float32*, the better GPU performance
  you will get.
* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` to
  ``dmatrix``, ``dvector`` and ``dscalar``, because the former will give
  you *float32* variables when ``floatX=float32``, while the latter always
  give *float64* variables and ignore the type set in ``floatX``.
* Minimize transfers to the GPU device by using ``shared`` variables
  to store frequently-accessed data (see :func:`shared()<shared.shared>`).
  When using the GPU, tensor ``shared`` variables are stored on
  the GPU by default to eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try running your script with
  the ``profile=True`` flag. This should print some timing information at program
  termination. Is time being used sensibly? If an op or Apply is
  taking more time than its share, then if you know something about GPU
  programming, have a look at how it's implemented in theano.gpuarray.
  Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and
  Xs(X%) in transfer op*. This can tell you if not enough of your graph is
  on the GPU or if there is too much memory transfer.
* To investigate whether all the Ops in the computational graph are
  running on GPU, it is possible to debug or check your code by providing
  a value to the `assert_no_cpu_op` flag, i.e. `warn` for a warning, `raise` for
...
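Several of the flags from the tips above can be made permanent through ``.theanorc``. A minimal sketch (the device name is illustrative; adjust it to your setup):

```cfg
[global]
device = cuda0
floatX = float32

# warn whenever an op in the graph falls back to the CPU
assert_no_cpu_op = warn
```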