testgroup / pytensor · Commits

Commit ec0419a6
Authored May 30, 2016 by Pascal Lamblin
Merge pull request #4500 from slefrancois/gpu_out_sandbox
Update doc with instructions for using new gpu backend
Parents: 0044349f, 974bd517
Showing 14 changed files with 66 additions and 102 deletions (+66 −102)
.gitignore                                +2   −0
doc/extending/extending_theano.txt        +2   −2
doc/install.txt                           +10  −8
doc/install_ubuntu.txt                    +11  −11
doc/install_windows.txt                   +7   −5
doc/library/config.txt                    +19  −19
doc/optimizations.txt                     +1   −1
doc/tutorial/aliasing.txt                 +0   −47
doc/tutorial/using_gpu.txt                +0   −0
doc/tutorial/using_gpu_solution_1.py      +0   −0
doc/tutorial/using_multi_gpu.txt          +3   −3
theano/configdefaults.py                  +3   −4
theano/misc/check_blas.py                 +5   −0
theano/sandbox/gpuarray/__init__.py       +3   −2
.gitignore

@@ -37,3 +37,4 @@ Theano.suo
 .ipynb_checkpoints
 .pydevproject
 .ropeproject
+core
\ No newline at end of file
doc/extending/extending_theano.txt

@@ -681,8 +681,8 @@ For instance, to verify the Rop method of the DoubleOp, you can use this:
 Testing GPU Ops
 ^^^^^^^^^^^^^^^

-Ops to be executed on the GPU should inherit from the
-``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
+When using the old GPU backend, Ops to be executed on the GPU should inherit
+from ``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
 Theano to distinguish them. Currently, we use this to test if the
 NVIDIA driver works correctly with our sum reduction code on the GPU.
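The inheritance rule above can be sketched in plain Python. The class names below are stand-ins for the Theano classes, not imports from Theano:

```python
# Hypothetical stand-ins for theano.Op and theano.sandbox.cuda.GpuOp,
# illustrating how a dedicated base class lets Theano distinguish
# GPU Ops from CPU Ops with a simple isinstance check.
class Op:
    """Stand-in for theano.Op."""

class GpuOp(Op):
    """Stand-in for theano.sandbox.cuda.GpuOp (old GPU backend)."""

class GpuSum(GpuOp):
    """A GPU-capable Op."""

class CpuSum(Op):
    """A CPU-only Op."""

def runs_on_gpu(op):
    # The check Theano can rely on once GPU Ops share a base class.
    return isinstance(op, GpuOp)

print(runs_on_gpu(GpuSum()), runs_on_gpu(CpuSum()))  # True False
```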
doc/install.txt

@@ -375,7 +375,7 @@ If ``theano-nose`` is not found by your shell, you will need to add
 If you want GPU-related tests to run on a specific GPU device, and not
 the default one, you should use :attr:`~config.init_gpu_device`.
-For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=gpu1``.
+For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=cuda1``.
 See :ref:`libdoc_config` for more information on how to change these
 configuration options.

@@ -508,25 +508,25 @@ Any one of them is enough.
 :ref:`Ubuntu instructions <install_ubuntu_gpu>`.

 Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.

 Once that is done, the only thing left is to change the ``device`` option to name the GPU device in your
 computer, and set the default floating point computations to float32.
-For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu,floatX=float32'``.
+For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=cuda,floatX=float32'``.
 You can also set these options in the .theanorc file's ``[global]`` section:

 .. code-block:: cfg

     [global]
-        device = gpu
+        device = cuda
         floatX = float32

 Note that:

-* If your computer has multiple GPUs and you use 'device=gpu', the driver
-  selects the one to use (usually gpu0).
-* You can use the program nvida-smi to change this policy.
-* You can choose one specific GPU by specifying 'device=gpuX', with X the
+* If your computer has multiple GPUs and you use 'device=cuda', the driver
+  selects the one to use (usually cuda0).
+* You can use the program ``nvidia-smi`` to change this policy.
+* You can choose one specific GPU by specifying 'device=cudaX', with X the
   the corresponding GPU index (0, 1, 2, ...)
 * By default, when ``device`` indicates preference for GPU computations,
   Theano will fall back to the CPU if there is a problem with the GPU.

@@ -794,6 +794,8 @@ setup CUDA, but be aware of the following caveats:
   toggle your GPU on, which can be done with
   `gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.

+Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
+
 Once your setup is complete, head to :ref:`using_gpu` to find how to verify
 everything is working properly.
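As a cross-check of the ``[global]`` section layout introduced above, the fragment parses with Python's standard ``configparser``. This is only an illustration of the file's shape; Theano reads ``.theanorc`` with its own configuration machinery:

```python
import configparser

# The .theanorc [global] section from the instructions above, parsed with
# the stdlib configparser purely to show its layout.
theanorc = """
[global]
device = cuda
floatX = float32
"""

cfg = configparser.ConfigParser()
cfg.read_string(theanorc)
print(cfg["global"]["device"], cfg["global"]["floatX"])  # cuda float32
```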
doc/install_ubuntu.txt

@@ -43,7 +43,7 @@ For Ubuntu 11.10 through 14.04:
     sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git
     sudo pip install Theano

 On 14.04, this will install Python 2 by default. If you want to use Python 3:

 .. code-block:: bash

@@ -104,30 +104,30 @@ For Ubuntu 11.04:
 The development version of Theano supports Python 3.3 and
 probably supports Python 3.2, but we do not test on it.

 Bleeding Edge Installs
 ----------------------

 If you would like, instead, to install the bleeding edge Theano (from github)
 such that you can edit and contribute to Theano, replace the `pip install Theano`
 command with:

 .. code-block:: bash

     git clone git://github.com/Theano/Theano.git
     cd Theano
     python setup.py develop --user
     cd ..

 VirtualEnv
 ----------

 If you would like to install Theano in a VirtualEnv, you will want to pass the
 `--system-site-packages` flag when creating the VirtualEnv so that it will pick up
 the system-provided `Numpy` and `SciPy`.

 .. code-block:: bash

     virtualenv --system-site-packages -p python2.7 theano-env
     source theano-env/bin/activate
     pip install Theano

@@ -208,7 +208,7 @@ Updating Bleeding Edge Installs
 Change to the Theano directory and run:

 .. code-block:: bash

     git pull

@@ -303,7 +303,7 @@ Test GPU configuration
 .. code-block:: bash

-    THEANO_FLAGS=floatX=float32,device=gpu python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
+    THEANO_FLAGS=floatX=float32,device=cuda python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py

 .. note::
doc/install_windows.txt

@@ -423,16 +423,16 @@ Create a test file containing:
     print("NP time: %f[s], theano time: %f[s] (times should be close when run on CPU!)" %(
         np_end-np_start, t_end-t_start))
     print("Result difference: %f" % (np.abs(AB-tAB).max(), ))

-.. testoutput::
-   :hide:
-   :options: +ELLIPSIS
-
-   NP time: ...[s], theano time: ...[s] (times should be close when run on CPU!)
-   Result difference: ...
+.. code-block:: none
+
+    NP time: 1.480863[s], theano time: 1.475381[s] (times should be close when run on CPU!)
+    Result difference: 0.000000

@@ -445,6 +445,8 @@ routine for matrix multiplication)
 Configure Theano for GPU use
 ############################

+Install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_ if you have not already done so.
+
 Theano can be configured with a ``.theanorc`` text file (or
 ``.theanorc.txt``, whichever is easier for you to create under
 Windows). It should be placed in the directory pointed to by the

@@ -457,7 +459,7 @@ To use the GPU please write the following configuration file:
 .. code-block:: cfg

     [global]
-        device = gpu
+        device = cuda
         floatX = float32

     [nvcc]

@@ -498,7 +500,7 @@ within an MSYS shell if you installed Nose manually as described above.
 Compiling a faster BLAS
 ~~~~~~~~~~~~~~~~~~~~~~~

 If you installed Python through WinPython or EPD, Theano will automatically
 link with the MKL library, so you should not need to compile your own BLAS.

 .. note::
doc/library/config.txt

@@ -51,11 +51,11 @@ Environment Variables
 .. code-block:: bash

-    THEANO_FLAGS='floatX=float32,device=gpu0,lib.cnmem=1' python <myscript>.py
+    THEANO_FLAGS='floatX=float32,device=cuda0,lib.cnmem=1' python <myscript>.py

 If a value is defined several times in ``THEANO_FLAGS``,
 the right-most definition is used. So, for instance, if
-``THEANO_FLAGS='device=cpu,device=gpu0'``, then gpu0 will be used.
+``THEANO_FLAGS='device=cpu,device=cuda0'``, then cuda0 will be used.

 .. envvar:: THEANORC
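The right-most-wins rule documented above can be sketched with a small parser. This is a hypothetical helper for illustration, not Theano's actual flag handling (which also supports section-qualified names such as ``lib.cnmem``):

```python
def parse_theano_flags(flags):
    # Split a THEANO_FLAGS-style string on commas; later assignments to
    # the same key overwrite earlier ones, so the right-most wins.
    config = {}
    for item in flags.split(","):
        if not item.strip():
            continue
        key, _, value = item.partition("=")
        config[key.strip()] = value.strip()
    return config

# Right-most definition wins, as documented above.
print(parse_theano_flags("device=cpu,device=cuda0"))  # {'device': 'cuda0'}
```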
@@ -70,7 +70,7 @@ Environment Variables
     [global]
     floatX = float32
-    device = gpu0
+    device = cuda0

     [lib]
     cnmem = 1

@@ -102,22 +102,21 @@ import theano and print the config variable, as in:
 .. attribute:: device

-    String value: either ``'cpu'``, ``'gpu'``, ``'gpu0'``, ``'gpu1'``,
-    ``'gpu2'``, or ``'gpu3'``
+    String value: either ``'cpu'``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
+    ``'opencl0:0'``, ``'opencl0:1'``, ``'gpu'``, ``'gpu0'`` ...

-    Default device for computations. If ``gpu*``, change the default to try
-    to move computation to it and to put shared variable of float32 on
-    it.
-    Choose the default compute device for theano graphs. Setting this to a
-    ``gpu*`` string will make theano to try by default to move computation to it.
-    Also it will make theano put by default shared variable of float32 on it.
-    ``'gpu'`` lets the driver select the GPU to use, while ``'gpu?'`` makes Theano try
-    to use a specific device. If we are not able to use the GPU, either we fall back
-    on the CPU, or an error is raised, depending on the :attr:`force_device` flag.
+    Default device for computations. If ``'cuda*'``, change the default to try
+    to move computation to the GPU using CUDA libraries. If ``'opencl*'``,
+    the openCL libraries will be used. To let the driver select the device,
+    use ``'cuda'`` or ``'opencl'``. If ``'gpu*'``, the old gpu backend will
+    be used, although users are encouraged to migrate to the new GpuArray
+    backend. If we are not able to use the GPU,
+    either we fall back on the CPU, or an error is raised, depending
+    on the :attr:`force_device` flag.

     This flag's value cannot be modified during the program execution.

-    Do not use upper case letters, only lower case even if NVIDIA use
+    Do not use upper case letters, only lower case even if NVIDIA uses
     capital letters.

 .. attribute:: force_device

@@ -138,11 +137,12 @@ import theano and print the config variable, as in:
 .. attribute:: init_gpu_device

-    String value: either ``''``, ``'gpu'``, ``'gpu0'``, ``'gpu1'``, ``'gpu2'``,
-    or ``'gpu3'``
+    String value: either ``''``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
+    ``'opencl0:0'``, ``'opencl0:1'``, ``'gpu'``, ``'gpu0'`` ...

     Initialize the gpu device to use.
-    When its value is gpu*, the theano flag :attr:`device` must be ``"cpu"``.
+    When its value is ``'cuda*'``, ``'opencl*'`` or ``'gpu*'``, the theano
+    flag :attr:`device` must be ``'cpu'``.
     Unlike :attr:`device`, setting this flag to a specific GPU will not
     try to use this device by default, in particular it will **not** move
     computations, nor shared variables, to the specified GPU.
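The accepted device spellings listed above can be sketched as a small validator. The pattern below is hypothetical (Theano's real ``DeviceParam`` does its own checking), covering ``'cpu'``, the new back end (``'cuda'``, ``'cudaN'``, ``'openclA:B'``) and the old back end (``'gpu'``, ``'gpuN'``), lower case only as the documentation requires:

```python
import re

# Hypothetical pattern mirroring the device strings documented above.
_DEVICE_RE = re.compile(r"^(cpu|cuda(\d+)?|opencl\d+:\d+|gpu(\d+)?)$")

def is_valid_device(name):
    # Lower case only: 'GPU0' is rejected, as the docs warn.
    return _DEVICE_RE.match(name) is not None

print(is_valid_device("cuda0"), is_valid_device("opencl0:1"),
      is_valid_device("GPU0"))  # True True False
```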
doc/optimizations.txt

@@ -32,6 +32,7 @@ Optimization FAST_RUN FAST_COMPILE
 ========================================================= ========= ============ =============
 :term:`merge`                                             x         x
 :term:`constant folding<constant folding>`                x         x
+:term:`GPU transfer`                                      x         x
 :term:`shape promotion<shape promotion>`                  x
 :term:`fill cut<fill cut>`                                x
 :term:`inc_subtensor srlz.<inc_subtensor serialization>`  x

@@ -52,7 +53,6 @@ Optimization FAST_RUN FAST_COMPILE
 :term:`inplace_elemwise`                                  x
 :term:`inplace_random`                                    x
 :term:`elemwise fusion`                                   x
-:term:`GPU transfer`                                      x
 :term:`local_log_softmax`                                 x         x
 :term:`local_remove_all_assert`
 ========================================================= ========= ============ =============
doc/tutorial/aliasing.txt

@@ -261,52 +261,6 @@ combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
 hints that give more flexibility to the compilation and optimization of the
 graph.

-For GPU graphs, this borrowing can have a major speed impact. See the following code:
-
-.. code-block:: python
-
-    from theano import function, config, shared, sandbox, tensor, Out
-    import numpy
-    import time
-
-    vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
-    iters = 1000
-
-    rng = numpy.random.RandomState(22)
-    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-    f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
-    f2 = function([],
-                  Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
-                      borrow=True))
-
-    t0 = time.time()
-    for i in range(iters):
-        r = f1()
-    t1 = time.time()
-    no_borrow = t1 - t0
-
-    t0 = time.time()
-    for i in range(iters):
-        r = f2()
-    t1 = time.time()
-
-    print("Looping %s times took %s seconds without borrow "
-          "and %s seconds with borrow" % (iters, no_borrow, (t1 - t0)))
-
-    if numpy.any([isinstance(x.op, tensor.Elemwise) and
-                  ('Gpu' not in type(x.op).__name__)
-                  for x in f1.maker.fgraph.toposort()]):
-        print('Used the cpu')
-    else:
-        print('Used the gpu')
-
-Which produces this output:
-
-.. code-block:: none
-
-    $ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
-    Using gpu device 0: GeForce GTX 275
-    Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
-    Used the gpu
-
 *Take home message:*

 When an input *x* to a function is not needed after the function

@@ -317,4 +271,3 @@ requirement. When a return value *y* is large (in terms of memory
 footprint), and you only need to read from it once, right away when
 it's returned, then consider marking it with an ``Out(y, borrow=True)``.
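The removed GPU example demonstrated ``borrow=True`` avoiding a copy of the result buffer. A rough CPU-side analogue of the same idea, using NumPy views versus copies (illustrative only; it does not reproduce the GPU timing behaviour):

```python
import numpy

# A view shares the underlying buffer (like borrow=True); a copy does not.
x = numpy.arange(6.0)
view = x[:]       # basic slicing: no data is copied
copy = x.copy()   # a fresh, independent buffer

x[0] = 42.0       # mutate the original
print(view[0], copy[0])  # 42.0 0.0
```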
doc/tutorial/using_gpu.txt

(Diff collapsed; too large to display inline.)

doc/tutorial/using_gpu_solution_1.py

(Diff collapsed; too large to display inline.)
doc/tutorial/using_multi_gpu.txt

@@ -81,7 +81,7 @@ single name and a single device.
 It is often the case that multi-gpu operation requires or assumes
 that all the GPUs involved are equivalent. This is not the case
 for this implementation. Since the user has the task of
-distrubuting the jobs across the different device a model can be
+distributing the jobs across the different device a model can be
 built on the assumption that one of the GPU is slower or has
 smaller memory.

@@ -140,5 +140,5 @@ is a example.
     cv = gv.transfer('cpu')

 Of course you can mix transfers and operations in any order you
-choose.
-However you should try to minimize transfer operations
-because they will introduce overhead any may reduce performance.
+choose. However you should try to minimize transfer operations
+because they will introduce overhead that may reduce performance.
theano/configdefaults.py

@@ -104,10 +104,9 @@ class DeviceParam(ConfigParam):
 AddConfigVar(
     'device',
-    ("Default device for computations. If gpu*, change the default to try "
-     "to move computation to it and to put shared variable of float32 "
-     "on it. Do not use upper case letters, only lower case even if "
-     "NVIDIA use capital letters."),
+    ("Default device for computations. If cuda* or opencl*, change the "
+     "default to try to move computation to the GPU. Do not use upper case "
+     "letters, only lower case even if NVIDIA uses capital letters."),
     DeviceParam('cpu', allow_override=False),
     in_c_key=False)
theano/misc/check_blas.py  (mode changed 100755 → 100644)

@@ -86,15 +86,20 @@ def execute(execute=True, verbose=True, M=2000, N=2000, K=2000,
     t0 = 0
     t1 = -1

     f()  # Ignore first function call to get representative time.
     if execute:
         sync = (hasattr(theano, "sandbox") and
                 hasattr(theano.sandbox, "cuda") and
                 theano.sandbox.cuda.cuda_available)
+        sync2 = (hasattr(theano, "gpuarray") and
+                 theano.gpuarray.pygpu_activated)
         t0 = time.time()
         for i in range(iters):
             f()
         if sync:
             theano.sandbox.cuda.synchronize()
+        if sync2:
+            c.get_value(borrow=True,
+                        return_internal_type=True).sync()
         t1 = time.time()
     return t1 - t0, impl
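The timing pattern in check_blas.py above (one warm-up call, a timed loop, then an optional synchronization before reading the clock) can be sketched independently of Theano. The ``sync`` hook is hypothetical; on a GPU it would stand in for the backend's synchronize call, since kernel launches are asynchronous and timing without a final sync under-reports:

```python
import time

def time_iterations(f, iters, sync=None):
    # Warm-up call so one-time compilation/caching does not skew the result.
    f()
    t0 = time.time()
    for _ in range(iters):
        f()
    if sync is not None:
        # On a GPU backend this would block until queued work finishes.
        sync()
    return time.time() - t0

elapsed = time_iterations(lambda: sum(range(1000)), iters=10)
print(elapsed >= 0.0)  # True
```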
theano/sandbox/gpuarray/__init__.py

@@ -4,6 +4,7 @@ which refered to theano.sandbox.gpuarray."""
 import warnings

 from theano.gpuarray import *

-message = "theano.sandbox.gpuarray has been moved to theano.gpuarray." + \
-    " Please update your code and pickles."
+message = ("theano.sandbox.gpuarray has been moved to theano.gpuarray. "
+           "Please update your code and pickles. If the warning persists, "
+           "clear theano's cache ('$theano/bin/theano-cache clear').")

 warnings.warn(message)
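The shim above follows a common module-move pattern: the old import path re-exports everything from the new location and warns on import. A minimal sketch with made-up module names, using the standard ``warnings`` machinery:

```python
import warnings

def emit_move_warning():
    # Mirrors the shim above: build the message once, then warn on import.
    # Module names here are illustrative, not real packages.
    message = ("oldpkg.sandbox.mod has been moved to oldpkg.mod. "
               "Please update your code and pickles.")
    warnings.warn(message, DeprecationWarning)

# Capture the warning to show it fires exactly once per call.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    emit_move_warning()
print(len(caught), "moved" in str(caught[0].message))  # 1 True
```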