testgroup / pytensor · Commits · 78657991

Commit 78657991, authored March 29, 2017 by Arnaud Bergeron
Remove tentacles in doc.
Parent: 10432646
Showing 26 changed files with 129 additions and 758 deletions
doc/extending/ctype.txt                  +4   -4
doc/extending/extending_theano.txt       +2   -11
doc/extending/extending_theano_gpu.txt   +1   -1
doc/faq.txt                              +20  -19
doc/index.txt                            +6   -4
doc/install_ubuntu.txt                   +2   -5
doc/install_windows.txt                  +2   -2
doc/internal/release.txt                 +1   -2
doc/library/config.txt                   +9   -102
doc/library/sandbox/cuda/dnn.txt         +0   -168
doc/library/sandbox/cuda/index.txt       +0   -19
doc/library/sandbox/cuda/op.txt          +0   -37
doc/library/sandbox/cuda/type.txt        +0   -19
doc/library/sandbox/cuda/var.txt         +0   -21
doc/library/sandbox/index.txt            +0   -1
doc/library/tensor/nnet/conv.txt         +21  -76
doc/requirements.inc                     +0   -11
doc/troubleshooting.txt                  +7   -9
doc/tutorial/aliasing.txt                +6   -5
doc/tutorial/examples.txt                +3   -4
doc/tutorial/gpu_data_convert.txt        +0   -132
doc/tutorial/index.txt                   +0   -1
doc/tutorial/profiling.txt               +0   -15
doc/tutorial/using_gpu.txt               +43  -82
doc/tutorial/using_multi_gpu.txt         +2   -2
theano/configdefaults.py                 +0   -6
doc/extending/ctype.txt

@@ -527,8 +527,8 @@ You can implement c_code for this op. You register it like this:
 In your C code, you should use %(iname)s and %(oname)s to represent
 the C variable names of the DeepCopyOp input and output
-respectively. See an example for the type ``CudaNdarrayType`` (GPU
-array) in the file `theano/sandbox/cuda/type.py`. The version
+respectively. See an example for the type ``GpuArrayType`` (GPU
+array) in the file `theano/gpuarray/type.py`. The version
 parameter is what is returned by DeepCopyOp.c_code_cache_version(). By
 default, it will recompile the c code for each process.

@@ -549,8 +549,8 @@ calling:
 In your C code, you should use %(iname)s and %(oname)s to represent
 the C variable names of the ViewOp input and output
-respectively. See an example for the type ``CudaNdarrayType`` (GPU
-array) in the file `theano/sandbox/cuda/type.py`. The version
+respectively. See an example for the type ``GpuArrayType`` (GPU
+array) in the file `theano/gpuarray/type.py`. The version
 parameter is what is returned by ViewOp.c_code_cache_version(). By
 default, it will recompile the c code for each process.
doc/extending/extending_theano.txt

@@ -98,7 +98,7 @@ possibilities you may encounter or need. For that refer to
     def c_code(self, node, inputs, outputs, sub):
         pass

-    # Other implementations (pycuda, ...):
+    # Other implementations:
     def make_thunk(self, node, storage_map, _, _2, impl=None):
         pass

@@ -194,8 +194,7 @@ or :func:`make_thunk`.
 It should have a default value of None.
 :func:`make_thunk` is useful if you want to generate code and compile
-it yourself. For example, this allows you to use PyCUDA to compile GPU
-code and keep state in the thunk.
+it yourself.
 If :func:`make_thunk()` is defined by an op, it will be used by Theano
 to obtain the op's implementation.

@@ -674,14 +673,6 @@ For instance, to verify the Rop method of the DoubleOp, you can use this:
     def test_double_rop(self):
         self.check_rop_lop(DoubleRop()(self.x), self.in_shape)

-Testing GPU Ops
-^^^^^^^^^^^^^^^
-When using the old GPU backend, Ops to be executed on the GPU should inherit
-from ``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
-Theano to distinguish them. Currently, we use this to test if the
-NVIDIA driver works correctly with our sum reduction code on the GPU.

 Running Your Tests
 ^^^^^^^^^^^^^^^^^^
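The ``make_thunk`` pattern this hunk touches can be sketched in plain Python: the op returns a closure that reads its input from storage cells and writes its output back. This is a hypothetical illustration of the shape of the interface, not Theano's real ``make_thunk`` (which works against node and storage_map objects):

```python
# A minimal, hypothetical sketch of the thunk pattern: storage cells
# are one-element lists so results can be written in place.
def make_thunk(storage_map, inp, out):
    def thunk():
        # The "op" here simply doubles its input.
        storage_map[out][0] = 2 * storage_map[inp][0]
    return thunk

storage = {"x": [3], "y": [None]}
thunk = make_thunk(storage, "x", "y")
thunk()
print(storage["y"][0])  # 6
```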
doc/extending/extending_theano_gpu.txt

@@ -309,5 +309,5 @@ As long as the computations happen on the NULL stream there are no
 special considerations to watch for with regards to synchronization.
 Otherwise, you will have to make sure that you synchronize the pygpu
 objects by calling the `.sync()` method before scheduling any work and
-synchronize with the work that happends in the library after all the
+synchronize with the work that happens in the library after all the
 work is scheduled.
doc/faq.txt

@@ -51,28 +51,29 @@ optimizations and disables the generation of any c/cuda code. This is useful
 for quickly testing a simple idea.

 If c/cuda code is necessary, as when using a GPU, the flag
-``optimizer=fast_compile`` can be used instead. It instructs Theano to skip time
-consuming optimizations but still generate c/cuda code. To get the most out of
-this flag requires using a development version of Theano instead of the latest
-release (0.6).
+``optimizer=fast_compile`` can be used instead. It instructs Theano to
+skip time consuming optimizations but still generate c/cuda code.

 Similarly using the flag ``optimizer_excluding=inplace`` will speed up
-compilation by preventing optimizations that replace operations with a version
-that reuses memory where it will not negatively impact the integrity of the
-operation. Such optimizations can be time consuming. However using this flag will
-result in greater memory usage because space must be allocated for the results
-which would be unnecessary otherwise. In short, using this flag will speed up
+compilation by preventing optimizations that replace operations with a
+version that reuses memory where it will not negatively impact the
+integrity of the operation. Such optimizations can be time
+consuming. However using this flag will result in greater memory usage
+because space must be allocated for the results which would be
+unnecessary otherwise. In short, using this flag will speed up
 compilation but it will also use more memory because
-``optimizer_excluding=inplace`` excludes inplace optimizations resulting
-in a trade off between speed of compilation and memory usage.
+``optimizer_excluding=inplace`` excludes inplace optimizations
+resulting in a trade off between speed of compilation and memory
+usage.

-Theano flag `reoptimize_unpickled_function` controls if an unpickled theano function
-should reoptimize its graph or not. Theano users can use the standard python pickle
-tools to save a compiled theano function. When pickling, both graph before and
-after the optimization are saved, including shared variables. When set to True,
-the graph is reoptimized when being unpickled. Otherwise, skip the graph optimization
-and use directly the optimized graph from the pickled file. After Theano 0.7,
-the default changed to False.
+Theano flag `reoptimize_unpickled_function` controls if an unpickled
+theano function should reoptimize its graph or not. Theano users can
+use the standard python pickle tools to save a compiled theano
+function. When pickling, both graph before and after the optimization
+are saved, including shared variables. When set to True, the graph is
+reoptimized when being unpickled. Otherwise, skip the graph
+optimization and use directly the optimized graph from the pickled
+file. The default is False.

 Faster Theano function
 ----------------------
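The pickling behavior the FAQ hunk describes can be illustrated with the standard-library ``pickle`` module. The dict layout below is invented for the sketch (real theano functions pickle far more state); the point is that both graphs travel in the pickle and the flag decides which is used on load:

```python
import pickle

# Toy stand-in for a compiled theano function's pickled state: the
# pre- and post-optimization graphs are saved together.
state = {"graph_before": ["x", "y", "x + y"], "graph_after": ["add(x, y)"]}
restored = pickle.loads(pickle.dumps(state))

reoptimize_unpickled_function = False  # the default described above
graph = (restored["graph_before"] if reoptimize_unpickled_function
         else restored["graph_after"])
print(graph)  # ['add(x, y)']
```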
doc/index.txt

@@ -21,6 +21,9 @@ learning/machine learning <https://mila.umontreal.ca/en/cours/>`_ classes).
 News
 ====

+* Removed support for the old (device=gpu) backend. Use the new
+  backend (device=cuda) for gpu computing.
 * 2017/03/20: Release of Theano 0.9.0. Everybody is encouraged to update.
 * 2017/03/13: Release of Theano 0.9.0rc4, with crash fixes and bug fixes.

@@ -37,7 +40,7 @@ News
 `Theano: A Python framework for fast computation of mathematical expressions <http://arxiv.org/abs/1605.02688>`_.
 This is the new preferred reference.
-* 2016/04/21: Release of Theano 0.8.2, adding support for
-  :ref:`CuDNN v5 <libdoc_cuda_dnn>`.
+* 2016/04/21: Release of Theano 0.8.2, adding support for CuDNN v5.
 * 2016/03/29: Release of Theano 0.8.1, fixing a compilation issue on MacOS X with XCode 7.3.

@@ -45,12 +48,11 @@ News
 * Multi-GPU.
-* We added support for :attr:`CNMeM <config.lib.cnmem>` to speed up
-  the GPU memory allocation.
+* We added support for CNMeM to speed up the GPU memory allocation.
 * Theano 0.7 was released 26th March 2015. Everybody is encouraged to update.
-* We support
-  `cuDNN <http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html>`_
-  if it is installed by the user.
+* We support cuDNN if it is installed by the user.
 * Open Machine Learning Workshop 2014 `presentation <omlw2014/omlw_presentation.pdf>`_.
doc/install_ubuntu.txt

@@ -24,8 +24,8 @@ Ubuntu Installation Instructions
 Prerequisites through System Packages (not recommended)
 -------------------------------------------------------

-If you want to acquire the requirements through your system packages and install
-them system wide follow these instructions:
+If you want to acquire the requirements through your system packages
+and install them system wide follow these instructions:

 For Ubuntu 16.04 with cuda 7.5

@@ -49,9 +49,6 @@ For Ubuntu 16.04 with cuda 7.5
     sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++ 30
     sudo update-alternatives --set c++ /usr/bin/g++

-    # Work around a glibc bug
-    echo -e "\n[nvcc]\nflags=-D_FORCE_INLINES\n" >> ~/.theanorc

 For Ubuntu 11.10 through 14.04:

 .. code-block:: bash
doc/install_windows.txt

@@ -54,8 +54,8 @@ You must reboot the computer after the driver installation.
 Instructions for other Python distributions (not recommended)
 =============================================================

-If you plan to use Theano with other Python distributions, these are generic
-guidelines to get a working environment:
+If you plan to use Theano with other Python distributions, these are
+generic guidelines to get a working environment:

 * Look for the mandatory requirements in the package manager's repositories of your distribution. Many
   distributions come with ``pip`` package manager which use `PyPI repository <https://pypi.python.org/pypi>`__.
doc/internal/release.txt

@@ -9,8 +9,7 @@ out Theano easy. You can install a stable version of Theano, without having to
 worry about the current state of the repository. While we usually try NOT to
 break the trunk, mistakes can happen. This also greatly simplifies the
 installation process: mercurial is no longer required and certain python
-dependencies can be handled automatically (numpy for now, maybe pycuda, cython
-later).
+dependencies can be handled automatically (numpy for now, cython later).

 The Theano release plan is detailed below. Comments and/or suggestions are
 welcome on the mailing list.
doc/library/config.txt

@@ -51,7 +51,7 @@ Environment Variables

 .. code-block:: bash

-    THEANO_FLAGS='floatX=float32,device=cuda0,lib.cnmem=1' python <myscript>.py
+    THEANO_FLAGS='floatX=float32,device=cuda0,gpuarray.preallocate=1' python <myscript>.py

 If a value is defined several times in ``THEANO_FLAGS``,
 the right-most definition is used. So, for instance, if

@@ -72,15 +72,15 @@ Environment Variables

     floatX = float32
     device = cuda0
-    [lib]
-    cnmem = 1
+    [gpuarray]
+    preallocate = 1

 Configuration attributes that are available directly in ``config``
 (e.g. ``config.device``, ``config.mode``) should be defined in the
 ``[global]`` section.

-Attributes from a subsection of ``config`` (e.g. ``config.lib.cnmem``,
+Attributes from a subsection of ``config`` (e.g. ``config.gpuarray.preallocate``,
 ``config.dnn.conv.algo_fwd``) should be defined in their corresponding
-section (e.g. ``[nvcc]``, ``[dnn.conv]``).
+section (e.g. ``[gpuarray]``, ``[dnn.conv]``).

 Multiple configuration files can be specified by separating them with ':'
 characters (as in $PATH). Multiple configuration files will be merged,
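The "right-most definition wins" rule quoted in the hunk above can be sketched with a few lines of Python. This parser is illustrative only, not Theano's actual flag handling:

```python
# Sketch: later comma-separated assignments in THEANO_FLAGS override
# earlier ones, i.e. the right-most definition is used.
def parse_flags(flags):
    out = {}
    for item in flags.split(","):
        if item:
            key, _, value = item.partition("=")
            out[key.strip()] = value.strip()
    return out

flags = parse_flags("floatX=float32,device=cuda0,device=cuda1")
print(flags["device"])  # cuda1
```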
@@ -105,14 +105,12 @@ import theano and print the config variable, as in:

 .. attribute:: device

     String value: either ``'cpu'``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
-    ``'opencl0:0'``, ``'opencl0:1'``, ``'gpu'``, ``'gpu0'``, ...
+    ``'opencl0:0'``, ``'opencl0:1'``, ...

     Default device for computations. If ``'cuda*'``, change the default to try
     to move computation to the GPU using CUDA libraries. If ``'opencl*'``,
     the openCL libraries will be used. To let the driver select the device,
-    use ``'cuda'`` or ``'opencl'``. If ``'gpu*'``, the old gpu backend will
-    be used, although users are encouraged to migrate to the new GpuArray
-    backend. If we are not able to use the GPU,
+    use ``'cuda'`` or ``'opencl'``. If we are not able to use the GPU,
     either we fall back on the CPU, or an error is raised, depending
     on the :attr:`force_device` flag.

@@ -140,10 +138,10 @@ import theano and print the config variable, as in:

 .. attribute:: init_gpu_device

     String value: either ``''``, ``'cuda'``, ``'cuda0'``, ``'cuda1'``,
-    ``'opencl0:0'``, ``'opencl0:1'``, ``'gpu'``, ``'gpu0'``, ...
+    ``'opencl0:0'``, ``'opencl0:1'``, ...

     Initialize the gpu device to use.
-    When its value is ``'cuda*'``, ``'opencl*'`` or ``'gpu*'``, the theano
+    When its value is ``'cuda*'`` or ``'opencl*'``, the theano
     flag :attr:`device` must be ``'cpu'``.
     Unlike :attr:`device`, setting this flag to a specific GPU will not
     try to use this device by default, in particular it will **not** move

@@ -154,20 +152,6 @@ import theano and print the config variable, as in:

     This flag's value cannot be modified during the program execution.

-.. attribute:: config.pycuda.init
-
-    Bool value: either ``True`` or ``False``
-
-    Default: ``False``
-
-    If True, always initialize PyCUDA when Theano want to initialize
-    the GPU. With PyCUDA version 2011.2.2 or earlier, PyCUDA must
-    initialize the GPU before Theano does it. Setting
-    this flag to True, ensure that, but always import PyCUDA. It can
-    be done manually by importing ``theano.misc.pycuda_init`` before
-    Theano initialize the GPU device. Newer version of PyCUDA
-    (currently only in the trunk) don't have this restriction.

 .. attribute:: print_active_device

     Bool value: either ``True`` or ``False``

@@ -176,14 +160,6 @@ import theano and print the config variable, as in:

     Print active device at when the GPU device is initialized.

-.. attribute:: enable_initial_driver_test
-
-    Bool value: either ``True`` or ``False``
-
-    Default: ``True``
-
-    Tests the nvidia driver when a GPU device is initialized.

 .. attribute:: floatX

     String value: ``'float64'``, ``'float32'``, or ``'float16'`` (with limited support)

@@ -455,48 +431,6 @@ import theano and print the config variable, as in:

     automatically to get more memory. But this can cause
     fragmentation, see note above.

-.. attribute:: config.lib.cnmem
-
-    .. note::
-        This value allocates GPU memory ONLY when using (:ref:`cuda`)
-        and has no effect when the GPU backend is (:ref:`gpuarray`).
-        For the new backend, please see ``config.gpuarray.preallocate``
-
-    Float value: >= 0
-
-    Controls the use of `CNMeM <https://github.com/NVIDIA/cnmem>`_ (a
-    faster CUDA memory allocator). Applies to the old GPU backend
-    :ref:`cuda` up to Theano release 0.8.
-    The CNMeM library is included in Theano and does not need to be
-    separately installed.
-
-    The value represents the start size (either in MB or the fraction of total GPU
-    memory) of the memory pool. If more memory is needed, Theano will
-    try to obtain more, but this can cause memory fragmentation.
-
-    * 0: not enabled.
-    * 0 < N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory).
-    * > 1: use this number in megabytes (MB) of memory.
-
-    Default: 0
-
-    .. note::
-        This could cause memory fragmentation. So if you have a
-        memory error while using CNMeM, try to allocate more memory at
-        the start or disable it. If you try this, report your result
-        on :ref:`theano-dev`.
-
-    .. note::
-        The clipping at 95% can be bypassed by specifying the exact
-        number of megabytes. If more than 95% are needed, it will try
-        automatically to get more memory. But this can cause
-        fragmentation, see note above.

 .. attribute:: config.gpuarray.sched

     String value: ``'default'``, ``'multi'``, ``'single'``
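The pool-sizing rule of the removed ``config.lib.cnmem`` flag (0 disables preallocation, a value in (0, 1] is a fraction of total GPU memory clipped at 95%, a value above 1 is an absolute size in megabytes) can be sketched as a pure function. This is illustrative only; whether its ``gpuarray.preallocate`` successor uses the exact same clipping is not stated here:

```python
# Sketch of the cnmem start-size rule quoted in the removed block.
def pool_size_mb(value, total_mb):
    if value <= 0:
        return 0                           # not enabled
    if value <= 1:
        return min(value, 0.95) * total_mb  # fraction, clipped at 95%
    return value                            # absolute size in MB

assert pool_size_mb(0, 8000) == 0
assert pool_size_mb(0.5, 8000) == 4000.0
assert abs(pool_size_mb(1, 8000) - 7600) < 1e-6   # clipped to 95%
assert pool_size_mb(2048, 8000) == 2048
```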
@@ -664,20 +598,6 @@ import theano and print the config variable, as in:

     As such this optimization does not always introduce an assert in the graph.
     Removing the assert could speed up execution.

-.. attribute:: config.cuda.root
-
-    Default: $CUDA_ROOT or failing that, ``"/usr/local/cuda"``
-
-    A directory with bin/, lib/, include/ folders containing cuda utilities.
-
-.. attribute:: config.cuda.enabled
-
-    Bool value: either ``True`` or ``False``
-
-    Default: ``True``
-
-    If set to `False`, C code in old backend is not compiled.

 .. attribute:: config.dnn.enabled

     String value: ``'auto'``, ``'True'``, ``'False'``

@@ -774,19 +694,6 @@ import theano and print the config variable, as in:

     This can be any compiler binary (full path or not) but things may
     break if the interface is not g++-compatible to some degree.

-.. attribute:: config.nvcc.fastmath
-
-    Bool value, default: ``False``
-
-    If true, this will enable fastmath (|use_fast_math|_)
-    mode for compiled cuda code which makes div and sqrt faster at the
-    cost of precision. This also disables support for denormal
-    numbers. This can cause NaN. So if you have NaN and use this flag,
-    try to disable it.
-
-    .. |use_fast_math| replace:: ``--use-fast-math``
-    .. _use_fast_math: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#options-for-steering-cuda-compilation

 .. attribute:: config.optimizer_excluding

     Default: ``""``
doc/library/sandbox/cuda/dnn.txt (deleted, 100644 → 0)
.. _libdoc_cuda_dnn:
=======================================
:mod:`theano.sandbox.cuda.dnn` -- cuDNN
=======================================
.. moduleauthor:: LISA
`cuDNN <https://developer.nvidia.com/cuDNN>`_ is an NVIDIA library with
functionality used by deep neural network. It provides optimized versions
of some operations like the convolution. cuDNN is not currently
installed with CUDA. You must download and install it
yourself.
To install it, decompress the downloaded file and make the ``*.h`` and
``*.so*`` files available to the compilation environment.
There are at least three possible ways of doing so:
- The easiest is to include them in your CUDA installation. Copy the
``*.h`` files to ``CUDA_ROOT/include`` and the ``*.so*`` files to
``CUDA_ROOT/lib64`` (by default, ``CUDA_ROOT`` is ``/usr/local/cuda``
on Linux).
- Alternatively, on Linux, you can set the environment variables
``LD_LIBRARY_PATH``, ``LIBRARY_PATH`` and ``CPATH`` to the directory
extracted from the download. If needed, separate multiple directories
with ``:`` as in the ``PATH`` environment variable.
example::
export LD_LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
export CPATH=/home/user/path_to_CUDNN_folder/include:$CPATH
export LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LIBRARY_PATH
- And as a third way, also on Linux, you can copy the ``*.h`` files
to ``/usr/include`` and the ``*.so*`` files to ``/lib64``.
By default, Theano will detect if it can use cuDNN. If so, it will use
it. If not, Theano optimizations will not introduce cuDNN ops. So
Theano will still work if the user did not introduce them manually.
The recently added Theano flag :attr:`dnn.enabled
<config.dnn.enabled>` allows changing the default behavior to force
it on or disable it. Older Theano versions do not support this flag. To
get an error when cuDNN can not be used with them, use this flag:
``optimizer_including=cudnn``.
.. note::
cuDNN v5.1 is supported in the Theano master version, which dropped
cuDNN v3 support. Theano 0.8.0 and 0.8.1 support only cuDNN v3 and v4.
Theano 0.8.2 supports only v4 and v5.
.. note::
Starting in cuDNN v3, multiple convolution implementations are offered and
it is possible to use heuristics to automatically choose a convolution
implementation well suited to the parameters of the convolution.
The Theano flag ``dnn.conv.algo_fwd`` allows specifying the cuDNN
convolution implementation that Theano should use for forward convolutions.
Possible values include:
* ``small`` (default) : use a convolution implementation with small memory
usage
* ``none`` : use a slower implementation with minimal memory usage
* ``large`` : use a sometimes faster implementation with large memory usage
* ``fft`` : use the Fast Fourier Transform implementation of convolution
(very high memory usage)
* ``fft_tiling`` : use the Fast Fourier Transform implementation of convolution
with tiling (high memory usage, but less than fft)
* ``guess_once`` : the first time a convolution is executed, the
implementation to use is chosen according to cuDNN's heuristics and reused
for every subsequent execution of the convolution.
* ``guess_on_shape_change`` : like ``guess_once`` but a new convolution
implementation selected every time the shapes of the inputs and kernels
don't match the shapes from the last execution.
* ``time_once`` : the first time a convolution is executed, every convolution
implementation offered by cuDNN is executed and timed. The fastest is
reused for every subsequent execution of the convolution.
* ``time_on_shape_change`` : like ``time_once`` but a new convolution
implementation selected every time the shapes of the inputs and kernels
don't match the shapes from the last execution.
The Theano flags ``dnn.conv.algo_bwd_filter`` and
``dnn.conv.algo_bwd_data`` allow specifying the cuDNN
convolution implementation that Theano should use for gradient
convolutions. Possible values include:
* ``none`` (default) : use the default non-deterministic convolution
implementation
* ``deterministic`` : use a slower but deterministic implementation
* ``fft`` : use the Fast Fourier Transform implementation of convolution
(very high memory usage)
* ``guess_once`` : the first time a convolution is executed, the
implementation to use is chosen according to cuDNN's heuristics and reused
for every subsequent execution of the convolution.
* ``guess_on_shape_change`` : like ``guess_once`` but a new convolution
implementation selected every time the shapes of the inputs and kernels
don't match the shapes from the last execution.
* ``time_once`` : the first time a convolution is executed, every convolution
implementation offered by cuDNN is executed and timed. The fastest is
reused for every subsequent execution of the convolution.
* ``time_on_shape_change`` : like ``time_once`` but a new convolution
implementation selected every time the shapes of the inputs and kernels
don't match the shapes from the last execution.
* (algo_bwd_data only) ``fft_tiling`` : use the Fast Fourier
Transform implementation of convolution with tiling (high memory
usage, but less than fft)
* (algo_bwd_data only) ``small`` : use a convolution implementation
with small memory usage
``guess_*`` and ``time_*`` flag values take into account the amount of
available memory when selecting an implementation. This means that slower
implementations might be selected if not enough memory is available for the
faster implementations.
.. note::
Normally you should not call GPU Ops directly, but the CPU interface
currently does not allow all options supported by cuDNN ops. So it is
possible that you will need to call them manually.
.. note::
The cuDNN documentation states that, for the two following operations,
reproducibility is not guaranteed with the default implementation:
`cudnnConvolutionBackwardFilter` and `cudnnConvolutionBackwardData`.
Those correspond to the gradient wrt the weights and the gradient wrt the
input of the convolution. They are also used sometimes in the forward
pass, when they give a speed up.
The Theano flag ``dnn.conv.algo_bwd`` can be used to force the use of a
slower but deterministic convolution implementation.
.. note::
There is a problem we do not understand yet when cuDNN paths
contain symbolic links, so avoid using them.
.. note::
cudnn.so* must be readable and executable by everybody.
cudnn.h must be readable by everybody.
- Convolution:
- :func:`theano.sandbox.cuda.dnn.dnn_conv`, :func:`theano.sandbox.cuda.dnn.dnn_conv3d`.
- :func:`theano.sandbox.cuda.dnn.dnn_gradweight`.
- :func:`theano.sandbox.cuda.dnn.dnn_gradinput`.
- Pooling:
- :func:`theano.sandbox.cuda.dnn.dnn_pool`.
- Batch Normalization:
- :func:`theano.sandbox.cuda.dnn.dnn_batch_normalization_train`
- :func:`theano.sandbox.cuda.dnn.dnn_batch_normalization_test`.
- RNN:
- :class:`New back-end only! <theano.gpuarray.dnn.RNNBlock>`.
- Softmax:
- You can manually use the op :class:`GpuDnnSoftmax
<theano.sandbox.cuda.dnn.GpuDnnSoftmax>` to use its extra feature.
List of Implemented Operations
==============================
.. automodule:: theano.sandbox.cuda.dnn
:members:
doc/library/sandbox/cuda/index.txt (deleted, 100644 → 0)
.. _libdoc_sandbox_cuda:
===========================================
:mod:`sandbox.cuda` -- The CUDA GPU backend
===========================================
.. module:: sandbox.cuda
:platform: Unix, Windows
:synopsis: Code for GPU programming
.. moduleauthor:: LISA
.. toctree::
:maxdepth: 1
op
var
type
dnn
doc/library/sandbox/cuda/op.txt (deleted, 100644 → 0)
.. _libdoc_cuda_op:
======================================================
:mod:`sandbox.cuda` -- List of CUDA GPU Op implemented
======================================================
.. moduleauthor:: LISA
Normally you should not call those Ops directly! Theano should automatically transform CPU ops to their GPU equivalents, so this list is just useful to let people know what is implemented on the GPU.
Basic Op
========
.. automodule:: theano.sandbox.cuda.basic_ops
:members:
Blas Op
=======
.. automodule:: theano.sandbox.cuda.blas
:members:
.. autoclass:: theano.sandbox.cuda.blas.GpuBatchedDot
Nnet Op
=======
.. automodule:: theano.sandbox.cuda.nnet
:members:
Curand Op
=========
Random generator based on the CURAND libraries. It is not inserted automatically.
.. automodule:: theano.sandbox.cuda.rng_curand
:members:
doc/library/sandbox/cuda/type.txt (deleted, 100644 → 0)
.. ../../../../theano/sandbox/cuda/type.py
.. ../../../../theano/sandbox/cuda/var.py
.. ../../../../theano/sandbox/cuda/
.. _libdoc_cuda_type:
======================================================================
:mod:`sandbox.cuda.type` -- The Type object for Cuda-allocated arrays
======================================================================
.. module:: sandbox.cuda.type
:platform: Unix, Windows
:synopsis: The Type object for CUDA-allocated arrays
.. moduleauthor:: LISA
API
===
doc/library/sandbox/cuda/var.txt (deleted, 100644 → 0)
.. ../../../../theano/sandbox/cuda/type.py
.. ../../../../theano/sandbox/cuda/var.py
.. ../../../../theano/sandbox/cuda/
.. _libdoc_cuda_var:
===================================================================
:mod:`sandbox.cuda.var` -- The Variables for Cuda-allocated arrays
===================================================================
.. module:: sandbox.cuda.var
:platform: Unix, Windows
:synopsis: The Variables object for CUDA-allocated arrays
.. moduleauthor:: LISA
API
===
.. autoclass:: theano.sandbox.cuda.var.CudaNdarraySharedVariable
:members: get_value, set_value
doc/library/sandbox/index.txt

@@ -13,7 +13,6 @@

 .. toctree::
    :maxdepth: 1

-   cuda/index
    linalg
    neighbours
    rng_mrg
doc/library/tensor/nnet/conv.txt
浏览文件 @
78657991
...
...
@@ -40,7 +40,7 @@
computations in the un-optimized graph, and cause problems with DebugMode,
test values, and when compiling with optimizer=None.
By default, if :ref:`cuDNN <libdoc_
cuda
_dnn>`
By default, if :ref:`cuDNN <libdoc_
gpuarray
_dnn>`
is available, we will use it, otherwise we will fall back to using the
gemm version (slower than cuDNN in most cases and uses more memory).
...
...
@@ -70,61 +70,27 @@ TODO: Give examples on how to use these things! They are pretty complicated.
- Implemented operators for neural network 2D / image convolution:
- :func:`nnet.conv.conv2d <theano.tensor.nnet.conv.conv2d>`.
CPU convolution implementation, previously used as the convolution interface.
This is the standard operator for convolutional neural networks working
with batches of multi-channel 2D images, available. It
computes a convolution, i.e., it flips the kernel.
CPU convolution implementation, previously used as the
convolution interface. This is the standard operator for
convolutional neural networks working with batches of
multi-channel 2D images, available. It computes a convolution,
i.e., it flips the kernel.
Most of the more efficient GPU implementations listed below can be
inserted automatically as a replacement for nnet.conv.conv2d via graph
optimizations. Some of these graph optimizations are enabled by default,
others can be enabled via Theano flags.
Since November 24th, 2014, you can also use a meta-optimizer to
automatically choose the fastest implementation for each specific
convolution in your graph using the old interface. For each instance,
it will
compile and benchmark each applicable implementation of the ones
You can also use a meta-optimizer to automatically choose the
fastest implementation for each specific convolution in your
graph using the old interface. For each instance, it will
compile and benchmark each applicable implementation of the ones
listed below and choose the fastest one.
As performance is dependent on input and filter shapes, this
only works for operations introduced via nnet.conv.conv2d with fully specified
shape information.
Enable it via the Theano flag ``optimizer_including=conv_meta``, and
optionally set it to verbose mode via the flag `metaopt.verbose=1`.
- :func:`conv2d_fft <theano.sandbox.cuda.fftconv.conv2d_fft>` This
is a GPU-only version of nnet.conv2d that uses an FFT transform
to perform the work. It flips the kernel just like ``conv2d``.
conv2d_fft should not be used directly as
it does not provide a gradient. Instead, use nnet.conv2d and
allow Theano's graph optimizer to replace it by the FFT version
by setting 'THEANO_FLAGS=optimizer_including=conv_fft'
in your environment. If enabled, it will take precedence over cuDNN
and the gemm version. It is not enabled by default because it
has some restrictions on input and uses a lot more memory. Also
note that it requires CUDA >= 5.0, scikits.cuda >= 0.5.0 and
PyCUDA to run. To deactivate the FFT optimization on a specific
nnet.conv2d while the optimization flag is active, you can set
its ``version`` parameter to ``'no_fft'``. To enable it for just
one Theano function:
.. code-block:: python
mode = theano.compile.get_default_mode()
mode = mode.including('conv_fft')
f = theano.function(..., mode=mode)
- `cuda-convnet wrapper for 2d correlation <http://deeplearning.net/software/pylearn2/library/alex.html>`_
Wrapper for an open-source GPU-only implementation of conv2d by Alex
Krizhevsky, very fast, but with several restrictions on input and kernel
shapes, and with a different memory layout for the input. It does not
flip the kernel.
This is in Pylearn2, where it is normally called from the `linear transform
<http://deeplearning.net/software/pylearn2/library/linear.html>`_
implementation, but it can also be used `directly from within Theano
<http://benanne.github.io/2014/04/03/faster-convolutions-in-theano.html>`_
as a manual replacement for nnet.conv2d.
- :func:`GpuCorrMM <theano.sandbox.cuda.blas.GpuCorrMM>`
only works for operations introduced via nnet.conv.conv2d with
fully specified shape information. Enable it via the Theano
flag ``optimizer_including=conv_meta``, and optionally set it to
verbose mode via the flag `metaopt.verbose=1`.
- :func:`GpuCorrMM <theano.gpuarray.blas.GpuCorrMM>`
This is a GPU-only 2d correlation implementation taken from
`caffe's CUDA implementation <https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu>`_
and also used by Torch. It does not flip the kernel.
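As an illustrative sketch (not part of the original page), the two flags mentioned above can also be placed in a ``.theanorc`` file, following Theano's usual mapping of a dotted flag ``a.b`` to option ``b`` in section ``[a]``:

```cfg
# Enable the convolution meta-optimizer, with verbose output.
[global]
optimizer_including = conv_meta

[metaopt]
verbose = 1
```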
...
...
@@ -149,7 +115,7 @@ TODO: Give examples on how to use these things! They are pretty complicated.
you can use it as a replacement for nnet.conv2d. For convolutions done on
CPU, nnet.conv2d will be replaced by CorrMM. To explicitly disable it, set
``THEANO_FLAGS=optimizer_excluding=conv_gemm`` in your environment.
- :func:`dnn_conv <theano.sandbox.cuda.dnn.dnn_conv>` GPU-only
- :func:`dnn_conv <theano.gpuarray.dnn.dnn_conv>` GPU-only
convolution using NVIDIA's cuDNN library. This requires that you have
cuDNN 4.0 or newer installed and available, which in turn requires CUDA 7.0
and a GPU with compute capability 3.0 or more.
...
...
@@ -162,25 +128,7 @@ TODO: Give examples on how to use these things! They are pretty complicated.
- :func:`conv3D <theano.tensor.nnet.Conv3D.conv3D>`
3D Convolution applying multi-channel 3D filters to batches of
multi-channel 3D images. It does not flip the kernel.
- :func:`conv3d_fft <theano.sandbox.cuda.fftconv.conv3d_fft>`
GPU-only version of conv3D using FFT transform. conv3d_fft should
not be called directly as it does not provide a gradient.
Instead, use conv3D and allow Theano's graph optimizer to replace it by
the FFT version by setting
``THEANO_FLAGS=optimizer_including=conv3d_fft:convgrad3d_fft:convtransp3d_fft``
in your environment. This is not enabled by default because it does not
support strides and uses more memory. Also note that it requires
CUDA >= 5.0, scikits.cuda >= 0.5.0 and PyCUDA to run.
To enable for just one Theano function:
.. code-block:: python
mode = theano.compile.get_default_mode()
mode = mode.including('conv3d_fft', 'convgrad3d_fft', 'convtransp3d_fft')
f = theano.function(..., mode=mode)
- :func:`GpuCorr3dMM <theano.sandbox.cuda.blas.GpuCorr3dMM>`
- :func:`GpuCorr3dMM <theano.gpuarray.blas.GpuCorr3dMM>`
This is a GPU-only 3d correlation relying on a Toeplitz matrix
and gemm implementation (see :func:`GpuCorrMM <theano.sandbox.cuda.blas.GpuCorrMM>`)
It needs extra memory for the Toeplitz matrix, which is a 2D matrix of shape
...
...
@@ -203,27 +151,24 @@ TODO: Give examples on how to use these things! They are pretty complicated.
nnet.conv3d will be replaced by Corr3dMM. To explicitly disable it, set
``THEANO_FLAGS=optimizer_excluding=conv_gemm`` in your environment.
- :func:`dnn_conv3d <theano.sandbox.cuda.dnn.dnn_conv3d>` GPU-only
- :func:`dnn_conv <theano.gpuarray.dnn.dnn_conv>` GPU-only
convolution using NVIDIA's cuDNN library. This requires that you have
cuDNN 4.0 or newer installed and available, which in turn requires CUDA 7.0
and a GPU with compute capability 3.0 or more.
If cuDNN is available, by default, Theano will replace all nnet.conv3d
operations with dnn_conv3d. To explicitly disable it, set
operations with dnn_conv. To explicitly disable it, set
``THEANO_FLAGS=optimizer_excluding=conv_dnn`` in your environment.
As dnn_conv3d has a gradient defined, you can also use it manually.
- :func:`conv3d2d <theano.tensor.nnet.conv3d2d.conv3d>`
Another conv3d implementation that uses conv2d with data reshaping.
It is faster in some cases than conv3d, and works on the GPU.
It flips the kernel.
It is faster in some cases than conv3d. It flips the kernel.
.. autofunction:: theano.tensor.nnet.conv2d
.. autofunction:: theano.tensor.nnet.conv2d_transpose
.. autofunction:: theano.tensor.nnet.conv3d
.. autofunction:: theano.sandbox.cuda.fftconv.conv2d_fft
.. autofunction:: theano.tensor.nnet.Conv3D.conv3D
.. autofunction:: theano.sandbox.cuda.fftconv.conv3d_fft
.. autofunction:: theano.tensor.nnet.conv3d2d.conv3d
.. autofunction:: theano.tensor.nnet.conv.conv2d
...
...
doc/requirements.inc
View file @ 78657991
...
...
@@ -107,14 +107,3 @@ Install and configure the GPU drivers (recommended)
* Add the 'lib' subdirectory (and/or 'lib64' subdirectory if you have
  a 64-bit OS) to your ``$LD_LIBRARY_PATH`` environment variable.

3. Set Theano's config flags

To use the GPU you need to define the *cuda root*. You can do it in
one of the following ways:

* Define a $CUDA_ROOT environment variable to equal the cuda root
  directory, as in ``CUDA_ROOT=/path/to/cuda/root``, or
* add a ``cuda.root`` flag to :envvar:`THEANO_FLAGS`, as in
  ``THEANO_FLAGS='cuda.root=/path/to/cuda/root'``, or
* add a [cuda] section to your .theanorc file containing the option
  ``root = /path/to/cuda/root``.
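For illustration, the third option corresponds to a ``.theanorc`` fragment like the following (``/path/to/cuda/root`` is the placeholder used above, to be replaced by your actual CUDA installation path):

```cfg
# .theanorc: point Theano at the CUDA toolkit installation.
[cuda]
root = /path/to/cuda/root
```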
doc/troubleshooting.txt
...
...
@@ -120,7 +120,7 @@ some test fails on your machine, you are encouraged to tell us what went
wrong on the ``theano-users@googlegroups.com`` mailing list.
.. warning::
Theano's tests should **NOT** be run with ``device=cuda`` or ``device=gpu``
Theano's tests should **NOT** be run with ``device=cuda``
or they will fail. The tests automatically use the gpu, if any, when
needed. If you don't want Theano to ever use the gpu when running tests,
you can set :attr:`config.device` to ``cpu`` and
...
...
@@ -137,24 +137,22 @@ CPU and GPU memory usage.
Could speed up and lower memory usage:
- :ref:`cuDNN <libdoc_cuda_dnn>` default cuDNN convolution uses less
- :ref:`cuDNN <libdoc_gpuarray_dnn>` default cuDNN convolution uses less
memory than the Theano version. But some flags allow it to use more
memory. GPU only.
- Multi-GPU support, available shortly.
Could raise memory usage but speed up computation:
- :attr:`config.gpuarray.preallocate` =1 # Preallocates the GPU memory for the new backend(:ref:`gpuarray`)
and then manages it in a smart way. Does not raise much the memory usage, but if
you are at the limit of GPU memory available you might need to specify a
lower value. GPU only.
- :attr:`config.lib.cnmem` =1 # Equivalent on the old backend (:ref:`cuda`). GPU only.
- :attr:`config.gpuarray.preallocate` = 1 # Preallocates the GPU memory
and then manages it in a smart way. Does not raise memory usage much,
but if you are at the limit of available GPU memory you might
need to specify a lower value. GPU only.
- :attr:`config.allow_gc` =False
- :attr:`config.optimizer_excluding` =low_memory, GPU only for now.
Could lower the memory usage, but raise computation time:
- :attr:`config.scan.allow_gc` =True # Probably not a significant slowdown if config.lib.cnmem is used.
- :attr:`config.scan.allow_gc` =True # Probably not a significant slowdown on the GPU if the memory cache is not disabled.
- :attr:`config.scan.allow_output_prealloc` =False
- Use :func:`batch_normalization()
<theano.tensor.nnet.bn.batch_normalization>`. It uses less memory
...
...
doc/tutorial/aliasing.txt
...
...
@@ -211,15 +211,16 @@ be costly. Here are a few tips to ensure fast and efficient use of GPU memory a
*Solution*: upgrade to a recent version of Theano (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.
* It is also worth mentioning that, current GPU copying routines support only contiguous memory.
So Theano must make the value you provide *C-contiguous* prior to copying it.
This can require an extra copy of the data on the host.
* It is also worth mentioning that, current GPU copying routines
support only contiguous memory. So Theano must make the value you
provide *C-contiguous* prior to copying it. This can require an
extra copy of the data on the host.
*Solution*: make sure that the value
you assign to a CudaNdarraySharedVariable is *already* *C-contiguous*.
you assign to a GpuArraySharedVariable is *already* *C-contiguous*.
(Further information on the current implementation of the GPU version of ``set_value()`` can be found
here: :ref:`libdoc_cuda_var`)
here: :ref:`libdoc_gpuarray_var`)
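To illustrate the C-contiguity requirement on the host side, here is a small NumPy-only sketch (the ``set_value()`` call itself is omitted): a transposed or sliced array is typically a view that is not C-contiguous, and ``numpy.ascontiguousarray`` makes a compact copy that can be assigned without triggering an extra copy inside Theano.

```python
import numpy as np

a = np.ones((4, 6), dtype=np.float32)
print(a.flags['C_CONTIGUOUS'])   # freshly allocated arrays are C-contiguous

b = a.T                          # a transpose is a view with swapped strides
print(b.flags['C_CONTIGUOUS'])   # not C-contiguous

c = np.ascontiguousarray(b)      # compact copy, safe to pass to set_value()
print(c.flags['C_CONTIGUOUS'])
```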
.. _borrowfunction:
...
...
doc/tutorial/examples.txt
...
...
@@ -508,10 +508,9 @@ There are :ref:`other distributions implemented <libdoc_tensor_raw_random>`.
Other Implementations
---------------------
There are 2 other implementations based on :ref:`MRG31k3p
<libdoc_rng_mrg>` and :class:`CURAND <theano.sandbox.cuda.rng_curand>`.
The RandomStream only work on the CPU, MRG31k3p
work on the CPU and GPU. CURAND only work on the GPU.
There is another implementation based on :ref:`MRG31k3p
<libdoc_rng_mrg>`.
RandomStream only works on the CPU, while MRG31k3p works on both the CPU and GPU.
.. note::
...
...
doc/tutorial/gpu_data_convert.txt
deleted 100644 → 0 (view file @ 10432646)
.. _gpu_data_convert:
===================================
PyCUDA/CUDAMat/Gnumpy compatibility
===================================
PyCUDA
======
Currently, PyCUDA and Theano have different objects to store GPU
data. The two implementations do not support the same set of features.
Theano's implementation is called *CudaNdarray* and supports
*strides*. It also only supports the *float32* dtype. PyCUDA's implementation
is called *GPUArray* and doesn't support *strides*. However, it can deal with
all NumPy and CUDA dtypes.
We are currently working on having the same base object for both that will
also mimic Numpy. Until this is ready, here is some information on how to
use both objects in the same script.
Transfer
--------
You can use the ``theano.misc.pycuda_utils`` module to convert GPUArray to and
from CudaNdarray. The functions ``to_cudandarray(x, copyif=False)`` and
``to_gpuarray(x)`` return a new object that occupies the same memory space
as the original. Otherwise it raises a *ValueError*. Because GPUArrays don't
support strides, if the CudaNdarray is strided, we could copy it to
have a non-strided copy. The resulting GPUArray won't share the same
memory region. If you want this behavior, set ``copyif=True`` in
``to_gpuarray``.
Compiling with PyCUDA
---------------------
You can use PyCUDA to compile CUDA functions that work directly on
CudaNdarrays. Here is an example from the file ``theano/misc/tests/test_pycuda_theano_simple.py``:
.. code-block:: python
import sys
import numpy
import theano
import theano.sandbox.cuda as cuda_ndarray
import theano.misc.pycuda_init
import pycuda
import pycuda.driver as drv
import pycuda.gpuarray
def test_pycuda_theano():
"""Simple example with pycuda function and Theano CudaNdarray object."""
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(100).astype(numpy.float32)
b = numpy.random.randn(100).astype(numpy.float32)
# Test with Theano object
ga = cuda_ndarray.CudaNdarray(a)
gb = cuda_ndarray.CudaNdarray(b)
dest = cuda_ndarray.CudaNdarray.zeros(a.shape)
multiply_them(dest, ga, gb,
block=(400, 1, 1), grid=(1, 1))
assert (numpy.asarray(dest) == a * b).all()
Theano Op using a PyCUDA function
---------------------------------
You can use a GPU function compiled with PyCUDA in a Theano op:
.. code-block:: python
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda
class PyCUDADoubleOp(theano.Op):
__props__ = ()
def make_node(self, inp):
inp = cuda.basic_ops.gpu_contiguous(
cuda.basic_ops.as_cuda_ndarray_variable(inp))
assert inp.dtype == "float32"
return theano.Apply(self, [inp], [inp.type()])
def make_thunk(self, node, storage_map, _, _2, impl=None):
mod = SourceModule("""
__global__ void my_fct(float * i0, float * o0, int size) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if(i<size){
o0[i] = i0[i] * 2;
}
}""")
pycuda_fct = mod.get_function("my_fct")
inputs = [ storage_map[v] for v in node.inputs]
outputs = [ storage_map[v] for v in node.outputs]
def thunk():
z = outputs[0]
if z[0] is None or z[0].shape!=inputs[0][0].shape:
z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
grid = (int(numpy.ceil(inputs[0][0].size / 512.)),1)
pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
block=(512, 1, 1), grid=grid)
thunk.lazy = False
return thunk
CUDAMat
=======
There are functions for conversion between CUDAMat objects and Theano's CudaNdarray objects.
They obey the same principles as Theano's PyCUDA functions and can be found in
``theano.misc.cudamat_utils.py``.
.. TODO: this statement is unclear:
WARNING: There is a peculiar problem associated with stride/shape with those converters.
In order to work, the test needs a *transpose* and *reshape*...
Gnumpy
======
There are conversion functions between Gnumpy *garray* objects and Theano CudaNdarray objects.
They are also similar to Theano's PyCUDA functions and can be found in ``theano.misc.gnumpy_utils.py``.
doc/tutorial/index.txt
...
...
@@ -70,7 +70,6 @@ Further readings
../extending/graphstructures
loading_and_saving
gpu_data_convert
aliasing
python-memory-management
multi_cores
...
...
doc/tutorial/profiling.txt
...
...
@@ -82,21 +82,6 @@ Here is an example output when we disable some Theano optimizations to
give you a better idea of the difference between sections. With all
optimizations enabled, there would be only one op left in the graph.
.. note::
To profile the peak memory usage on the GPU you need to do::
* In the file theano/sandbox/cuda/cuda_ndarray.cu, set the macro
COMPUTE_GPU_MEM_USED to 1.
* Then call theano.sandbox.cuda.theano_allocated()
It returns a tuple with two ints. The first is the current GPU
memory allocated by Theano. The second is the peak GPU memory
that was allocated by Theano.
Do not always enable this, as this slows down memory allocation and
free. As this slows down the computation, this will affect speed
profiling. So don't use both at the same time.
to run the example:
THEANO_FLAGS=optimizer_excluding=fusion:inplace,profile=True python doc/tutorial/profiling_example.py
...
...
doc/tutorial/using_gpu.txt
...
...
@@ -14,16 +14,14 @@ about how to carry out those computations. One of the ways we take
advantage of this flexibility is in carrying out calculations on a
graphics card.
There are two ways currently to use a gpu, one that should support any OpenCL
device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend that
only supports NVIDIA cards (:ref:`cuda`).
Using the GPU in Theano is as simple as setting the ``device`` configuration
flag to ``device=cuda`` (or ``device=gpu`` for the old backend). You can optionally target a specific gpu by specifying
the number of the gpu as in e.g. ``device=cuda2``. You also need to set the
default floating point precision.
For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=cuda,floatX=float32'``.
You can also set these options in the .theanorc file's ``[global]`` section:
Using the GPU in Theano is as simple as setting the ``device``
configuration flag to ``device=cuda``. You can optionally target a
specific gpu by specifying the number of the gpu as in
e.g. ``device=cuda2``. It is also encouraged to set the floating
point precision to float32 when working on the GPU as that is usually
much faster. For example:
``THEANO_FLAGS='device=cuda,floatX=float32'``. You can also set these
options in the .theanorc file's ``[global]`` section:
.. code-block:: cfg
...
...
@@ -31,15 +29,10 @@ You can also set these options in the .theanorc file's ``[global]`` section:
device = cuda
floatX = float32
.. warning::
The old CUDA backend will be deprecated soon, in favor of the new libgpuarray
backend.
.. note::
* If your computer has multiple GPUs and you use ``device=cuda``,
  the driver selects the one to use (usually gpu 0).
* If your computer has multiple GPUs and you use ``device=cuda``,
  the driver selects the one to use (usually cuda0).
* You can use the program ``nvidia-smi`` to change this policy.
* By default, when ``device`` indicates preference for GPU computations,
Theano will fall back to the CPU if there is a problem with the GPU.
...
...
@@ -65,14 +58,8 @@ remainder of this section, whatever compute device you are using will
be referred to as GPU.
.. note::
GpuArray backend uses ``config.gpuarray.preallocate`` for GPU memory allocation.
For the old backend, please see ``config.lib.cnmem``
.. warning::
If you want to use the new GpuArray backend, make sure to have the
development version of Theano installed. The 0.8.X releases have not
been optimized to work correctly with the new backend.
GpuArray backend uses ``config.gpuarray.preallocate`` for GPU memory
allocation.
.. warning::
...
...
@@ -140,9 +127,10 @@ input *x* is stored on the GPU.
Used the cpu
$ THEANO_FLAGS=device=cuda0 python gpu_tutorial1.py
Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.202734 seconds
Looping 1000 times took 1.697514 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
...
...
@@ -197,9 +185,10 @@ The output is
:options: +ELLIPSIS, +SKIP
$ THEANO_FLAGS=device=cuda0 python gpu_tutorial2.py
Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.088381 seconds
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
...
...
@@ -208,9 +197,10 @@ The output is
.. code-block:: none
$ THEANO_FLAGS=device=cuda0 python gpu_tutorial2.py
Mapped name None to device cuda0: GeForce GTX 680 (cuDNN version 5004)
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
Looping 1000 times took 0.089194 seconds
Looping 1000 times took 0.040277 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
...
...
@@ -238,8 +228,8 @@ device, and also as we refine our implementation:
* In general, matrix multiplication, convolution, and large element-wise
operations can be accelerated a lot (5-50x) when arguments are large enough
to keep 30 processors busy.
* Indexing, dimension-shuffling and constant-time reshaping will be
  equally fast on GPU as on CPU.
* Indexing, dimension-shuffling and constant-time reshaping will be
  equally fast on GPU as on CPU.
* Summation over rows/columns of tensors can be a little slower on the
GPU than on the CPU.
* Copying of large quantities of data to and from a device is relatively slow,
...
...
@@ -273,23 +263,22 @@ Tips for Improving Performance on GPU
* Minimize transfers to the GPU device by using ``shared`` variables
to store frequently-accessed data (see :func:`shared()<shared.shared>`).
When using the GPU, tensor ``shared`` variables are stored on
the GPU by default to eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try running your script with
``profile=True`` flag. This should print some timing information at program
termination. Is time being used sensibly? If an op or Apply is
taking more time than its share, then if you know something about GPU
programming, have a look at how it's implemented in theano.gpuarray.
Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and
Xs(X%) in transfer op*. This can tell you if not enough of your graph is
on the GPU or if there is too much memory transfer.
the GPU by default to eliminate transfer time for GPU ops using those
variables.
* If you aren't happy with the performance you see, try running your
script with ``profile=True`` flag. This should print some timing
information at program termination. Is time being used sensibly? If
an op or Apply is taking more time than its share, then if you know
something about GPU programming, have a look at how it's implemented
in theano.gpuarray. Check the line similar to *Spent Xs(X%) in cpu
op, Xs(X%) in gpu op and Xs(X%) in transfer op*. This can tell you
if not enough of your graph is on the GPU or if there is too much
memory transfer.
* To investigate whether all the Ops in the computational graph are
running on GPU, it is possible to debug or check your code by providing
a value to `assert_no_cpu_op` flag, i.e. `warn`, for warning, `raise` for
raising an error or `pdb` for putting a breakpoint in the computational
graph if there is a CPU Op.
* Please note that ``config.lib.cnmem`` and ``config.gpuarray.preallocate``
controls GPU memory allocation when using (:ref:`cuda`) and
(:ref:`gpuarray`) as theano backends respectively.
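For example, the preallocation flag above can be set persistently in ``.theanorc``; a minimal sketch, using the flag-to-section mapping described earlier (``=1`` preallocates the GPU memory, as explained in the tips above):

```cfg
# .theanorc: preallocate GPU memory for the gpuarray backend.
[gpuarray]
preallocate = 1
```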
.. _gpu_async:
...
...
@@ -311,9 +300,9 @@ when doing benchmarks.
Changing the Value of Shared Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To change the value of a ``shared`` variable, e.g. to provide new data
to processes,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
To change the value of a ``shared`` variable, e.g. to provide new data
to processes, use ``shared_variable.set_value(new_value)``. For a lot
more detail about this,
see :ref:`aliasing`.
Exercise
~~~~~~~~
...
...
@@ -389,50 +378,22 @@ Consider again the logistic regression:
prediction on D
...
Modify and execute this example to run on GPU with ``floatX=float32`` and
time it using the command line ``time python file.py``. (Of course, you may use some of your answer
to the exercise in section :ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Modify and execute this example to run on GPU with ``floatX=float32``
and time it using the command line ``time python file.py``. (Of
course, you may use some of your answer to the exercise in section
:ref:`Configuration Settings and Compiling Mode<using_modes>`.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ``profile=True`` flag.)
What can be done to further increase the speed of the GPU version? Put your ideas to test.
What can be done to further increase the speed of the GPU version? Put
your ideas to test.
:download:`Solution<using_gpu_solution_1.py>`
-------------------------------------------
.. _cuda:
CUDA backend
------------
If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
We provide installation instructions for :ref:`Linux <gpu_linux>`,
:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
The old CUDA backend can be activated using the flags ``device=gpu`` or
``device=gpu{0,1,...}``
.. note::
* CUDA backend uses ``config.lib.cnmem`` for GPU memory allocation. For the new backend(:ref:`gpuarray`), please see ``config.gpuarray.preallocate``
* Only 32 bit floats are supported.
* ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
* There is a limit of one GPU per process.
* Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
* ``Cast`` inputs before storing them into a ``shared`` variable.
* Circumvent the automatic cast of *int32* with *float32* to *float64*:
* Insert manual cast in your code or use *[u]int{8,16}*.
* Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
* Notice that a new casting mechanism is being developed.
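The int32/float32 upcast mentioned above follows the same promotion rule as NumPy, which can be checked on the host (an illustrative NumPy-only sketch, not Theano code):

```python
import numpy as np

i = np.array([1, 2], dtype=np.int32)
f = np.array([0.5, 1.5], dtype=np.float32)

# Mixing int32 with float32 promotes the result to float64 ...
mixed = i + f
print(mixed.dtype)

# ... while a manual cast keeps the computation in float32.
cast = i.astype(np.float32) + f
print(cast.dtype)
```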
-------------------------------------------
Software for Directly Programming a GPU
---------------------------------------
...
...
doc/tutorial/using_multi_gpu.txt
...
...
@@ -64,8 +64,8 @@ defined. This will look like this:
.. code-block:: bash
$ THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'
Mapped name dev0 to device cuda0: GeForce GTX TITAN X
Mapped name dev1 to device cuda1: GeForce GTX TITAN X
Mapped name dev0 to device cuda0: GeForce GTX TITAN X (0000:09:00.0)
Mapped name dev1 to device cuda1: GeForce GTX TITAN X (0000:06:00.0)
If you don't have enough GPUs for a certain model, you can assign the
...
...
theano/configdefaults.py
...
...
@@ -176,12 +176,6 @@ AddConfigVar(
    in_c_key=False)

AddConfigVar('enable_initial_driver_test',
             "Tests the nvidia driver when a GPU device is initialized.",
             BoolParam(True, allow_override=False),
             in_c_key=False)

AddConfigVar('gpuarray.sync',
             """If True, every op will make sure its work is done before
             returning. Setting this to True will slow down execution,
...
...