Commit 5432363f authored by Olivier Delalleau

Merged

Trunk since last release
------
* Sparse types are now supported by the shape op, and the ShapeFeature optimizer works correctly with them.
* Fuse GpuElemwise more often (in the case where there are so many inputs that fusing them all would bust the 256 byte limit on parameters to a GPU function).
* Speed up gemv by working around scipy's gemv slowness when the matrix is in C order (the default).
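The C-order vs Fortran-order distinction behind this gemv workaround can be sketched with numpy (illustrative only; the actual fix lives inside Theano's gemv wrapper):

```python
import numpy as np

# A C-ordered (row-major) matrix, numpy's default layout.
a_c = np.zeros((3, 4), order='C')
# The same data in Fortran (column-major) order, which BLAS gemv prefers.
a_f = np.asfortranarray(a_c)

assert a_c.flags['C_CONTIGUOUS'] and not a_c.flags['F_CONTIGUOUS']
assert a_f.flags['F_CONTIGUOUS']

# A C-ordered matrix can be viewed as the transpose of a Fortran-ordered
# one without copying, which is one way to avoid the slow path.
assert a_c.T.flags['F_CONTIGUOUS']
```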
Theano 0.3 (2010-11-23)
-----------------------
......
......@@ -19,7 +19,8 @@ instructions below for detailed installation steps):
Linux, Mac OS X or Windows operating system
We develop mainly on 64-bit Linux machines. 32-bit architectures are
not well-tested. Note that GPU computing does not yet work under
Windows.
Python_ >= 2.4
The development package (``python-dev`` or ``python-devel``
......@@ -330,7 +331,7 @@ Mac
.. code-block:: bash
$ sudo port install gcc44 py26-scipy mercurial python_select
This will install all the required Theano dependencies. Note that
compiling gcc takes significant time (hours)! SciPy depends on ATLAS (a
......@@ -344,13 +345,13 @@ Mac
packages are updated quite frequently.
- In order to use the MacPorts version of python, you might
need to explicitly select it with ``sudo python_select python26``. The
reason this is necessary is because you might have an Apple-provided python
(via, for example, an Xcode installation). After performing this step, you
should check that the symbolic link provided by ``which python`` points to
the MacPorts python. For instance, on Snow Leopard with the latest MacPorts,
the output of ``which python`` is ``/opt/local/bin/python`` and this symbolic
link points to ``/opt/local/bin/python2.6``. When executing ``sudo
python_select python26-apple`` (which you should **not** do), the link
points to ``/usr/bin/python2.6``.
......@@ -364,7 +365,7 @@ Mac
- Please follow the same procedure with ``numpy``.
- Put ``export PYTHONPATH=/opt/local/lib/python2.6/site-packages:$PYTHONPATH``
in your ``.bashrc`` in order to include your MacPorts Python packages
(NumPy, SciPy) in Python's path.
......@@ -469,7 +470,7 @@ components as in Python(x,y) that are required by Theano, follow these steps:
sub-directory are in your system path. This may be done by
modifying the global ``PATH`` Windows environment variables, or by creating
a ``.profile`` file in your MinGW home, containing a line like
``export PATH=$PATH:/c/Python26:/c/Python26/Scripts`` (note that the latter
will work only when you run Theano from a MinGW shell).
- In order to run Theano's test-suite, you will need `nose
......@@ -661,10 +662,14 @@ follows:
Using the GPU
~~~~~~~~~~~~~
At this point, GPU computing does not work under Windows. The current main
issue is that the compilation commands used under Linux / MacOS to create
and use a CUDA-based shared library with the nvcc compiler do not work with
Windows DLLs. If anyone can figure out the proper compilation steps for
Windows, please let us know on the `theano-dev`_ mailing list.
The instructions below should at least get you started, so that you can
reproduce the above-mentioned issue.
These instructions are for the 32-bit version of Python (the one that comes
with Python(x,y) is 32-bit).
......@@ -679,44 +684,47 @@ use a compilation directory located somewhere else:
[global]
base_compiledir=path_to_a_directory_without_such_characters
Then
1) Install CUDA driver (32-bit on 32-bit Windows, idem for 64-bit).
2) Install CUDA toolkit 32-bit (even if your computer is 64-bit;
   must match the Python installation version).
3) Install CUDA SDK 32-bit.
4) Test some pre-compiled examples from the SDK.
5) Download Visual Studio 2008 Express (free; VS2010 is not supported by nvcc 3.1,
   and VS2005 is not available for download but is supported by nvcc; the
   non-free version should work too).
6) Follow the instructions in the GettingStartedWindows.pdf file from the CUDA
   web site to compile CUDA code with VS2008. If that does not work, you will
   not be able to compile GPU code with Theano.
7) Edit your Theano configuration file to add lines like the following
   (make sure these paths match your own specific installation):

   .. code-block:: cfg

      [cuda]
      nvccflags=-LC:\Python26\libs

      [nvcc]
      compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin

8) In Python do: ``import theano.sandbox.cuda``. This will compile the
   first CUDA file, and no error should occur.
9) Then run the Theano CUDA test files with nosetests from the
   ``theano/sandbox/cuda/tests`` subdirectory. In the current version of
   Theano, this should fail with an error like:

   .. code-block:: bash

      NVCC: nvcc fatal: Don't know what to do with
      'C:/CUDA/compile/tmpmkgqx6/../cuda_ndarray/cuda_ndarray.pyd'
Generating the documentation
----------------------------
......@@ -739,3 +747,4 @@ The PDF of the documentation is ``html/theano.pdf``.
.. _theano-users: http://groups.google.com/group/theano-users?pli=1
.. _theano-dev: http://groups.google.com/group/theano-dev?pli=1
......@@ -11,6 +11,7 @@ If you're feeling ambitious, go fix some `pylint
.. toctree::
:maxdepth: 2
release
dev_start_guide
lisa_labo
mammouth
......
......@@ -20,6 +20,7 @@ Types and Ops that you can use to build and compile expression graphs.
scalar/index
gof/index
scan
sandbox/index
There are also some top-level imports that you might find more convenient:
......
.. _libdoc_sandbox_cuda:
===========================================
:mod:`sandbox.cuda` -- The CUDA GPU backend
===========================================
.. module:: sandbox.cuda
:platform: Unix, Windows
:synopsis: Code for GPU programming
.. moduleauthor:: LISA
.. toctree::
:maxdepth: 1
var
type
.. ../../../../theano/sandbox/cuda/type.py
.. ../../../../theano/sandbox/cuda/var.py
.. ../../../../theano/sandbox/cuda/
.. _libdoc_cuda_type:
======================================================================
:mod:`sandbox.cuda.type` -- The Type object for CUDA-allocated arrays
======================================================================
.. module:: sandbox.cuda.type
:platform: Unix, Windows
:synopsis: The Type object for CUDA-allocated arrays
.. moduleauthor:: LISA
API
===
.. ../../../../theano/sandbox/cuda/type.py
.. ../../../../theano/sandbox/cuda/var.py
.. ../../../../theano/sandbox/cuda/
.. _libdoc_cuda_var:
===================================================================
:mod:`sandbox.cuda.var` -- The Variables for CUDA-allocated arrays
===================================================================
.. module:: sandbox.cuda.var
:platform: Unix, Windows
:synopsis: The Variables object for CUDA-allocated arrays
.. moduleauthor:: LISA
API
===
.. autoclass:: theano.sandbox.cuda.var.CudaNdarraySharedVariable
:members: get_value, set_value
.. _libdoc_sandbox:
==============================================================
:mod:`sandbox` -- Experimental Code
==============================================================
.. module:: sandbox
:platform: Unix, Windows
:synopsis: Experimental code
.. moduleauthor:: LISA
.. toctree::
:maxdepth: 1
cuda/index
......@@ -142,7 +142,7 @@ transparent. But when you are using a GPU (or in future perhaps a remote machin
is not the internal representation of your data.
If you really want Theano to return its internal representation *and never copy it*
then you should use the ``return_internal_type=True`` argument to
``get_value``. It will never cast the internal object (it always returns in
constant time), but might return various datatypes depending on contextual
factors (e.g. the compute device, the dtype of the numpy array).
......@@ -154,6 +154,12 @@ It is possible to use ``borrow=False`` in conjunction with
``return_internal_type=True``, which will return a deep copy of the internal object.
This is primarily for internal debugging, not for typical use.
So that Theano can transparently apply different kinds of optimization, the
policy is that ``get_value()``, by default, returns the same object type it
received when the shared variable was created. So if you manually created
data on the GPU and created a shared variable on the GPU with this data,
``get_value`` will always return GPU data, even when
``return_internal_type=False``.
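As a minimal sketch of these borrow semantics, using a plain numpy array as a stand-in for the shared variable's internal storage (the names ``internal`` and ``get_value`` here are illustrative, not Theano's actual implementation):

```python
import numpy as np

internal = np.zeros((2, 3), dtype='float32')  # stand-in for shared storage

def get_value(borrow=False):
    # Mimics the documented behaviour: copy by default, alias with borrow=True.
    return internal if borrow else internal.copy()

copied = get_value()              # safe: a fresh copy
aliased = get_value(borrow=True)  # fast: the same memory

assert copied is not internal
assert aliased is internal
aliased[0, 0] = 1.0               # modifying the alias is visible internally
assert internal[0, 0] == 1.0
```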
*Take home message:*
It is safe (and sometimes much faster) to use ``get_value(borrow=True)`` when
......@@ -182,6 +188,30 @@ This pattern works regardless of the compute device, and when the compute device
makes it possible to expose Theano's internal variables without a copy, then it
goes as fast as an in-place update.
When shared variables are allocated on the GPU, the transfers to and from GPU device memory can
be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:
* Prior to Theano 0.3.1, set_value did not work in-place on the GPU. This meant that sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory
unnecessarily. *Solution*: update to a newer version of Theano.
* If you are going to swap several chunks of data in and out of a shared variable repeatedly,
you will want to reuse the memory that you allocated the first time if possible - it is both
faster and more memory efficient.
*Solution*: upgrade to a recent version of Theano (>0.3.0) and consider padding your source
data to make sure that every chunk is the same size.
* It is also worth mentioning that current GPU copying routines support only contiguous memory.
So Theano must make the ``value`` you provide ``c_contiguous`` prior to copying it.
This can require an extra copy of the data on the host. *Solution*: make sure that the value
you assign to a CudaNdarraySharedVariable is *already* ``c_contiguous``.
(Further remarks on the current implementation of the GPU version of set_value() can be found
here: :ref:`libdoc_cuda_var`)
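A host-side sketch of the padding tip above, assuming hypothetical variable-length chunks of float32 data (numpy only; the padded arrays would then be passed to ``set_value``):

```python
import numpy as np

# Hypothetical chunks of differing lengths that will be swapped in and out.
chunks = [np.ones((5, 3), dtype='float32'), np.ones((7, 3), dtype='float32')]
max_rows = max(c.shape[0] for c in chunks)

# Pad every chunk to a common shape so the GPU buffer can be reused
# in-place by set_value, and make each chunk c_contiguous up front.
padded = [np.ascontiguousarray(
              np.vstack([c, np.zeros((max_rows - c.shape[0], 3),
                                     dtype='float32')]))
          for c in chunks]

assert all(p.shape == (max_rows, 3) for p in padded)
assert all(p.flags['C_CONTIGUOUS'] for p in padded)
```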
Retrieving and assigning via the .value property
------------------------------------------------
......
......@@ -21,7 +21,7 @@ Toolkit installs a folder on your computer with subfolders *bin*, *lib*,
*include*, and some more too. (Sanity check: The *bin* subfolder should contain an *nvcc*
program which is the compiler for GPU code.) This folder is called the *cuda
root* directory.
On Linux or OS X >= 10.4, you must add the 'lib' subdirectory (and/or 'lib64' subdirectory if you have a 64-bit Linux
computer) to your ``$LD_LIBRARY_PATH`` environment variable.
......@@ -274,3 +274,10 @@ Tips for improving performance on GPU
that can tell you if not enough of your graph is on the GPU, or if there
is too much memory transfer.
Changing the value of shared variables
--------------------------------------
To change the value of a shared variable, e.g. to provide new data to process,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.
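A sketch of the intended usage pattern, with a numpy computation standing in for the compiled Theano function (the names here are illustrative):

```python
import numpy as np

# New data to process on each iteration of a hypothetical training loop.
batches = [np.full((2, 2), i, dtype='float32') for i in range(3)]

results = []
for batch in batches:
    # With a real shared variable: shared_x.set_value(batch)
    # then call the compiled function; here we just sum as a stand-in.
    results.append(float(batch.sum()))

assert results == [0.0, 4.0, 8.0]
```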
......@@ -138,7 +138,9 @@ class SharedVariable(Variable):
def filter_update(self, update):
"""When this shared variable is updated by a pfunc, the update value will be run through this function.
"""
When this shared variable is updated by a pfunc, the update value will be run through this function.
This is a good spot to cast or convert the update expression as necessary.
Default behaviour is to return `update` unmodified if it is a Variable, otherwise to create a SharedVariable for it by calling ``shared(update)``.
......
......@@ -7,10 +7,12 @@ import re
from theano.configparser import config, AddConfigVar, StrParam
def default_compiledirname():
platform_id = '-'.join([
    platform.platform(),
    platform.processor(),
    platform.python_version()])
platform_id = re.sub(r"[\(\)\s]+", "_", platform_id)
return 'compiledir_' + platform_id
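The directory name this produces can be previewed standalone (same logic as above, reproduced here for illustration):

```python
import platform
import re

platform_id = '-'.join([
    platform.platform(),
    platform.processor(),
    platform.python_version()])
# Parentheses and whitespace are replaced so the name is filesystem-safe.
platform_id = re.sub(r"[\(\)\s]+", "_", platform_id)
name = 'compiledir_' + platform_id

assert name.startswith('compiledir_')
assert '(' not in name and ')' not in name and ' ' not in name
```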
def is_valid_compiledir(path):
if not os.access(path, os.R_OK | os.W_OK):
......
......@@ -163,19 +163,16 @@ class Container(object):
if value is None:
self.storage[0] = None
return
if self.type.__class__.__name__ == "CudaNdarrayType" and isinstance(value,numpy.ndarray):
#The filter method of CudaNdarray alloc a new memory region on the gpu.
#The ref count will be decremented after that.
#That cause 2 region allocated at the same time!
#We decrement the memory reference conter now to try to lower the memory usage.
self.storage[0] = None
kwargs = {}
if self.strict:
kwargs['strict'] = True
if self.allow_downcast is not None:
kwargs['allow_downcast'] = self.allow_downcast
self.storage[0] = self.type.filter(value, **kwargs)
if hasattr(self.type,'filter_inplace'):
self.storage[0] = self.type.filter_inplace(value, self.storage[0], **kwargs)
else:
self.storage[0] = self.type.filter(value, **kwargs)
except Exception, e:
e.args = e.args + (('Container name "%s"' % self.name),)
......
......@@ -89,7 +89,7 @@ if __name__ == "__main__":
* numpy with ATLAS from the distribution (FC9) package (1 thread)
* manually compiled numpy and ATLAS with 2 threads
* goto with 1, 2, 4 and 8 threads.
Xeon Xeon Xeon Core2 i7
lib/nb threads E5345 E5430 E5450 E8500 930
numpy_FC9_atlas/1 39.2s 35.0s 30.7s 29.6s 21.5s
......
......@@ -33,8 +33,8 @@ else:
except NotImplementedError:
b_sparse = False
a_cuda = False
b_cuda = False
if a.__class__.__name__ == "CudaNdarray":
a_cuda = True
if b.__class__.__name__ == "CudaNdarray":
......
......@@ -37,17 +37,18 @@ def test_may_share_memory():
# test that it raises an error when needed.
for a_, b_, rep in [(a, (0,), False), (a, 1, False), (a, None, False)]:
    assert may_share_memory(a_, b_, False) == rep
    assert may_share_memory(b_, a_, False) == rep
    try:
        may_share_memory(a_, b_)
        raise Exception("An error was expected")
    except TypeError:
        pass
    try:
        may_share_memory(b_, a_)
        raise Exception("An error was expected")
    except TypeError:
        pass
if scipy_imported:
def test_may_share_memory_scipy():
......@@ -64,14 +65,18 @@ if scipy_imported:
assert may_share_memory(a_, b_) == rep
assert may_share_memory(b_, a_) == rep

# test that it raises an error when needed.
for a_, b_, rep in [(a, (0,), False), (a, 1, False), (a, None, False)]:
    assert may_share_memory(a_, b_, False) == rep
    assert may_share_memory(b_, a_, False) == rep
    try:
        may_share_memory(a_, b_)
        raise Exception("An error was expected")
    except TypeError:
        pass
    try:
        may_share_memory(b_, a_)
        raise Exception("An error was expected")
    except TypeError:
        pass
......@@ -51,12 +51,11 @@ def set_cuda_disabled():
#cuda_ndarray compile and import
cuda_path = os.path.abspath(os.path.split(__file__)[0])
cuda_files = ('cuda_ndarray.cu', 'cuda_ndarray.cuh', 'conv_full_kernel.cu', 'conv_kernel.cu')
stat_times = [os.stat(os.path.join(cuda_path, cuda_file))[stat.ST_MTIME] for cuda_file in cuda_files]
date = max(stat_times)

cuda_ndarray_loc = os.path.join(config.compiledir, 'cuda_ndarray')
cuda_ndarray_so = os.path.join(cuda_ndarray_loc,
'cuda_ndarray.' + get_lib_extension())
compile_cuda_ndarray = True
......@@ -87,7 +86,7 @@ try:
'cuda_ndarray',
code,
location=cuda_ndarray_loc,
include_dirs=[cuda_path], libs=['cublas'])
from cuda_ndarray.cuda_ndarray import *
except Exception, e:
......@@ -105,17 +104,19 @@ if cuda_available:
cuda_available = False
cuda_initialization_error_message = e.message
# We must do these imports to be able to create the full doc when nvcc
# is not available.
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda.var import (CudaNdarrayVariable,
                                     CudaNdarrayConstant,
                                     CudaNdarraySharedVariable,
                                     float32_shared_constructor)

if cuda_available:
    # Check whether an old cuda_ndarray was loaded instead of the one we compiled!
    import cuda_ndarray.cuda_ndarray
    if cuda_ndarray_so != cuda_ndarray.cuda_ndarray.__file__:
        warning("WARNING: cuda_ndarray was loaded from",
                cuda_ndarray.cuda_ndarray.__file__,
                "This is not expected, as Theano should compile it "
                "automatically for you. Do you have a directory called "
                "cuda_ndarray in your LD_LIBRARY_PATH environment variable? "
                "If so, please remove it, as it is outdated!")

    shared_constructor = float32_shared_constructor
    import basic_ops
......
......@@ -1701,20 +1701,6 @@ class GpuSubtensor(tensor.Subtensor):
cdata = cdata[0]
out[0] = x.__getitem__(cdata)
class GpuIncSubtensor(tensor.IncSubtensor):
def make_node(self, x, y, *inputs):
assert isinstance(x.type, CudaNdarrayType)
......
......@@ -819,20 +819,46 @@ def test_shared_float32():
# Unregister
del theano.shared.constructors[-1]
def test_shared_cudandarray():
'''Test that we can create a CudaNdarraySharedVariable from a CudaNdarray'''
a = cuda.shared_constructor(cuda.CudaNdarray.zeros((2,3)))
assert isinstance(a.type, tcn.CudaNdarrayType)
import theano.tensor.tests.test_sharedvar
# This tests the case where the shared constructor receives a CudaNdarray as input
test_shared_options = theano.tensor.tests.test_sharedvar.makeSharedTester(
shared_constructor_ = tcn.shared_constructor,
dtype_ = 'float32',
get_value_borrow_true_alias_ = True,
shared_borrow_true_alias_ = True,  # True when the original value is already a CudaNdarray!
set_value_borrow_true_alias_ = True,
set_value_inplace_ = True,
set_casted_value_inplace_ = False,
shared_constructor_accept_ndarray_ = True,
internal_type_ = cuda_ndarray.CudaNdarray,
test_internal_type_ = lambda a: isinstance(a,cuda_ndarray.CudaNdarray),
theano_fct_ = theano.tensor.exp,
ref_fct_ = numpy.exp,
cast_value_ = cuda_ndarray.CudaNdarray,
op_by_matrix_ = True)
# This tests the case where the shared constructor receives an ndarray as input
test_shared_options2 = theano.tensor.tests.test_sharedvar.makeSharedTester(
shared_constructor_ = tcn.shared_constructor,
dtype_ = 'float32',
get_value_borrow_true_alias_ = False,
shared_borrow_true_alias_ = False,
set_value_borrow_true_alias_ = False,
set_value_inplace_ = True,
set_casted_value_inplace_ = True,
shared_constructor_accept_ndarray_ = True,
internal_type_ = cuda_ndarray.CudaNdarray,
test_internal_type_ = lambda a: isinstance(a,cuda_ndarray.CudaNdarray),
theano_fct_ = theano.tensor.exp,
ref_fct_ = numpy.exp,
cast_value_ = numpy.asarray,
op_by_matrix_ = True)
if __name__ == '__main__':
test_many_arg_elemwise()
......
......@@ -65,6 +65,48 @@ def test_softmax_optimizations():
assert env.outputs[0].owner.inputs[0].owner.op == cuda.host_from_gpu
assert env.outputs[0].owner.inputs[0].owner.inputs[0].owner.op == cuda.nnet.gpu_crossentropy_softmax_argmax_1hot_with_bias
def test_may_share_memory_cuda():
from theano.misc.may_share_memory import may_share_memory
a = cuda.CudaNdarray(numpy.zeros((3,4),dtype='float32'))
b = cuda.CudaNdarray(numpy.zeros((3,4),dtype='float32'))
na = numpy.zeros((3,4))
nb = numpy.zeros((3,4))
va = a.view()
vb = b.view()
ra = a.reshape((4,3))
rb = b.reshape((4,3))
# can't test the transpose, as assigning ta._strides is not implemented
#manual transpose of a
#ta = a.reshape((4,3))
#ta._strides = (ta._strides[1],ta._strides[0])#not implemented
#elem_size = numpy.zeros(0, dtype=a.dtype).dtype.itemsize
#ta.gpudata += ta.size*elem_size
for a_,b_,rep in [(a,a,True),(b,b,True),(a,b,False),
(a,na,False),(b,nb,False),(na,b,False),(nb,a,False),
(a,va,True),(b,vb,True),(va,b,False),(a,vb,False),
(a,ra,True),(b,rb,True),(ra,b,False),(a,rb,False),
]:
assert may_share_memory(a_,b_)==rep
assert may_share_memory(b_,a_)==rep
# test that it raises an error when needed.
for a_,b_,rep in [(a,(0,),False),(a,1,False),(a,None,False)]:
assert may_share_memory(a_,b_,False)==rep
assert may_share_memory(b_,a_,False)==rep
try:
may_share_memory(a_,b_)
raise Exception("An error was expected")
except TypeError:
pass
try:
may_share_memory(b_,a_)
raise Exception("An error was expected")
except TypeError:
pass
def test_grad_sqrt_sum():
"""
This triggered a bug in the past.
......
......@@ -8,10 +8,13 @@ from theano import Op, Type, Apply, Variable, Constant
from theano import tensor, config
from theano import scalar as scal
try:
    # We must do these imports to be able to create the full doc when nvcc
    # is not available.
    import cuda_ndarray.cuda_ndarray as cuda
    from theano.sandbox.cuda.nvcc_compiler import nvcc_module_compile_str
    import cuda_ndarray
except ImportError:
    pass
class CudaNdarrayType(Type):
......@@ -53,14 +56,18 @@ class CudaNdarrayType(Type):
self.dtype_specs() # error checking is done there
def filter(self, data, strict=False, allow_downcast=None):
    return self.filter_inplace(data, None, strict=strict, allow_downcast=allow_downcast)

def filter_inplace(self, data, old_data, strict=False, allow_downcast=None):
    if strict or allow_downcast or isinstance(data, cuda.CudaNdarray):
        return cuda.filter(data, self.broadcastable, strict, old_data)
else: # (not strict) and (not allow_downcast)
# Check if data.dtype can be accurately casted to self.dtype
if isinstance(data, numpy.ndarray):
up_dtype = scal.upcast(self.dtype, data.dtype)
if up_dtype == self.dtype:
return cuda.filter(data, self.broadcastable, strict, old_data)
else:
raise TypeError(
'%s, with dtype %s, cannot store a value of '
......@@ -75,10 +82,10 @@ class CudaNdarrayType(Type):
type(data) is float and
self.dtype==theano.config.floatX):
return cuda.filter(converted_data, self.broadcastable,
strict, old_data)
elif numpy.all(data == converted_data):
return cuda.filter(converted_data, self.broadcastable,
strict, old_data)
else:
raise TypeError(
'%s, with dtype %s, cannot store accurately value %s, '
......@@ -87,6 +94,7 @@ class CudaNdarrayType(Type):
% (self, self.dtype, data, converted_data, self.dtype),
data)
@staticmethod
def bound(a):
high = a.gpudata
......@@ -112,10 +120,11 @@ class CudaNdarrayType(Type):
if a.__class__ is b.__class__:
a_l, a_h = CudaNdarrayType.bound(a)
b_l, b_h = CudaNdarrayType.bound(b)
if b_l >= a_h or a_l >= b_h:
return False
return True
else:
    return False
@staticmethod
def values_eq(a, b):
......@@ -352,4 +361,8 @@ copy_reg.constructor(CudaNdarray_unpickler)
def CudaNdarray_pickler(cnda):
return (CudaNdarray_unpickler, (numpy.asarray(cnda),))
try:
    # In case cuda was not imported.
    copy_reg.pickle(cuda.CudaNdarray, CudaNdarray_pickler, CudaNdarray_unpickler)
except NameError:
    pass
......@@ -8,15 +8,18 @@ from theano import tensor
from theano.compile import SharedVariable
from theano.sandbox.cuda.type import CudaNdarrayType
try:
    # We must do these imports to be able to create the full doc when nvcc
    # is not available.
    from theano.sandbox.cuda import filter as type_support_filter
    from theano.sandbox.cuda.basic_ops import HostFromGpu, GpuFromHost
except ImportError:
    pass
class _operators(tensor.basic._tensor_py_operators):
"""Define a few properties and conversion methods for CudaNdarray Variables.
The default implementation of arithmetic operators is to build graphs of TensorType
variables.
The optimization pass (specialization) will insert pure GPU implementations.
This approach relieves the Cuda-Ops of having to deal with input argument checking and
......@@ -49,9 +52,34 @@ class CudaNdarrayConstant(Constant, _operators):
CudaNdarrayType.Constant = CudaNdarrayConstant
class CudaNdarraySharedVariable(SharedVariable, _operators):
"""
Shared Variable interface to CUDA-allocated arrays
"""
get_value_return_ndarray = True
def get_value(self, borrow=False, return_internal_type=False):
    """
    Return the value of this SharedVariable's internal array.

    :param borrow:
        permit the return of internal storage, when used in conjunction with
        ``return_internal_type=True``
    :param return_internal_type:
        True to return the internal ``cuda_ndarray`` instance rather than a
        ``numpy.ndarray`` (Default False)

    By default ``get_value()`` copies from the GPU to a ``numpy.ndarray``
    and returns that host-allocated array.

    ``get_value(False, True)`` will return a GPU-allocated copy of the
    original GPU array.

    ``get_value(True, True)`` will return the original GPU-allocated array
    without any copying.
    """
    if return_internal_type or not self.get_value_return_ndarray:
        # return a cuda_ndarray
        if borrow:
            return self.container.value
        else:
......@@ -60,6 +88,37 @@ class CudaNdarraySharedVariable(SharedVariable, _operators):
return numpy.asarray(self.container.value)
def set_value(self, value, borrow=False):
"""
Assign `value` to the GPU-allocated array.
:param borrow: ``True`` permits reusing `value` itself, ``False`` requires that this function
copies `value` into internal storage.
:note:
Prior to Theano 0.3.1, set_value did not work in-place on the GPU. This meant that sometimes,
GPU memory for the new value would be allocated before the old memory was released. If you're
running near the limits of GPU memory, this could cause you to run out of GPU memory.
Beginning with Theano 0.3.1, set_value will work in-place on the GPU, if the following conditions
are met:
* The destination on the GPU must be c_contiguous.
* The source is on the CPU.
* The old value must have the same dtype as the new value (which is a given for now,
since only float32 is supported).
* The old and new value must have the same shape.
* The old value is being completely replaced by the new value (not partially modified,
e.g. by replacing some subtensor of it).
* You change the value of the shared variable via set_value, not via the .value
accessors. You should not use the .value accessors anyway, since they will soon be
deprecated and removed.
It is also worth mentioning that, for efficient transfer to the GPU, Theano will make the new data
``c_contiguous``. This can require an extra copy of the data on the host.
This works both when ``borrow=True`` and when ``borrow=False``.
"""
if not borrow:
#TODO: check for cuda_ndarray type
if not isinstance(value, numpy.ndarray):
......@@ -84,11 +143,11 @@ CudaNdarrayType.SharedVariable = CudaNdarraySharedVariable
def cuda_shared_constructor(value, name=None, strict=False,
allow_downcast=None, borrow=False, broadcastable=None):
"""SharedVariable Constructor for TensorType"""
"""SharedVariable Constructor for CudaNdarrayType"""
# THIS CONSTRUCTOR TRIES TO CAST VALUE TO A FLOAT32, WHICH THEN GOES ONTO THE CARD
# SO INT shared vars, float64 shared vars, etc. all end up on the card.
# THIS IS NOT THE DEFAULT BEHAVIOUR THAT WE WANT.
# SEE float32_shared_constructor
#TODO: what should strict mean in this context, since we always have to make a copy?
......@@ -115,22 +174,34 @@ def cuda_shared_constructor(value, name=None, strict=False,
def float32_shared_constructor(value, name=None, strict=False,
allow_downcast=None, borrow=False, broadcastable=None):
"""SharedVariable Constructor for TensorType"""
"""SharedVariable Constructor for CudaNdarrayType from numpy.ndarray or CudaNdarray"""
# if value isn't a float32 ndarray, then raise
if not isinstance(value, numpy.ndarray):
raise TypeError('ndarray required')
if value.dtype.num != CudaNdarrayType.typenum:
# if value isn't a float32 ndarray, or a CudaNdarray then raise
if not isinstance(value, (numpy.ndarray, theano.sandbox.cuda.CudaNdarray)):
raise TypeError('ndarray or CudaNdarray required')
if isinstance(value, numpy.ndarray) and value.dtype.num != CudaNdarrayType.typenum:
raise TypeError('float32 ndarray required')
if broadcastable is None:
broadcastable = (False,) * len(value.shape)
type = CudaNdarrayType(broadcastable=broadcastable)
get_value_return_ndarray = True
if isinstance(value, theano.sandbox.cuda.CudaNdarray):
    get_value_return_ndarray = False
    if borrow:
        deviceval = value
    else:
        deviceval = value.copy()
else:
    deviceval = type_support_filter(value, broadcastable, False, None)
try:
rval = CudaNdarraySharedVariable(type=type, value=deviceval, name=name, strict=strict)
except Exception, e:
print "ERROR", e
raise
return rval
rval.get_value_return_ndarray = get_value_return_ndarray
return rval
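The borrow logic above (alias the caller's array when `borrow=True`, otherwise copy it to the device) can be illustrated with plain ndarrays — a minimal sketch using numpy arrays as a stand-in for CudaNdarray, since the copy-vs-alias behaviour is the same:

```python
import numpy as np

def shared_deviceval(value, borrow=False):
    """Mimic the constructor's borrow logic: alias when borrow=True, copy otherwise."""
    return value if borrow else value.copy()

value = np.ones(3, dtype='float32')
aliased = shared_deviceval(value, borrow=True)   # shares memory with value
copied = shared_deviceval(value, borrow=False)   # independent copy

value[0] = 7.0
# the borrowed variable sees the mutation, the copied one does not
```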
......@@ -400,11 +400,16 @@ def test_neibs_grad_verify_grad_warp_centered():
try:
unittest_tools.verify_grad(fn, [images_val], mode=mode_without_gpu)
raise Exception("Expected an error")
if cuda.cuda_available:
unittest_tools.verify_grad(fn, [images_val], mode=mode_with_gpu)
except NotImplementedError:
pass
if cuda.cuda_available:
try:
unittest_tools.verify_grad(fn, [images_val], mode=mode_with_gpu)
raise Exception("Expected an error")
except NotImplementedError:
pass
if __name__ == '__main__':
#test_neibs_gpu()
#test_neibs()
......
......@@ -44,7 +44,7 @@ import tensor
import misc.safe_asarray as safe_asarray
from tensor import opt, TensorType
import gof
from gof import Optimizer, toolbox, Op, Apply
from gof import Optimizer, toolbox, Op, Apply, Variable
from compile import optdb, SharedVariable, function, Param
import compile
import gradient
......@@ -1559,8 +1559,15 @@ class Scan(Op):
theano.config.floatX))
inner_gfn_ins = inner_g_outs + self.inputs
g_args = [self.n_steps] + g_outs[:self.n_outs_not_shared] \
+ scan_outputs + args[1:]
# Make sure you don't have numbers in here
if not isinstance(self.n_steps, Variable):
n_steps = tensor.as_tensor(self.n_steps)
else:
n_steps = self.n_steps
g_args = [n_steps] + g_outs[:self.n_outs_not_shared] \
+ scan_outputs + args[1:]
truncate_gradient = self.truncate_gradient
for x in self.store_steps[:self.n_outs_not_shared]:
if x>0 :
......@@ -1571,8 +1578,11 @@ class Scan(Op):
self.n_seqs, self.n_outs, self.n_outs_not_shared,
self.go_backwards, self.seqs_taps, self.outs_taps,
truncate_gradient)
g_scan_outs = g_scan(g_args)
# We need to add several None's fpr shared vars with updates
if not type(g_scan_outs) in (list, tuple):
g_scan_outs = [ g_scan_outs ]
# We need to add several None's for shared vars with updates
gradients = [None] + g_scan_outs[:self.n_seqs+self.n_outs_not_shared]
gradients += [None for i in xrange(self.n_outs-self.n_outs_not_shared)]
gradients += g_scan_outs[self.n_seqs+self.n_outs_not_shared:]
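The layout of the returned gradients list can be sketched as a standalone helper (hypothetical function and placeholder gradient names, for illustration only): `None` for the `n_steps` input and for shared outputs with updates, real gradients elsewhere.

```python
def assemble_gradients(g_scan_outs, n_seqs, n_outs, n_outs_not_shared):
    """Mirror the list assembly above: None for n_steps and for the
    shared outputs, the computed gradients in every other slot."""
    gradients = [None] + g_scan_outs[:n_seqs + n_outs_not_shared]
    gradients += [None for i in range(n_outs - n_outs_not_shared)]
    gradients += g_scan_outs[n_seqs + n_outs_not_shared:]
    return gradients

# e.g. 2 sequences, 3 outputs of which 2 are not shared
grads = assemble_gradients(['g_s0', 'g_s1', 'g_o0', 'g_o1', 'g_extra'], 2, 3, 2)
```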
......
......@@ -15,7 +15,7 @@ def sparse_constructor(value, name=None, strict=False, allow_downcast=None,
writeme
"""
if not isinstance(value, scipy.sparse.spmatrix):
raise TypeError()
raise TypeError("Expected a sparse matrix in the sparse shared variable constructor. Received: ",value.__class__)
if format is None:
format = value.format
......@@ -24,5 +24,3 @@ def sparse_constructor(value, name=None, strict=False, allow_downcast=None,
value = copy.deepcopy(value)
return SparseTensorSharedVariable(type=type, value=value, name=name,
strict=strict, allow_downcast=allow_downcast)
......@@ -468,7 +468,7 @@ def test_shape_i():
a = SparseType('csr', dtype=sparse_dtype)()
f = theano.function([a], a.shape[1], mode='FAST_RUN')
assert f(sp.csr_matrix(random_lil((100,10), sparse_dtype, 3)))==(10)
assert f(sp.csr_matrix(random_lil((100,10), sparse_dtype, 3))) == 10
def test_shape():
# Test that getting the shape of a sparse variable
......@@ -501,11 +501,20 @@ def test_may_share_memory():
import theano.tensor.tests.test_sharedvar
test_shared_options=theano.tensor.tests.test_sharedvar.makeSharedTester(
theano.sparse.shared, 'float64',
True, True, True, scipy.sparse.csc_matrix, scipy.sparse.issparse,
lambda a: dense_from_sparse(a*2.),
lambda a: numpy.asarray((a*2).todense()),
scipy.sparse.csr_matrix)
shared_constructor_ = theano.sparse.shared,
dtype_ = 'float64',
get_value_borrow_true_alias_ = True,
shared_borrow_true_alias_ = True,
set_value_borrow_true_alias_ = True,
set_value_inplace_ = False,
set_casted_value_inplace_ = False,
shared_constructor_accept_ndarray_ = False,
internal_type_ = scipy.sparse.csc_matrix,
test_internal_type_ = scipy.sparse.issparse,
theano_fct_ = lambda a: dense_from_sparse(a*2.),
ref_fct_ = lambda a: numpy.asarray((a*2).todense()),
cast_value_ = scipy.sparse.csr_matrix)
if __name__ == '__main__':
unittest.main()
......@@ -3538,8 +3538,16 @@ tilegrad = TileGrad()
class Tile(Op):
"""Tiles its input according to reps. Reps is of same dimension as x
and contains the number of times to tile x in each dimension"""
"""
Construct an array by repeating the input x according to reps pattern.
Tiles its input according to reps. The len of reps is the number of
dimension of x and contains the number of times to tile x in each dimension.
:see: `numpy.tile http://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html`_
"""
def __init__(self, ndim):
self.ndim = ndim
def __eq__(self, other):
......@@ -4273,7 +4281,7 @@ def grad(cost, wrt, g_cost=None, consider_constant=[], warn_type=False):
each element of the list. If an element of `wrt` is not differentiable
with respect to the output, then a zero variable is returned.
This function is a wrapper around a the more general function
This function is a wrapper around the more general function
`theano.gradient.grad_sources_inputs``.
"""
......
......@@ -941,12 +941,16 @@ class CAReduce(Op):
# If it's a zero-size array, use scalar_op.identity if available
if variable.shape[dimension] == 0:
if hasattr(self.scalar_op, 'identity'):
variable = self.scalar_op.identity
variable = numpy.array(self.scalar_op.identity)
break
else:
raise ValueError("Input (%s) has zero-size on axis %s, but self.scalar_op (%s) has no attribute 'identity'" % (variable, dimension, self.scalar_op))
else:
variable = self.ufunc.reduce(variable, dimension)
variable = numpy.asarray(variable)
if numpy.may_share_memory(variable, input):
# perhaps numpy is clever for reductions of size 1? We don't want this.
variable = variable.copy()
output[0] = theano._asarray(variable, dtype = node.outputs[0].type.dtype)
else:
output[0] = numpy.copy(variable)
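The defensive copy above guards against numpy handing back memory that aliases the input for degenerate reductions; a minimal numpy sketch of the same guard:

```python
import numpy as np

a = np.arange(3.0).reshape(3, 1)
# reduce over an axis of size 1: numpy may hand back memory aliasing `a`
r = np.asarray(np.add.reduce(a, 1))
if np.may_share_memory(r, a):
    # we don't want the reduction output to alias the input
    r = r.copy()
```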
......@@ -1169,27 +1173,79 @@ class Prod(CAReduce):
).get(idtype, idtype)
def grad(self, (prod_in, ), (gz, )):
'''
The grad of this Op would be very easy, were it not for the case
where zeros are present in a given "group" (ie. elements reduced
together to form the product).
If no zeros are found in the elements of the product, then the
partial derivative of the product relative to one of the elements
(one of the inputs) is simply the product of the other elements.
That's easy to see from the chain rule.
Now the trick (with no zeros) is to take the overall product, then
for every original element, the partial derivative is given by
this product divided by the element itself (which equals the product
of the other terms). This is easy to do by broadcasting the original
product.
(Note that we also need to broadcast-multiply by the "incoming gradient",
ie. the gradient of the cost relative to the output/product).
-----
With zeros, things get more complicated. For a given group, we have 3
cases:
* No zeros in the group. Use previous trick.
* If only one zero is present, then the gradient for that element is
non-zero, but is zero for all others.
* If more than one zero is present, then all the derivatives are zero.
For the last two cases (with 1 or more zeros), we can't use the division
trick, as this gives divisions by 0.
Implementing that case-by-case logic is not trivial, so a bunch of
hacks are piled up here to do it. Notably, for the "only one zero"
case, there's a special Op that computes the product of the elements
in the group, minus the zero (see ProdWithoutZeros). The trick is then
to use the division trick for groups with no zero, to use the
ProdWithoutZeros op where there's only one zero, and to output a
derivative of zero for any element part of a group with more than
one zero.
I do this by first counting the number of zeros in each group (see
the "T.eq()" bits), then taking this or that behavior (see T.switch)
based on the result of this count.
'''
if prod_in.dtype[0:3] in ('int','uin'):
return [None]
# Prepare the broadcasting that is used everywhere to broadcast
# over the original groups (ie. broadcast over the elements of a given
# product)
gz = as_tensor_variable(gz)
axis = self.axis
if axis is None:
axis = range(prod_in.type.ndim)
if axis == ():
return gz,
new_dims = []
new_dims = []
i = 0
for j, _ in enumerate(prod_in.type.broadcastable):
if j in axis:
new_dims.append('x')
else:
new_dims.append(i)
i += 1
i += 1
# result of the product, broadcastable over groups
prod_out = self(prod_in).dimshuffle(new_dims)
# incoming gradient, broadcastable over groups
gz = gz.dimshuffle(new_dims)
# division trick if we don't have zeros. This will contain
# NaNs to be eliminated in the T.switch if we do have zeros.
grad_case_without_zeros = (gz * prod_out / prod_in)
if self.no_zeros_in_input:
......@@ -1198,13 +1254,22 @@ class Prod(CAReduce):
else:
T = theano.tensor
where_zeros = T.eq(prod_in, 0.0)
where_zeros = T.eq(prod_in, 0.0)
sum_where_zeros = T.sum(where_zeros, axis=self.axis)
groups_with_single_zero = T.eq(sum_where_zeros, 1.0).dimshuffle(new_dims)
groups_with_single_zero = T.eq(sum_where_zeros, 1).dimshuffle(new_dims)
# tensor with 0 everywhere except for those places where
# a 0 part of a group with a single zero was to be found
where_single_zero = groups_with_single_zero * where_zeros
where_gz_not_zero = T.neq(gz, 0.0)
# further optimization to avoid computing ProdWithoutZeros
# if the incoming gradient is 0
where_gz_not_zero = T.neq(gz, 0.0)
# only take ProdWithoutZeros for the groups with single zeros
# with non-null incoming gradient
where_to_take_prod_without_zeros = \
groups_with_single_zero * where_gz_not_zero
# preprocess the original input so that we set 0 everywhere
# except for groups that contain a single zero, to avoid computing
# multiplications on other groups
prod_without_zeros_in = where_to_take_prod_without_zeros * prod_in
# TODO: put lazy switch here, if it'd work
# this is pretty efficient already (no multiplication if 0), but
......@@ -1212,7 +1277,8 @@ class Prod(CAReduce):
prod_without_zeros = ProdWithoutZeros(axis=self.axis)(prod_without_zeros_in)
prod_without_zeros = prod_without_zeros.dimshuffle(new_dims)
groups_without_zeros = T.eq(sum_where_zeros, 0.0).dimshuffle(new_dims)
groups_without_zeros = T.eq(sum_where_zeros, 0).dimshuffle(new_dims)
final_grad = T.switch(groups_without_zeros, grad_case_without_zeros,
T.switch(where_single_zero, prod_without_zeros, 0.0) * gz)
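As a sanity check, the case analysis above can be reproduced in plain numpy (a sketch, not the Op's implementation); for Prod(axis=1), the rows below exercise the no-zero, single-zero, and multiple-zero cases:

```python
import numpy as np

def prod_grad_rows(x):
    """Gradient of x.prod(axis=1).sum() w.r.t. x, by the case analysis above."""
    g = np.zeros_like(x)
    for i, row in enumerate(x):
        zeros = (row == 0)
        n_zeros = zeros.sum()
        if n_zeros == 0:
            # division trick: product of the row divided by each element
            g[i] = row.prod() / row
        elif n_zeros == 1:
            # only the zero's slot gets the product of the other elements
            g[i, zeros] = row[~zeros].prod()
        # two or more zeros: every derivative in the row stays 0
    return g

x = np.array([[1., 2., 3.],   # no zeros
              [0., 5., 6.],   # a single zero
              [0., 0., 9.]])  # more than one zero
```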
......@@ -1228,19 +1294,28 @@ class Prod(CAReduce):
return ()
class MulWithoutZeros(scalar.BinaryScalarOp):
identity = 1.
# "identity" here is zero, as in Reduce we don't want to start
# with reducing (1, something_else): this leads to the erronous
# case where a vector of zeros is reduced by binary reductions
# of (1, 0), which always ends up as 1 (ie. the result for
# the c version, for the product of [0,0,0], is 1.0)
identity = 0.
commutative = True
associative = True
def impl(self, *inputs):
if inputs[0] == 0.:
return inputs[1]
if inputs[1] == 0.:
return inputs[0]
return inputs[1] * inputs[2]
def impl(self, x, y):
if x == 0:
return y
if y == 0:
return x
return x*y
def c_code(self, node, name, (x,y), (z, ), sub):
return ("%(z)s = ((%(x)s == 0) ? (%(y)s) : " + \
"((%(y)s == 0) ? (%(x)s) : ((%(y)s)*(%(x)s))) );") % locals()
def c_code_cache_version(self):
return (1,)
mul_without_zeros = MulWithoutZeros(scalar.upcast_out, name = 'mul_without_zeros')
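Why the identity must be 0 rather than 1 can be checked with a pure-Python reduction mirroring the `impl` above:

```python
from functools import reduce

def mul_without_zeros(x, y):
    # same semantics as MulWithoutZeros.impl above
    if x == 0:
        return y
    if y == 0:
        return x
    return x * y

# with identity 0, an all-zero row correctly reduces to 0
all_zero = reduce(mul_without_zeros, [0., 0., 0.], 0.)
# with identity 1, the same row would wrongly reduce to 1
wrong = reduce(mul_without_zeros, [0., 0., 0.], 1.)
# and a row with a single zero yields the product of the other elements
single = reduce(mul_without_zeros, [0., 5., 2.], 0.)
```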
class ProdWithoutZeros(CAReduce):
......@@ -1263,4 +1338,3 @@ class ProdWithoutZeros(CAReduce):
return "ProdWithoutZeros"
else:
return "ProdWithoutZeros{%s}" % ", ".join(map(str, self.axis))
......@@ -595,5 +595,5 @@ def computeH(V,W,b,d):
return H
from . import ConvGrad3D
from . import ConvTransp3D
import ConvGrad3D
import ConvTransp3D
......@@ -90,6 +90,10 @@ def broadcast_like(value, template, env):
if template not in shape_of:
raise NotImplementedError('broadcast_like currently requires the template Variable to be in the env already')
rval = T.alloc(T.cast(value, template.dtype), *shape_of[template])
# the template may have 1s in its shape without being broadcastable
if rval.broadcastable != template.broadcastable:
rval = T.unbroadcast(rval, *[i for i in xrange(rval.ndim) if rval.broadcastable[i]
and not template.broadcastable[i]])
assert rval.type == template.type
return rval
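The axis selection fed to `T.unbroadcast` above can be sketched as a standalone helper (hypothetical name, for illustration only): it picks the dimensions where the alloc result came out broadcastable but the template, which may have 1s in its shape without being broadcastable, is not.

```python
def axes_to_unbroadcast(rval_broadcastable, template_broadcastable):
    """Axes where the alloc result is broadcastable but the template is not."""
    return [i for i, (r, t) in enumerate(zip(rval_broadcastable,
                                             template_broadcastable))
            if r and not t]

# a template with shape (1, 5) need not be broadcastable on axis 0
axes = axes_to_unbroadcast((True, False), (False, False))
```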
......@@ -663,14 +667,20 @@ def local_fill_to_alloc(node):
elif v.type.broadcastable == node.outputs[0].type.broadcastable:
# this is a cast
rval = [T.cast(v, node.outputs[0].type.dtype)]
elif r.type.broadcastable == node.outputs[0].type.broadcastable:
# we are broadcasting v somehow, but not r
rval = [broadcast_like(v, r, node.env)]
else:
# we are broadcasting v somehow
shape_of = node.env.shape_feature.shape_of
# we are broadcasting both v and r,
# the output shape must be computed
#
# TODO: implement this case (including a test!)
#
# I think the strategy should be to extend the shorter shape vector
# with 1s (how?) and then take the elementwise max of the two.
# - how to flag an error of shape mismatch where broadcasting should be illegal?
return
# TODO: cut out un-necessary dimshuffles of v
rval = [T.alloc(T.cast(v, node.outputs[0].dtype), *shape_of[node.outputs[0]])]
#if rval[0].type != node.outputs[0].type:
#print >> sys.stderr, theano.printing.debugprint(node.outputs[0], file='str')
assert rval[0].type == node.outputs[0].type, ('rval', rval[0].type,
'orig', node.outputs[0].type,
......@@ -764,10 +774,12 @@ def local_subtensor_make_vector(node):
@gof.local_optimizer([T.Elemwise])
def local_useless_elemwise(node):
"""
eq(x,x) -> 1
neq(x,x) -> 0
mul(x) -> x
add(x) -> x
eq(x,x) -> 1
neq(x,x) -> 0
mul(x) -> x
add(x) -> x
identity(x) -> x
"""
if isinstance(node.op, T.Elemwise):
......@@ -783,6 +795,8 @@ add(x) -> x
return [node.inputs[0]]
if node.op.scalar_op == theano.scalar.add and len(node.inputs)==1:
return [node.inputs[0]]
if node.op.scalar_op == theano.scalar.identity and len(node.inputs)==1:
return [node.inputs[0]]
@register_specialize
......@@ -2255,8 +2269,7 @@ def local_mul_specialize(node):
neg ^= True #toggles
elif N.all(y == 0.0):
# if we find any zero, we just return right away
return [T.alloc(numpy.asarray(0, dtype=node.outputs[0].dtype),
*node.env.shape_feature.shape_of[node.outputs[0]])]
return [broadcast_like(0, node.outputs[0], node.env)]
else:
new_inputs.append(input)
......@@ -2273,21 +2286,14 @@ def local_mul_specialize(node):
else:
rval = T.mul(*new_inputs)
return [T.alloc(T.cast(rval, node.outputs[0].dtype),
*node.env.shape_feature.shape_of[node.outputs[0]])]
return [broadcast_like(rval, node.outputs[0], node.env)]
else:
# there are no variable inputs to mul
# N.B. this could have been constant-folded...
if neg:
# return output's worth of -1
return [T.alloc(
numpy.asarray(-1, dtype=node.outputs[0].dtype),
*node.env.shape_feature.shape_of[node.outputs[0]])]
return [broadcast_like(-1, node.outputs[0], node.env)]
else:
# return output's worth of 1
return [T.alloc(
numpy.asarray(1, dtype=node.outputs[0].dtype),
*node.env.shape_feature.shape_of[node.outputs[0]])]
return [broadcast_like(1, node.outputs[0], node.env)]
register_specialize(local_mul_specialize)
......
"""
This file implement specialization optimization that break the canonicalization form
This file implements specialization optimizations that break the canonical form of the graph.
Currently there is a problem with the order of optimizations and the definition of a canonical graph.
Right now there is a canonicalization optimization phase that tries to make all equivalent graphs identical. This is not always achieved, but it does canonicalize many of the basic cases. We need to extend the definition of canonicalization to make this true more often.
The problem this file intends to fix in the future is that in the "Equilibrium" specialization optimization phase, some optimizations require that the graph is canonical, some require that it is not, and some break the canonical form. As we can't control the order of those optimizations, there are cases where an optimization requesting a canonical graph won't be applied because an optimization that breaks the canonical form executed before it.
To fix this, we need to split the specialization phase into a phase where optimizations can't break the canonical form and one where this is allowed. This is also needed for the stabilization optimization phase, but as it happens before the specialization phase, it causes fewer problems.
Also, we should make the env refuse optimizations that break the canonical form of the graph in the optimization phases where the graph is supposed to be canonical.
"""
# TODO: intelligent merge for mul/add
......@@ -30,7 +40,7 @@ from theano import scalar as scal
class MaxAndArgmaxOptimizer(Optimizer):
"""Replace MaxAndArgmax by CAReduce when the argmax is not used
This is faster as MaxAndArgmax don't have c code and execute it
This is faster as MaxAndArgmax don't have c code and execute it
in two pass.
"""
......@@ -70,7 +80,7 @@ def local_max_to_min(node):
This is tested in tensor/tests/test_basic.py:test_min_max
:note: we don't need an opt that will do the reverse as by default
:note: we don't need an opt that will do the reverse as by default
the interface put only MaxAndArgmax into the graph.
"""
if node.op == T.neg and node.inputs[0].owner:
......@@ -81,5 +91,3 @@ def local_max_to_min(node):
return [CAReduce(scal.minimum,max.owner.op.axis)(neg.owner.inputs[0])]
return False
......@@ -104,9 +104,9 @@ class test_Broadcast(unittest.TestCase):
xv = numpy.asarray(numpy.random.rand(*xsh))
yv = numpy.asarray(numpy.random.rand(*ysh))
zv = xv + yv
f(xv, yv)
assert xv.shape==zv.shape
def test_perform(self):
......@@ -217,11 +217,11 @@ class test_CAReduce(unittest.TestCase):
f(xv)
except ValueError:
pass
else:
else:
self.fail()
else:
self.failUnless((numpy.abs(f(xv) - zv) < 1e-10).all())
#test CAReduce.infer_shape
#the Shape op don't implement c_code!
......@@ -248,7 +248,7 @@ class test_CAReduce(unittest.TestCase):
self.with_linker(gof.CLinker(), maximum)
self.with_linker(gof.CLinker(), minimum)
#need other dtype then real
#need other dtype then real
#no c_code for or_, and_
#self.with_linker(gof.CLinker(), or_)
#self.with_linker(gof.CLinker(), and_)
......@@ -258,23 +258,28 @@ class test_Prod(unittest.TestCase):
def setUp(self):
unittest_tools.seed_rng()
# we want to allow nans in the matrices, so we disable this DEBUG_MODE check
mode = theano.compile.mode.get_default_mode()
mode = copy(mode)
mode.check_isfinite = False
self.mode = mode
def test_verify_grad(self):
# including zeros, as the case with zeros is important
# (and special cases: 1 zero in the row, more than 1 zero in the row)
x_val = numpy.asarray([[1,2,3],[4,5,6],[7,8,9]], dtype='float32')
x = theano.tensor.dmatrix()
# now with verify_grad
unittest_tools.verify_grad(Prod(axis=1), [x_val])
unittest_tools.verify_grad(Prod(axis=1), [x_val], mode=self.mode)
# second time, with some added complexity
# verify_grad takes the sum of the matrices anyway
def fn(x2):
return theano.tensor.sqr(Prod(axis=1)(x2))
unittest_tools.verify_grad(fn, [x_val])
unittest_tools.verify_grad(fn, [x_val], mode=self.mode)
def test_verify_grad_with_zeros(self):
......@@ -287,18 +292,18 @@ class test_Prod(unittest.TestCase):
x2 = theano.tensor.dmatrix()
p = Prod(axis=1)(x)
p2 = Prod(axis=1)(x2)
fn = theano.function([x,x2],[p-p2])
fn = theano.function([x,x2],[p-p2], mode=self.mode)
#print "hand computed diff for each row"
x2_val = numpy.asarray([[1., 2., 3.003], [0.003,5.,6], [0.,0.,9.01]])
#print fn(x_val, x2_val)
fn2 = theano.function([x],[theano.tensor.grad(p.sum(),x)])
fn2 = theano.function([x],[theano.tensor.grad(p.sum(),x)], mode=self.mode)
#print "real grad"
#print fn2(x_val)
fn3 = theano.function([x],[p])
fn3 = theano.function([x],[p], mode=self.mode)
assert numpy.allclose(fn3(x_val), [6.,0.,0.])
# now with verify_grad
unittest_tools.verify_grad(Prod(axis=1), [x_val])
unittest_tools.verify_grad(Prod(axis=1), [x_val], mode=self.mode)
# second time, with some added complexity
# verify_grad takes the sum of the matrices anyway
......@@ -318,11 +323,11 @@ class test_Prod(unittest.TestCase):
x = theano.tensor.dmatrix()
x_val = numpy.array([[1,2,3],[0,5,6],[0,0,9]], dtype='float32')
pwz = ProdWithoutZeros(axis=1)(x)
fn = theano.function([x], pwz)
fn = theano.function([x], pwz, mode=self.mode)
assert numpy.allclose(fn(x_val), [6,30,9])
pwz_a0 = ProdWithoutZeros(axis=0)(x)
fn_a0 = theano.function([x], pwz_a0)
fn_a0 = theano.function([x], pwz_a0, mode=self.mode)
assert numpy.allclose(fn_a0(x_val), [1, 10, 162])
def test_other_grad_tests(self):
......@@ -333,24 +338,32 @@ class test_Prod(unittest.TestCase):
p = Prod(axis=1)
grad_p = theano.tensor.grad(p(x).sum(), x)
grad_fn = theano.function([x], grad_p)
grad_fn = theano.function([x], grad_p, mode=self.mode)
assert numpy.allclose(grad_fn(x_val1), [[6.,3.,2.],[30.,0.,0.],[0.,0.,0.]])
assert numpy.allclose(grad_fn(x_val2), [[0., 0., 2.], [30., 0., 0.], [72., 63., 56.], [0., 0., 90.]])
p_axis0 = Prod(axis=0)
grad_p_axis0 = theano.tensor.grad(p_axis0(x).sum(), x)
grad_fn_axis0 = theano.function([x], grad_p_axis0)
grad_fn_axis0 = theano.function([x], grad_p_axis0, mode=self.mode)
assert numpy.allclose(grad_fn_axis0(x_val2), [[0., 400., 0.],[63., 160., 0.], [0., 100., 0.], [0., 80., 0.]])
tensor.verify_grad(p, [x_val1], rng=rng)
tensor.verify_grad(p, [x_val1], rng=rng, mode=self.mode)
def test_mul_without_zeros_zeros(self):
a = numpy.zeros((3,3))
x = theano.tensor.dmatrix()
mul1 = ProdWithoutZeros(axis=0)(x)
fn_debug = theano.function([x], mul1, mode=self.mode)
fn_debug(a)
if __name__ == '__main__':
unittest.main()
#suite = unittest.TestSuite([test_Prod('test_verify_grad')])
#unittest.main()
suite = unittest.TestSuite([test_Prod('test_mul_without_zeros_zeros')])
#suite.addTest(test_Prod('test_verify_grad_with_zeros'))
#suite.addTest(test_Prod('test_prod_without_zeros'))
#suite.addTest(test_Prod('test_other_grad_tests'))
#unittest.TextTestRunner().run(suite)
unittest.TextTestRunner().run(suite)
......@@ -1039,6 +1039,34 @@ class T_Scan(unittest.TestCase):
assert updates[b].type.ndim == b.type.ndim
def test_scan_as_tensor_on_gradients(self):
"""
Bug reported by cityhall on scan when computing the gradients
"""
to_scan = theano.tensor.dvector('to_scan')
seq = theano.tensor.dmatrix('seq')
f1 = theano.tensor.dscalar('f1')
def scanStep(prev, seq, f1):
return prev + f1 * seq
scanned, _ = theano.scan(fn = scanStep, \
sequences = [seq], \
outputs_info = [to_scan], \
non_sequences = [f1])
f_scan = theano.function(inputs=[to_scan, seq, f1], outputs=scanned)
f_scan([1,2,3], numpy.arange(12).reshape([4,3]), 1.)
t_grad = theano.tensor.grad(scanned.sum(), wrt=[to_scan, f1],
consider_constant=[seq])
f_grad = theano.function(inputs=[to_scan, seq, f1], outputs=t_grad)
f_scan([1,2,3], numpy.arange(12).reshape([4,3]), 1.)
f_grad([1,2,3], numpy.arange(12).reshape([4,3]), 1.)
if __name__ == '__main__':
unittest.main()
""" test code snipet in the Theano tutorials.
""" test code snippet in the Theano tutorials.
"""
import unittest
......