Commit 79fda719 authored by Olivier Delalleau

Merged (solved conflict in theano/configdefaults.py)

...@@ -24,27 +24,27 @@ instructions below for detailed installation steps):

Python_ >= 2.4
    The development package (``python-dev`` or ``python-devel``
    on most Linux distributions) is recommended (see just below).
``g++``, ``python-dev``
    Not technically required but *highly* recommended, in order to compile
    generated C code. Theano `can` fall back on a NumPy-based Python execution
    model, but a C compiler allows for vastly faster execution.
`NumPy <http://numpy.scipy.org/>`_ >= 1.3.0
    Earlier versions have memory leaks.
`SciPy <http://scipy.org>`_
    Only currently required for sparse matrix and special functions
    support, but *highly* recommended. We recommend SciPy
    >=0.7 if you are using sparse matrices, because ``scipy.sparse``
    is buggy in 0.6 (the ``scipy.csc_matrix`` version of ``dot()`` has a
    bug with singleton dimensions, there may be more bugs).
A `BLAS`_ installation (with Level 3 functionality)
    Including the development headers (``-dev``, ``-devel``, depending on
    your Linux distribution). Mac OS X comes with the `Accelerate
    framework`_ built in, and various options exist for Windows (see
    below).

.. _BLAS: http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
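The version floors above (NumPy >= 1.3.0, SciPy >= 0.7) can be checked programmatically. A minimal sketch using only the standard library; the helper names are ours, not Theano's:

```python
# Sketch: compare dotted version strings, e.g. to enforce the
# NumPy >= 1.3.0 and SciPy >= 0.7 recommendations above.

def version_tuple(v):
    """Turn '1.3.0' into (1, 3, 0) for numeric comparison."""
    return tuple(int(part) for part in v.split('.'))

def meets_minimum(installed, minimum):
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)
```

Comparing tuples rather than raw strings avoids the classic pitfall where `'0.10' < '0.7'` lexicographically.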
...@@ -55,14 +55,17 @@ The following libraries and software are optional:

`nose <http://somethingaboutorange.com/mrl/projects/nose/>`_
    Recommended, to run Theano's test-suite.
`Sphinx <http://sphinx.pocoo.org/>`_ >= 0.5.1, `pygments <http://pygments.org/>`_
    For building the documentation. LaTeX_ and dvipng_ are also necessary
    for math to show up as images.
`Mercurial <http://mercurial.selenic.com/>`_
    To download bleeding-edge versions of Theano.
`NVIDIA CUDA drivers and SDK`_
    Required for GPU code generation/execution. Only NVIDIA GPUs using
    32-bit floating point numbers are currently supported.

.. _LaTeX: http://www.latex-project.org/
.. _dvipng: http://savannah.nongnu.org/projects/dvipng/
...@@ -77,7 +80,7 @@ Basic user install instructions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The easiest way to obtain the released version of Theano is from PyPI using
pip_ (a replacement for easy_install_ provided by setuptools_/distribute_),
by typing

.. code-block:: bash

...@@ -111,7 +114,7 @@ directory; see the `virtualenv documentation`_ for details.

``easy_install``, such as more intelligent dependency management, better
error messages and a ``pip uninstall`` command for easily removing
packages.

If you do not have ``pip`` installed but do have ``easy_install``, you can
get ``pip`` by simply typing ``easy_install pip``.
...@@ -320,7 +323,7 @@ correctly (for example, for MKL this might be ``-lmkl -lguide -lpthread`` or

.. note::

    Make sure your BLAS
    libraries are available as dynamically-loadable libraries.
    ATLAS is often installed only as a static library. Theano is not able to
    use this static library. Your ATLAS installation might need to be modified
    to provide dynamically loadable libraries. (On Linux this
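Whether a BLAS is visible as a dynamically loadable library can be probed from Python. A hedged sketch using only the standard library; the candidate library names are common conventions and vary by distribution, and `find_library` returns None when only a static ``.a`` is installed:

```python
from ctypes.util import find_library

def find_dynamic_blas(candidates=('openblas', 'blas', 'atlas', 'mkl_rt')):
    """Return the first BLAS shared library the system loader can find,
    or None if no candidate name resolves to a .so/.dylib/.dll."""
    for name in candidates:
        path = find_library(name)
        if path is not None:
            return path
    return None
```

If this returns None for your ATLAS install, that is consistent with the static-only installation problem described in the note above.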
...@@ -334,8 +337,8 @@ correctly (for example, for MKL this might be ``-lmkl -lguide -lpthread`` or

Mac
---

- If the above required libraries are not already installed on your Mac,
  one option is to first install `MacPorts <http://www.macports.org/>`__.

- Then, in order to install one or more of the required libraries, use "port install", e.g. as follows:
...@@ -359,7 +362,7 @@ Mac

  reason this is necessary is that you might have an Apple-provided python
  (via, for example, an Xcode installation). After performing this step, you
  should check that the symbolic link provided by ``which python`` points to
  the MacPorts python. For instance, on Snow Leopard with the latest MacPorts,
  the output of ``which python`` is ``/opt/local/bin/python`` and this symbolic
  link points to ``/opt/local/bin/python2.6``. When executing ``sudo
  python_select python26-apple`` (which you should **not** do), the link

...@@ -376,7 +379,7 @@ Mac

- Please follow the same procedure with ``numpy``.

- Put ``export PYTHONPATH=/opt/local/lib/python2.6/site-packages:$PYTHONPATH``
  in your ``.bashrc`` in order to include your MacPorts Python packages
  (NumPy, SciPy) in Python's path.

- Make sure that the gcc version that you have installed on your system is
...@@ -408,7 +411,7 @@ Mac

- An obscure ``Bus error`` can sometimes be caused when linking
  Theano-generated object files against the ``framework`` library in Leopard.
  For this reason, we've disabled linking with ``-framework Python``, since on
  most configurations this solves the ``Bus error`` problem. If this default
  configuration causes problems with your Python/Theano installation and you think
  that linking with ``-framework Python`` might help, then either set

...@@ -421,9 +424,10 @@ Mac

    mac_framework_link=True

Please inform us if you have trouble installing and running Theano on your mac.
We would be especially interested in dependencies that we missed
listing, as well as tests that fail on your platform (use the
``theano-users@googlegroups.com`` mailing list, but note that you must
first register for it, by going to `theano-users`_).
Windows
...@@ -533,7 +537,7 @@ used within a MinGW Shell (not available if you only installed Python(x,y)).

  to create under Windows) in your user profile directory (the directory you
  are in when you start a new command prompt with ``cmd``), containing the
  following two lines:

  .. code-block:: cfg

      [blas]
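The ``ldflags`` line of the ``[blas]`` section is elided in the snippet above. A .theanorc-style file can be parsed with the standard library; the flag value below is a hypothetical example, not the elided line:

```python
# Sketch: read a .theanorc-style [blas] section with the stdlib parser.
# The ldflags value here is a made-up illustration, NOT the real elided line.
from configparser import ConfigParser

config = ConfigParser()
config.read_string("""\
[blas]
ldflags = -LC:\\mingw\\lib -lblas
""")

ldflags = config.get('blas', 'ldflags')
```

Theano's own config machinery reads such files the same basic way (an INI-format section per flag group).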
...@@ -564,7 +568,7 @@ used within a MinGW Shell (not available if you only installed Python(x,y)).

  error of the type ``"Not enough storage is available to
  process this command"``): one workaround is to run nosetests
  multiple times under individual subdirectories.
Compiling a faster BLAS
~~~~~~~~~~~~~~~~~~~~~~~

...@@ -598,7 +602,7 @@ top of the MinGW installation included within Python(x,y), as follows:

- In a prompt (``cmd``), install MSYS with

  .. code-block:: bash

      mingw-get install msys-base

- Edit ``pythonxy\mingw\msys\1.0\msys.bat`` (e.g. in Wordpad) and add as first

...@@ -619,7 +623,7 @@ follows:

a) Download `ActivePerl <http://www.activestate.com/activeperl/downloads>`_ and
   install it (other Perl interpreters should also work, but were not
   tested).

b) Unpack GotoBLAS2, either using `7-zip <http://www.7-zip.org/>`__ or in
   a shell with:

...@@ -633,7 +637,7 @@ follows:

   .. code-block:: bash

       quickbuild.win32 1>log.txt 2>err.txt

   Compilation should take a few minutes. Afterwards, you will probably
   find many error messages in err.txt, but there should be an ``exports``

...@@ -695,20 +699,21 @@ use a compilation directory located somewhere else:

    base_compiledir=path_to_a_directory_without_such_characters
Then

1) Install the CUDA driver (32-bit on 32-bit Windows, idem for 64-bit).
2) Install the CUDA toolkit 32-bit (even if your computer is 64-bit;
   it must match the Python installation version).
3) Install the CUDA SDK 32-bit.
4) Test some pre-compiled examples from the SDK.
5) Download Visual Studio 2008 Express (free; VS2010 is not supported by nvcc 3.1,
   VS2005 is not available for download but is supported by nvcc, and the
   non-free version should work too).
6) Follow the instructions in the GettingStartedWindows.pdf file from the CUDA web
   site to compile CUDA code with VS2008. If that does not work, you will
   not be able to compile GPU code with Theano.
...@@ -729,7 +734,7 @@ Then

9) Then run the Theano CUDA test files with nosetests from the
   ``theano/sandbox/cuda/tests`` subdirectory. In the current version of
   Theano, this should fail with an error like:

   .. code-block:: bash

       NVCC: nvcc fatal: Don't know what to do with

...
...@@ -156,7 +156,7 @@ This is primarily for internal debugging, not for typical use.

To support the transparent use of the different kinds of optimization Theano
can make, the policy is that get_value() by default always returns the same
object type it received when the shared variable was created. So if you
manually created data on the gpu and created a shared variable on the gpu with
this data, get_value will always return gpu data, even when
return_internal_type=False.

...
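The policy described above can be illustrated with a small pure-Python mock; ``FakeGpuArray`` and ``MockShared`` are hypothetical stand-ins for illustration, not Theano classes:

```python
class FakeGpuArray(object):
    """Hypothetical stand-in for a GPU-resident array."""
    def __init__(self, data):
        self.data = list(data)
    def to_host(self):
        return list(self.data)

class MockShared(object):
    """Mimics the get_value() policy above: by default, return the same
    object type the shared variable was created with."""
    def __init__(self, value):
        self.container = value
    def get_value(self, return_internal_type=False):
        # Per the policy: a variable created from GPU data keeps
        # returning GPU data, even when return_internal_type=False.
        return self.container

s = MockShared(FakeGpuArray([1.0, 2.0]))
```

With the real Theano classes the behavior is analogous: the creation-time type, not the flag, determines what get_value returns by default.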
...@@ -6,6 +6,7 @@ from StringIO import StringIO

import numpy

import theano
from theano import gof
from theano.gof import Env, graph, utils, link
from theano.gof.link import WrapLinkerMany, raise_with_op

...@@ -536,6 +537,9 @@ def _check_inputs(node, storage_map, r_vals, dr_vals, active_nodes, clobber_dr_v

                # But this depends on the version of numpy!
                if getattr(out_var,'size',2)==1:
                    continue
                if isinstance(node.op, theano.compile.mode.OutputGuard):
                    # This class is not in the final graph.
                    continue
                if not _may_share_memory(out_var, in_var):
                    # when a subtensor returns a tensor of ndim==0, numpy seems to return a copy.
                    # when we have an empty ndarray (happens with output guard) it is not the same. why?

...
...@@ -24,7 +24,10 @@ AddConfigVar('device',
    )

AddConfigVar('init_gpu_device',
        ("Initialize the gpu device to use, works only if device=cpu. "
         "Unlike 'device', setting this option will NOT move computations, "
         "nor shared variables, to the specified GPU. "
         "It can be used to run GPU-specific tests on a particular GPU."),
        EnumStr('', 'gpu0', 'gpu1', 'gpu2', 'gpu3',
                allow_override=False)
        )

...
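The ``EnumStr`` used above restricts the flag to a fixed set of strings, with the first value as the default. A minimal sketch of that validation; this is an illustration, not Theano's actual ``EnumStr`` implementation:

```python
class EnumStrSketch(object):
    """Sketch of a string-enum config validator: the first value is the
    default, and only listed values are accepted."""
    def __init__(self, default, *options):
        self.default = default
        self.all = (default,) + options

    def validate(self, value):
        if value not in self.all:
            raise ValueError("invalid value %r, expected one of %r"
                             % (value, self.all))
        return value

# Mirrors the declaration above: '' (disabled) is the default.
init_gpu_device = EnumStrSketch('', 'gpu0', 'gpu1', 'gpu2', 'gpu3')
```

``allow_override=False`` in the real declaration additionally prevents the flag from being changed after startup, which this sketch does not model.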
...@@ -104,7 +104,8 @@ if cuda_available:

        cuda_available = False
        cuda_initialization_error_message = e.message

# We must do these imports to be able to create the full doc when
# nvcc is not available
from theano.sandbox.cuda.var import (CudaNdarrayVariable,
                                     CudaNdarrayConstant,
                                     CudaNdarraySharedVariable,

...@@ -115,7 +116,12 @@ if cuda_available:

    # check if there is an old cuda_ndarray that was loaded instead of the one we compiled!
    import cuda_ndarray.cuda_ndarray
    if cuda_ndarray_so != cuda_ndarray.cuda_ndarray.__file__:
        warning("WARNING: cuda_ndarray was loaded from",
                cuda_ndarray.cuda_ndarray.__file__,
                """This is not expected as theano should compile it
automatically for you. Do you have a directory called cuda_ndarray in your
LD_LIBRARY_PATH environment variable? If so, please remove it as it is
outdated!""")

    shared_constructor = float32_shared_constructor
...@@ -204,8 +210,14 @@ def handle_shared_float32(tf):
        raise NotImplementedError('removing our handler')

if config.device.startswith('gpu'):
    use(device=config.device, force=config.force_device)
elif config.init_gpu_device:
    assert config.device=="cpu", "We can use the Theano flag init_gpu_device only when the Theano flag device=='cpu'"
    warning(("GPU device %s will be initialized, and used if a GPU is needed. "
             "However, no computation, nor shared variables, will be implicitly "
             "moved to that device. If you want that behavior, use the 'device' "
             "flag instead.") % config.init_gpu_device)
    use(device=config.init_gpu_device,
        force=config.force_device,
        default_to_move_computation_to_gpu=False,
        move_shared_float32_to_gpu=False)
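The dispatch above (``device`` moves computation to the GPU; ``init_gpu_device`` only initializes it) can be sketched with a stub ``use()``; the recorded-calls list and ``init_device`` helper are illustrative, not part of Theano:

```python
calls = []

def use(device, force=False, default_to_move_computation_to_gpu=True,
        move_shared_float32_to_gpu=True):
    """Stub standing in for theano.sandbox.cuda.use: just record the call."""
    calls.append((device, default_to_move_computation_to_gpu,
                  move_shared_float32_to_gpu))

def init_device(device, init_gpu_device, force_device=False):
    # Mirrors the control flow above.
    if device.startswith('gpu'):
        use(device=device, force=force_device)
    elif init_gpu_device:
        assert device == "cpu", (
            "init_gpu_device can only be used when device == 'cpu'")
        use(device=init_gpu_device,
            force=force_device,
            default_to_move_computation_to_gpu=False,
            move_shared_float32_to_gpu=False)

init_device('cpu', 'gpu1')   # initialize gpu1 without moving anything to it
```

The key point the new code makes explicit is that ``init_gpu_device`` passes both "move" flags as False, unlike the plain ``device`` path.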
...@@ -335,12 +335,14 @@ class GpuConv(Op):
                ^ hash(self.imshp)

    def __str__(self):
        return '%s{%s, %s, %s, %s, %s, %s, %s}' %(self.__class__.__name__,
                self.border_mode,
                str(self.subsample),
                str(self.logical_img_hw),
                str(self.logical_kern_hw),
                str(self.logical_kern_align_top),
                str(self.imshp),
                str(self.kshp))

    def make_node(self, img, kern):
        if img.type.ndim != 4:

...
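The widened ``__str__`` above now includes ``imshp`` and ``kshp``. A stand-alone sketch of the same seven-field formatting; the class and all attribute values here are made up for illustration:

```python
class GpuConvStr(object):
    """Illustrative stand-in showing the 7-field __str__ format above."""
    def __init__(self):
        self.border_mode = 'valid'
        self.subsample = (1, 1)
        self.logical_img_hw = None
        self.logical_kern_hw = None
        self.logical_kern_align_top = True
        self.imshp = (3, 32, 32)   # hypothetical image shape
        self.kshp = (5, 5)         # hypothetical kernel shape

    def __str__(self):
        # One %s per field, class name first, matching the diff above.
        return '%s{%s, %s, %s, %s, %s, %s, %s}' % (
            self.__class__.__name__,
            self.border_mode,
            str(self.subsample),
            str(self.logical_img_hw),
            str(self.logical_kern_hw),
            str(self.logical_kern_align_top),
            str(self.imshp),
            str(self.kshp))
```

Including the shapes in ``__str__`` makes two GpuConv instances with different compile-time shape hints distinguishable in printed graphs.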
...@@ -52,9 +52,9 @@ void * device_malloc(size_t size)
    #if COMPUTE_GPU_MEM_USED
        for(int i=0;i<TABLE_SIZE;i++){
            if(NULL==_alloc_size_table[i].ptr){
                _alloc_size_table[i].ptr=rval;
                _alloc_size_table[i].size=size;
                break;
            }
        }
        _allocated_size += size;
...@@ -88,16 +88,19 @@ int device_free(void *ptr)
    _outstanding_mallocs[0] -= (ptr != NULL);
    #if COMPUTE_GPU_MEM_USED
        int i=0;
        size_t total_freed = 0;
        for(;i<TABLE_SIZE;i++)
            if(_alloc_size_table[i].ptr==ptr){
                _allocated_size -= _alloc_size_table[i].size;
                total_freed += _alloc_size_table[i].size;
                _alloc_size_table[i].ptr=0;
                _alloc_size_table[i].size=0;
                break;
            }
        if(i==TABLE_SIZE)
            printf("Unallocated unknown size!\n");
        //fprintf(stderr, "freed %li bytes of device memory (%s). %d already allocated, ptr=%p\n", (long)total_freed, cudaGetErrorString(err),_allocated_size,ptr);
    #endif
    return 0;
}
...@@ -158,10 +161,10 @@ CudaNdarray_uninit(CudaNdarray*self)
}

//make the rightmost coords change fastest
//TODO: why does a downward for-loop not work????
//TODO: use the log2_dims and driver code to remove / and %
//TODO: skip the last division (when d == 0)
#define decl_k_elemwise_unary_rowmajor(name, F) \
__global__ void name (unsigned int numEls, \
        unsigned int nd, \

...@@ -180,7 +183,7 @@ __global__ void name (unsigned int numEls, \
        for (unsigned int _d = 0; _d < nd; ++_d) \
        { \
            unsigned int d = nd - _d-1; \
            /* i_d used to be unsigned, but there is a bug in nvcc 3.0; making it signed fixes the bug. */\
            int i_d = ii % dim[d]; /* i_d is our position in the d'th dimension */ \
            ii = ii / dim[d]; \
            a_i += i_d * a_str[d]; /* increment our a and z pointers by i_d elements */ \

...@@ -188,7 +191,7 @@ __global__ void name (unsigned int numEls, \
        } \
        z_i[0] = F(a_i[0]); \
    } \
}

template<typename T> __device__ T unary_copy(T a) { return a; }
decl_k_elemwise_unary_rowmajor(k_elemwise_unary_rowmajor_copy, unary_copy<float>)
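The index arithmetic in the kernel above (``ii % dim[d]`` then ``ii / dim[d]``, walking dimensions right to left) is standard row-major index unraveling. A pure-Python sketch of the same loop:

```python
def unravel_rowmajor(ii, dims):
    """Decompose a flat row-major index into per-dimension coordinates,
    rightmost dimension changing fastest, as the CUDA kernel does."""
    coords = [0] * len(dims)
    for _d in range(len(dims)):
        d = len(dims) - _d - 1       # walk dimensions right to left
        coords[d] = ii % dims[d]     # position in the d'th dimension
        ii = ii // dims[d]           # drop that dimension and continue
    return tuple(coords)
```

Multiplying each coordinate by its stride (as the kernel does with ``a_str``/``z_str``) then turns these coordinates back into an element offset for an arbitrarily strided array.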
...@@ -237,7 +240,7 @@ CudaNdarray_init(CudaNdarray *self, PyObject *args, PyObject *kwds)
    PyObject *arr=NULL;

    if (! PyArg_ParseTuple(args, "O", &arr))
        return -1;
    if (! PyArray_Check(arr))
    {
        PyErr_SetString(PyExc_TypeError, "PyArray arg required");

...@@ -246,7 +249,7 @@ CudaNdarray_init(CudaNdarray *self, PyObject *args, PyObject *kwds)
    int rval = CudaNdarray_CopyFromArray(self, (PyArrayObject*)arr);
    return rval;
}

static PyMemberDef CudaNdarray_members[] =
{
    /*
    {"first", T_OBJECT_EX, offsetof(CudaNdarray, first), 0,

...@@ -269,7 +272,7 @@ PyObject * CudaNdarray_CreateArrayObj(CudaNdarray * self)
        PyObject * rval = PyArray_SimpleNew(self->nd, npydims, REAL_TYPENUM);
        free(npydims);
        if (!rval){
            return NULL;
        }
        assert (PyArray_ITEMSIZE(rval) == sizeof(real));
        return rval;

...@@ -310,7 +313,7 @@ PyObject * CudaNdarray_CreateArrayObj(CudaNdarray * self)
    assert (PyArray_ITEMSIZE(rval) == sizeof(real));
    cublasGetVector(PyArray_SIZE(rval), sizeof(real),
                    contiguous_self->devdata, 1,
                    PyArray_DATA(rval), 1);
    CNDA_THREAD_SYNC;
...@@ -339,13 +342,13 @@ PyObject* CudaNdarray_ZEROS(int n, int * dims)
    CudaNdarray* rval = (CudaNdarray*)CudaNdarray_new_null();
    if (!rval)
    {
        PyErr_SetString(PyExc_RuntimeError, "CudaNdarray_ZEROS: call to new_null failed");
        return NULL;
    }

    if (CudaNdarray_alloc_contiguous(rval, n, dims))
    {
        PyErr_SetString(PyExc_RuntimeError, "CudaNdarray_ZEROS: allocation failed.");
        Py_DECREF(rval);
        return NULL;
    }

...@@ -354,14 +357,14 @@ PyObject* CudaNdarray_ZEROS(int n, int * dims)
    //fprintf(stdout, "Sizeof: %d\n", total_size);
    if (cudaSuccess != cudaMemset(rval->devdata, 0, total_size))
    {
        PyErr_Format(PyExc_MemoryError, "CudaNdarray_ZEROS: Error memsetting %d bytes of device memory.", total_size);
        Py_DECREF(rval);
        return NULL;
    }

    if (cnda_copy_structure_to_device(rval))
    {
        PyErr_SetString(PyExc_RuntimeError, "CudaNdarray_ZEROS: syncing structure to device failed");
        Py_DECREF(rval);
        return NULL;
    }
...@@ -544,7 +547,7 @@ PyObject * CudaNdarray_ReduceSum(CudaNdarray * self, PyObject * py_reduce_mask)
    return (PyObject*)self_sum;
}

__global__ void k_copy_reshape_rowmajor(unsigned int numEls,
        unsigned int a_nd, const float * a_data, const int * a_dim, const int * a_str,
        unsigned int z_nd, float * z_data, const int * z_dim, const int * z_str)
{

...@@ -557,7 +560,7 @@ __global__ void k_copy_reshape_rowmajor(unsigned int numEls,
        unsigned int a_ii = i;
        for (unsigned int _d = 0; _d < a_nd; ++_d) //make the rightmost coords change fastest
        {
            unsigned int d = a_nd - _d-1;
            unsigned int a_i_d = a_ii % a_dim[d];
            a_ii = a_ii / a_dim[d];
            a_i += a_i_d * a_str[d];

...@@ -566,7 +569,7 @@ __global__ void k_copy_reshape_rowmajor(unsigned int numEls,
        float * z_i = z_data;
        for (unsigned int _d = 0; _d < z_nd; ++_d) //make the rightmost coords change fastest
        {
            unsigned int d = z_nd - _d-1;
            //i tried to make the for loop count down, but it didn't work!?
            unsigned int z_i_d = z_ii % z_dim[d];
            z_i += z_i_d * z_str[d];
...@@ -608,7 +611,7 @@ PyObject * CudaNdarray_Reshape(CudaNdarray * self, PyObject * shape) ...@@ -608,7 +611,7 @@ PyObject * CudaNdarray_Reshape(CudaNdarray * self, PyObject * shape)
return NULL; return NULL;
} }
rval_size = rval_size * rval_dims[i]; rval_size = rval_size * rval_dims[i];
} }
}else{ }else{
rval_size = PyInt_AsLong(shape); rval_size = PyInt_AsLong(shape);
rval_dims[0] = rval_size; rval_dims[0] = rval_size;
@@ -629,24 +632,24 @@ PyObject * CudaNdarray_Reshape(CudaNdarray * self, PyObject * shape)
    {
        //return a view, not a copy
        CudaNdarray * rval = (CudaNdarray *)CudaNdarray_New(rval_nd);
        if (!rval || 0 != rval->data_allocated
            || CudaNdarray_set_device_data(rval, CudaNdarray_DEV_DATA(self), self))
        {
            Py_XDECREF(rval);
            free(rval_dims);
            return NULL;
        }
        //set dim and stride
        int size = 1;
        for (int i = rval_nd-1; i >= 0; --i)
        {
            CudaNdarray_set_stride(rval, i, (rval_dims[i] == 1) ? 0 : size);
            CudaNdarray_set_dim(rval, i, rval_dims[i]);
            size = size * rval_dims[i];
        }
        free(rval_dims);
        return (PyObject*)rval;
    }
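The view branch above assigns strides right-to-left: the running element count becomes each dimension's stride, and size-1 dimensions get stride 0 so they behave as broadcastable. A minimal host-side sketch of that rule (a hypothetical helper, not part of CudaNdarray):

```cpp
#include <vector>

// Compute C-contiguous strides (in elements) for `dims`, walking the shape
// right to left; dimensions of size 1 get stride 0, mirroring the
// CudaNdarray_set_stride call in the view-returning branch above.
std::vector<int> c_strides_with_zero_for_ones(const std::vector<int>& dims)
{
    std::vector<int> str(dims.size());
    int size = 1;  // running product of dimensions to the right
    for (int i = (int)dims.size() - 1; i >= 0; --i)
    {
        str[i] = (dims[i] == 1) ? 0 : size;
        size *= dims[i];
    }
    return str;
}
```

For shape {2, 1, 3} this yields strides {3, 0, 1}: the middle dimension contributes nothing to the offset, so it can be broadcast freely.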
    // allocate new space (TODO: test to see if we can re-use old one)
@@ -662,21 +665,21 @@ PyObject * CudaNdarray_Reshape(CudaNdarray * self, PyObject * shape)
    unsigned int threads_per_block = std::min(rval_size, (unsigned int)NUM_VECTOR_OP_THREADS_PER_BLOCK);
    unsigned int n_blocks = std::min(ceil_intdiv(rval_size,threads_per_block), (unsigned int)NUM_VECTOR_OP_BLOCKS);
    k_copy_reshape_rowmajor<<<n_blocks,threads_per_block>>>(
            rval_size,
            self->nd,
            CudaNdarray_DEV_DATA(self), CudaNdarray_DEV_DIMS(self), CudaNdarray_DEV_STRIDES(self),
            rval->nd,
            CudaNdarray_DEV_DATA(rval), CudaNdarray_DEV_DIMS(rval), CudaNdarray_DEV_STRIDES(rval));
    CNDA_THREAD_SYNC;
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        Py_DECREF(rval);
        PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_copy_reshape_rowmajor", cudaGetErrorString(err));
        free(rval_dims);
        return NULL;
    }
    free(rval_dims);
    return (PyObject*)rval;
}
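The launch-size arithmetic used here (and in the other kernel launches below) clamps threads per block to both the element count and the per-block cap, then rounds the block count up so every element is covered, clamped again to the block cap. A host-side sketch, with `max_threads` and `max_blocks` as stand-ins for the NUM_VECTOR_OP_* constants:

```cpp
#include <algorithm>

// Round-up integer division, as used by the launch computations above.
unsigned ceil_intdiv(unsigned a, unsigned b) { return (a + b - 1) / b; }

struct Launch { unsigned n_blocks, threads_per_block; };

// Pick a launch configuration for an elementwise kernel over n_elems
// elements: no more threads than elements, and enough blocks to cover
// everything, subject to the two caps.
Launch launch_config(unsigned n_elems, unsigned max_threads, unsigned max_blocks)
{
    Launch cfg;
    cfg.threads_per_block = std::min(n_elems, max_threads);
    cfg.n_blocks = std::min(ceil_intdiv(n_elems, cfg.threads_per_block), max_blocks);
    return cfg;
}
```

When `n_blocks` is clamped by the cap, the kernels rely on their internal loops to stride over the remaining elements rather than assuming one element per thread.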
@@ -702,7 +705,7 @@ PyObject * CudaNdarray_SetStride(CudaNdarray * self, PyObject *args)
{
    int pos, stride;
    if (! PyArg_ParseTuple(args, "ii", &pos, &stride))
        return NULL;
    if ((pos < 0) || (pos >= self->nd))
    {
        PyErr_Format(PyExc_ValueError, "position argument out of legal range [0, %i)", self->nd);
@@ -720,7 +723,7 @@ PyObject * CudaNdarray_SetShapeI(CudaNdarray * self, PyObject *args)
{
    int pos, dim;
    if (! PyArg_ParseTuple(args, "ii", &pos, &dim))
        return NULL;
    if ((pos < 0) || (pos >= self->nd))
    {
        PyErr_Format(PyExc_ValueError, "position argument out of legal range [0, %i)", self->nd);
@@ -751,37 +754,37 @@ CudaNdarray_exp(CudaNdarray* self)
    }
    unsigned int threads_per_block = std::min(size, (unsigned int)NUM_VECTOR_OP_THREADS_PER_BLOCK);
    unsigned int n_blocks = std::min(ceil_intdiv(size,threads_per_block), (unsigned int)NUM_VECTOR_OP_BLOCKS);
    k_elemwise_unary_rowmajor_exp<<<n_blocks,threads_per_block>>>(size, self->nd, CudaNdarray_DEV_DIMS(self),
            CudaNdarray_DEV_DATA(self), CudaNdarray_DEV_STRIDES(self),
            CudaNdarray_DEV_DATA(rval), CudaNdarray_DEV_STRIDES(rval));
    //TODO: don't do this right away, do it when we need the result
    CNDA_THREAD_SYNC;
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        Py_DECREF(rval);
        PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "kExp", cudaGetErrorString(err));
        return NULL;
    }
    return (PyObject*)rval;
}
static PyMethodDef CudaNdarray_methods[] =
{
    {"__array__",
        (PyCFunction)CudaNdarray_CreateArrayObj, METH_NOARGS,
        "Copy from the device to a numpy ndarray"},
    {"__copy__",
        (PyCFunction)CudaNdarray_View, METH_NOARGS,
        "Create a shallow copy of this object. Used by the copy module."},
    {"__deepcopy__",
        (PyCFunction)CudaNdarray_DeepCopy, METH_O,
        "Create a copy of this object"},
    {"zeros",
        (PyCFunction)CudaNdarray_Zeros, METH_STATIC,
        "Create a new CudaNdarray with specified shape, filled with zeros."},
    {"copy",
        (PyCFunction)CudaNdarray_Copy, METH_NOARGS,
        "Create a copy of this object"},
    {"reduce_sum",
@@ -791,7 +794,7 @@ static PyMethodDef CudaNdarray_methods[] =
        (PyCFunction)CudaNdarray_exp, METH_NOARGS,
        "Return the exponential of all elements"},
    {"reshape",
        (PyCFunction)CudaNdarray_Reshape, METH_O,
        "Return a reshaped view (or copy) of this ndarray\n\
The required argument is a tuple of integers specifying the shape of the new ndarray."},
    {"view",
@@ -836,13 +839,13 @@ CudaNdarray_add(PyObject* py_self, PyObject * py_other)
    CudaNdarray * other = (CudaNdarray *)py_other;
    if(!CudaNdarray_is_c_contiguous(self) || !CudaNdarray_is_c_contiguous(other)){
        PyErr_SetString(PyExc_TypeError, "We have implemented only the c_contiguous version for now.");
        return NULL;
    }
    //standard elemwise size checks
    if (self->nd != other->nd)
    {
        PyErr_SetString(PyExc_TypeError, "CudaNdarray_add: need same number of dims");
        return NULL;
    }
    //standard elemwise dim checks
@@ -873,7 +876,7 @@ CudaNdarray_add(PyObject* py_self, PyObject * py_other)
        self->devdata, other->devdata, rval->devdata, size);
    CNDA_THREAD_SYNC;
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "kAdd", cudaGetErrorString(err));
        Py_DECREF(rval);
@@ -890,7 +893,7 @@ __global__ void name(const int d0, const int d1, const int d2,\
    for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x){\
        for (int i1 = blockIdx.y; i1 < d1; i1 += gridDim.y){\
            for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x){\
                F(a[i0*sA0 + i1*sA1 + i2*sA2], b[i0*sB0 + i1*sB1 + i2*sB2]); \
            }\
        }\
    }\
@@ -898,16 +901,16 @@ __global__ void name(const int d0, const int d1, const int d2,\
#define decl_k_elemwise_binary_inplace_rowmajor_4(name, F) \
__global__ void name(const int d0, const int d1, const int d2, const int d3,\
                     float* a, const int sA0, const int sA1,\
                     const int sA2, const int sA3,\
                     const float* b, const int sB0, const int sB1,\
                     const int sB2, const int sB3){\
    for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x){\
        for (int i1 = blockIdx.y; i1 < d1; i1 += gridDim.y){\
            for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x){\
                for (int i3 = threadIdx.y; i3 < d3; i3 += blockDim.y){\
                    F(a[i0*sA0 + i1*sA1 + i2*sA2 + i3*sA3], b[i0*sB0 + i1*sB1 + i2*sB2 + i3*sB3]); \
                }\
            }\
        }\
    }\
@@ -927,23 +930,23 @@ __global__ void k_iAdd_3(const int d0, const int d1, const int d2,\
    for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x){\
        for (int i1 = blockIdx.y; i1 < d1; i1 += gridDim.y){\
            for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x){\
                a[i0*sA0 + i1*sA1 + i2*sA2] += b[i0*sB0 + i1*sB1 + i2*sB2]; \
            }\
        }\
    }\
}
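In `k_iAdd_3`, each block/thread starts at its own index and strides by the grid/block size, so together the loops visit every (i0, i1, i2) exactly once regardless of the launch shape. Run serially, the pattern collapses to plain nested loops with strided addressing; a host-only analogue (a sketch, not the kernel itself):

```cpp
// Serial analogue of k_iAdd_3: in-place strided add over a 3-D array.
// On the GPU the loops start at blockIdx/threadIdx and stride by
// gridDim/blockDim; serially that reduces to ordinary loops.
void iadd_3(int d0, int d1, int d2,
            float* a, int sA0, int sA1, int sA2,
            const float* b, int sB0, int sB1, int sB2)
{
    for (int i0 = 0; i0 < d0; ++i0)
        for (int i1 = 0; i1 < d1; ++i1)
            for (int i2 = 0; i2 < d2; ++i2)
                a[i0*sA0 + i1*sA1 + i2*sA2] += b[i0*sB0 + i1*sB1 + i2*sB2];
}
```

Because addressing goes through explicit strides, the same loop nest works for non-contiguous operands; only the CudaNdarray_add path above insists on C-contiguity.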
__global__ void k_iAdd_4(const int d0, const int d1, const int d2, const int d3,\
                         float* a, const int sA0, const int sA1,\
                         const int sA2, const int sA3,\
                         const float* b, const int sB0, const int sB1,\
                         const int sB2, const int sB3){\
    for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x){\
        for (int i1 = blockIdx.y; i1 < d1; i1 += gridDim.y){\
            for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x){\
                for (int i3 = threadIdx.y; i3 < d3; i3 += blockDim.y){\
                    a[i0*sA0 + i1*sA1 + i2*sA2 + i3*sA3] += b[i0*sB0 + i1*sB1 + i2*sB2 + i3*sB3]; \
                }\
            }\
        }\
    }\
@@ -955,23 +958,23 @@ __global__ void k_iDiv_3(const int d0, const int d1, const int d2,\
    for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x){\
        for (int i1 = blockIdx.y; i1 < d1; i1 += gridDim.y){\
            for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x){\
                a[i0*sA0 + i1*sA1 + i2*sA2] /= b[i0*sB0 + i1*sB1 + i2*sB2]; \
            }\
        }\
    }\
}
__global__ void k_iDiv_4(const int d0, const int d1, const int d2, const int d3,\
                         float* a, const int sA0, const int sA1,\
                         const int sA2, const int sA3,\
                         const float* b, const int sB0, const int sB1,\
                         const int sB2, const int sB3){\
    for (int i0 = blockIdx.x; i0 < d0; i0 += gridDim.x){\
        for (int i1 = blockIdx.y; i1 < d1; i1 += gridDim.y){\
            for (int i2 = threadIdx.x; i2 < d2; i2 += blockDim.x){\
                for (int i3 = threadIdx.y; i3 < d3; i3 += blockDim.y){\
                    a[i0*sA0 + i1*sA1 + i2*sA2 + i3*sA3] /= b[i0*sB0 + i1*sB1 + i2*sB2 + i3*sB3]; \
                }\
            }\
        }\
    }\
@@ -1002,7 +1005,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
    //standard elemwise size checks
    if (self->nd != other->nd)
    {
        PyErr_Format(PyExc_TypeError, "CudaNdarray_inplace_add_div: need same number of dims. Got %d and %d", self->nd, other->nd);
        return NULL;
    }
    //standard elemwise dim checks
@@ -1017,19 +1020,19 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
        }
        size *= (unsigned int) CudaNdarray_HOST_DIMS(self)[i];
    }
    if(CudaNdarray_SIZE((CudaNdarray *)py_self)==0 && CudaNdarray_SIZE((CudaNdarray *)py_other)==0){
        Py_INCREF(py_self);
        return py_self;
    }
    void (*k_iop_3)(const int, const int, const int,
                    float*, const int, const int, const int,
                    const float*, const int, const int, const int);
    void (*k_iop_4)(const int, const int, const int, const int,
                    float*, const int, const int,
                    const int, const int,
                    const float*, const int, const int,
                    const int, const int);
    if(fct_nb == 0){
        k_iop_3 = k_iAdd_3;
        k_iop_4 = k_iAdd_4;
@@ -1037,7 +1040,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
        k_iop_3 = k_iDiv_3;
        k_iop_4 = k_iDiv_4;
    }
    switch(self->nd)
    {
        case 1:
@@ -1059,7 +1062,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
                CudaNdarray_HOST_STRIDES(other)[0]);
            CNDA_THREAD_SYNC;
            cudaError_t err = cudaGetLastError();
            if (cudaSuccess != err)
            {
                PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_iop_3", cudaGetErrorString(err));
                return NULL;
@@ -1075,7 +1078,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
            dim3 n_threads(
                std::min(CudaNdarray_HOST_DIMS(self)[1], NUM_VECTOR_OP_THREADS_PER_BLOCK)
            );
            k_iop_3<<<n_blocks, n_threads>>>(1,
                CudaNdarray_HOST_DIMS(self)[0],
                CudaNdarray_HOST_DIMS(self)[1],
                CudaNdarray_DEV_DATA(self),
@@ -1088,7 +1091,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
                CudaNdarray_HOST_STRIDES(other)[1]);
            CNDA_THREAD_SYNC;
            cudaError_t err = cudaGetLastError();
            if (cudaSuccess != err)
            {
                PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_iop_3", cudaGetErrorString(err));
                return NULL;
@@ -1106,7 +1109,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
            dim3 n_threads(
                std::min(CudaNdarray_HOST_DIMS(self)[2], NUM_VECTOR_OP_THREADS_PER_BLOCK)
            );
            k_iop_3<<<n_blocks, n_threads>>>(
                CudaNdarray_HOST_DIMS(self)[0],
                CudaNdarray_HOST_DIMS(self)[1],
                CudaNdarray_HOST_DIMS(self)[2],
@@ -1120,7 +1123,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
                CudaNdarray_HOST_STRIDES(other)[2]);
            CNDA_THREAD_SYNC;
            cudaError_t err = cudaGetLastError();
            if (cudaSuccess != err)
            {
                PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_iop_3", cudaGetErrorString(err));
                return NULL;
@@ -1138,7 +1141,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
            dim3 n_threads(
                std::min(CudaNdarray_HOST_DIMS(self)[2], NUM_VECTOR_OP_THREADS_PER_BLOCK)
            );
            k_iop_4<<<n_blocks, n_threads>>>(
                CudaNdarray_HOST_DIMS(self)[0],
                CudaNdarray_HOST_DIMS(self)[1],
                CudaNdarray_HOST_DIMS(self)[2],
@@ -1155,7 +1158,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
                CudaNdarray_HOST_STRIDES(other)[3]);
            CNDA_THREAD_SYNC;
            cudaError_t err = cudaGetLastError();
            if (cudaSuccess != err)
            {
                PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_iop_4", cudaGetErrorString(err));
                return NULL;
@@ -1192,7 +1195,7 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
                CudaNdarray_HOST_STRIDES(other)[4]);
            CNDA_THREAD_SYNC;
            cudaError_t err = cudaGetLastError();
            if (cudaSuccess != err)
            {
                PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "k_iop_4", cudaGetErrorString(err));
                return NULL;
@@ -1214,7 +1217,8 @@ CudaNdarray_inplace_add_div(PyObject* py_self, PyObject * py_other, int fct_nb)
static PyObject *
CudaNdarray_inplace_add(PyObject* py_self, PyObject * py_other){
    CudaNdarray_inplace_add_div(py_self, py_other, 0);
    //We should not increment the refcount, as this is an inplace operation
    //and with this syntax there is no additional reference created!
    return py_self;
}
@@ -1225,7 +1229,8 @@ CudaNdarray_inplace_add(PyObject* py_self, PyObject * py_other){
static PyObject *
CudaNdarray_inplace_div(PyObject* py_self, PyObject * py_other){
    CudaNdarray_inplace_add_div(py_self, py_other, 1);
    //We should not increment the refcount, as this is an inplace operation
    //and with this syntax there is no additional reference created!
    return py_self;
}
@@ -1286,7 +1291,7 @@ static PyNumberMethods CudaNdarrayNumberMethods =
/////////////////////
// Will be called by __len__ in Python
static Py_ssize_t
CudaNdarray_len(PyObject * py_self)
{
    CudaNdarray * self = (CudaNdarray*) py_self;
@@ -1423,7 +1428,7 @@ CudaNdarray_Subscript(PyObject * py_self, PyObject * key)
{
    //elements of the tuple can be either integers or slices
    //the dimensionality of the view we will return is diminished for each integer in the tuple
    if (PyTuple_Size(key) > self->nd)
    {
        PyErr_SetString(PyExc_IndexError, "index error");
@@ -1434,9 +1439,9 @@ CudaNdarray_Subscript(PyObject * py_self, PyObject * key)
    int rval_nd = self->nd;
    for (int d = 0; d < PyTuple_Size(key); ++d)
    {
        //On some platforms PyInt_Check(<type 'numpy.int64'>) returns true, on others it returns false.
        //So we use PyArray_IsAnyScalar, which should cover everything.
        rval_nd -= PyArray_IsAnyScalar(PyTuple_GetItem(key, d));
    }
    //allocate our subtensor view
@@ -1452,7 +1457,7 @@ CudaNdarray_Subscript(PyObject * py_self, PyObject * key)
        return NULL;
    }
    // rval_d will refer to the current dimension in the rval.
    // It will not be incremented for integer keys, but will be incremented for slice
    // keys
    int rval_d = 0;
@@ -1461,7 +1466,7 @@ CudaNdarray_Subscript(PyObject * py_self, PyObject * key)
    {
        // keys can be shorter than self->nd.
        // when that happens, it means that the remaining dimensions are "full slices"
        if (d >= PyTuple_Size(key))
        {
            CudaNdarray_set_stride(rval, rval_d, CudaNdarray_HOST_STRIDES(self)[d]);
            CudaNdarray_set_dim(rval, rval_d, CudaNdarray_HOST_DIMS(self)[d]);
@@ -1550,53 +1555,53 @@ CudaNdarray_setitem(PyObject *o, PyObject *key, PyObject *v)
    if(CudaNdarray_Check(o) && PyArray_Check(v)){
        // We try to copy directly into this CudaNdarray from the ndarray
        CudaNdarray* rval = (CudaNdarray*)CudaNdarray_Subscript(o, key);
        int typenum = PyArray_TYPE(v);
        if(!rval){
            // CudaNdarray_Subscript failed and set the error msg.
            Py_XDECREF(rval);
            return -1;
        }
        if (typenum != REAL_TYPENUM){
            PyErr_SetString(PyExc_TypeError, "CudaNdarray.__setitem__: can only copy from float32 arrays");
            Py_XDECREF(rval);
            return -1;
        }
        if(! CudaNdarray_is_c_contiguous(rval)){
            PyErr_SetString(PyExc_NotImplementedError, "CudaNdarray.__setitem__: when the new value is an ndarray, the destination of the copy must be c contiguous.");
            Py_XDECREF(rval);
            return -1;
        }
        if(rval->nd != ((PyArrayObject*)v)->nd){
            PyErr_Format(PyExc_NotImplementedError, "CudaNdarray.__setitem__: need same number of dims. destination nd=%d, source nd=%d. No broadcasting implemented.",
                         rval->nd, ((PyArrayObject*)v)->nd);
            Py_XDECREF(rval);
            return -1;
        }
        for(int i=0 ; i<rval->nd ; i++){
            if(CudaNdarray_HOST_DIMS(rval)[i] != ((PyArrayObject*)v)->dimensions[i]){
                PyErr_Format(PyExc_ValueError, "CudaNdarray.__setitem__: need same dimensions for dim %d, destination=%d, source=%ld",
                             i,
                             CudaNdarray_HOST_DIMS(rval)[i],
                             (long int)(((PyArrayObject*)v)->dimensions[i]));
                Py_XDECREF(rval);
                return -1;
            }
        }
        PyArrayObject * py_v = (PyArrayObject*)PyArray_ContiguousFromAny((PyObject*)v, typenum,
                                                                         rval->nd, rval->nd);
        cublasSetVector(PyArray_SIZE(py_v),
                        sizeof(real),
                        PyArray_DATA(py_v), 1,
                        rval->devdata, 1);
        CNDA_THREAD_SYNC;
        Py_XDECREF(py_v);
        Py_XDECREF(rval);
        if (CUBLAS_STATUS_SUCCESS != cublasGetError()){
            PyErr_SetString(PyExc_RuntimeError, "CudaNdarray.__setitem__: error copying ndarray data to device memory");
            return -1;
        }
        return 0;
    }
@@ -1742,29 +1747,29 @@ CudaNdarray_get_dtype(CudaNdarray *self, void *closure)
}
static PyGetSetDef CudaNdarray_getset[] = {
    {"shape",
        (getter)CudaNdarray_get_shape,
        (setter)CudaNdarray_set_shape,
        "shape of this ndarray (tuple)",
        NULL},
    {"_strides",
        (getter)CudaNdarray_get_strides,
        (setter)CudaNdarray_set_strides,
        "data pointer strides (in elements)",
        NULL},
    //gpudata is needed to allow calling pycuda functions with CudaNdarray input.
    {"gpudata",
        (getter)CudaNdarray_get_dev_data,
        NULL, //(setter)CudaNdarray_set_dev_data,
        "device data pointer",
        NULL},
    {"_dev_data",
        (getter)CudaNdarray_get_dev_data,
        (setter)CudaNdarray_set_dev_data,
        "device data pointer",
        NULL},
    {"dtype",
        (getter)CudaNdarray_get_dtype,
        NULL,
        "The dtype of the elements. Currently always float32.",
        NULL},
@@ -1785,7 +1790,7 @@ static PyGetSetDef CudaNdarray_getset[] = {
static PyTypeObject CudaNdarrayType =
{
    PyObject_HEAD_INIT(NULL)
    0, /*ob_size*/
@@ -1828,6 +1833,33 @@ static PyTypeObject CudaNdarrayType =
    CudaNdarray_new, /* tp_new */
};
static __global__ void get_gpu_ptr_size(int* dst)
{
    dst[0] = sizeof(float*);
}

PyObject *
CudaNdarray_ptr_int_size(PyObject* _unused, PyObject* args)
{
    int *gpu_data = (int*)device_malloc(sizeof(int));
    if(gpu_data == NULL){
        return PyErr_Format(PyExc_MemoryError,
                "CudaNdarray_ptr_int_size: Can't allocate memory on the gpu.");
    }
    get_gpu_ptr_size<<<1,1>>>(gpu_data);
    if (cudaSuccess != cudaGetLastError()){
        device_free(gpu_data);
        return PyErr_Format(PyExc_RuntimeError,
                "CudaNdarray_ptr_int_size: error when calling the gpu code.");
    }
    int gpu_ptr_size = -1;
    cublasGetVector(1, sizeof(int), gpu_data, 1, &gpu_ptr_size, 1);
    device_free(gpu_data);
    if (CUBLAS_STATUS_SUCCESS != cublasGetError()){
        PyErr_SetString(PyExc_RuntimeError, "error copying data to/from gpu memory");
        return NULL;
    }
    return Py_BuildValue("iii", gpu_ptr_size, sizeof(float*), sizeof(int));
}
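The new kernel reports the device-side pointer width, which the Python optimizer then compares against host-side sizes. As a hedged, illustrative host-side analogue (not part of this commit), the CPU halves of that tuple can be obtained with `ctypes`:

```python
import ctypes

# Host-side analogue of the (cpu ptr size, int size) entries of the tuple
# returned by ptr_int_size(); illustrative sketch only.
cpu_ptr_size = ctypes.sizeof(ctypes.c_void_p)  # sizeof(float*) on the host
int_size = ctypes.sizeof(ctypes.c_int)         # sizeof(int) on the host

assert cpu_ptr_size in (4, 8)  # 32- vs 64-bit host
assert int_size == 4           # `int` is 4 bytes on common ABIs
```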
// Initialize the gpu.
// Takes one optional parameter, the device number.
@@ -1842,7 +1874,7 @@ CudaNdarray_gpu_init(PyObject* _unused, PyObject* args)
    int card_number_provided = 1;
    PyArg_ParseTuple(args, "|i", &card_nb); // if we're given something wildly invalid, this will throw a TypeError
    if(PyTuple_Size(args) == 0) {
        card_number_provided = 0;
        card_nb = 0;
@@ -1855,11 +1887,11 @@ CudaNdarray_gpu_init(PyObject* _unused, PyObject* args)
                "Unable to get the number of gpus available: %s",
                cudaGetErrorString(cudaGetLastError()));
    }
    // as soon as the first successful call to a cuda* function is made, a
    // gpu context has been created
    g_gpu_context_active = 1;
    if(deviceCount <= 0) {
        return PyErr_Format(PyExc_EnvironmentError,
                "Can't use the GPU, no devices support CUDA");
@@ -1917,7 +1949,7 @@ CudaNdarray_Dot(PyObject* _unused, PyObject* args)
    PyObject * rval = NULL;
    if (! PyArg_ParseTuple(args, "OO", &l, &r))
        return NULL;
    if (!CudaNdarray_Check(l) || !CudaNdarray_Check(r))
    {
@@ -1958,7 +1990,7 @@ CudaNdarray_Dot(PyObject* _unused, PyObject* args)
        return NULL;
    }
static PyObject *
filter(PyObject* __unsed_self, PyObject *args) // args = (data, broadcastable, strict)
{
    /*
@@ -1989,7 +2021,7 @@ filter(PyObject* __unsed_self, PyObject *args) // args = (data, broadcastable, s
    if (strict || CudaNdarray_Check(py_data))
    {
        //TODO: support non-strict "casting" from a vt to the broadcastable/type/size that we need.
        if (!CudaNdarray_Check(py_data))
        {
            Py_DECREF(py_data);
            Py_DECREF(broadcastable);
@@ -2065,7 +2097,8 @@ static PyMethodDef module_methods[] = {
    {"dot", CudaNdarray_Dot, METH_VARARGS, "Returns the matrix product of two CudaNdarray arguments."},
    {"gpu_init", CudaNdarray_gpu_init, METH_VARARGS, "Select the gpu card to use; also usable to test whether CUDA is available."},
    {"gpu_shutdown", CudaNdarray_gpu_shutdown, METH_VARARGS, "Shut down the gpu."},
    {"ptr_int_size", CudaNdarray_ptr_int_size, METH_VARARGS, "Return a tuple with the size of a gpu pointer, a cpu pointer and an int, in bytes."},
    {"filter", filter, METH_VARARGS, "filter(obj, broadcastable, strict, storage) returns a CudaNdarray initialized to obj if it matches the constraints of broadcastable. strict=True prevents any numeric casting. If storage is a CudaNdarray it may be overwritten and used as the return value."},
    {"outstanding_mallocs", outstanding_mallocs, METH_VARARGS, "how many more mallocs have been called than frees"},
    {NULL, NULL, NULL, NULL} /* Sentinel */
};
@@ -2074,7 +2107,7 @@ static PyMethodDef module_methods[] = {
#define PyMODINIT_FUNC void
#endif
PyMODINIT_FUNC
initcuda_ndarray(void)
{
    import_array();
@@ -2107,10 +2140,10 @@ initcuda_ndarray(void)
        int deviceId = 0; // TODO: what number goes here?
        cudaSetDevice(deviceId);
        cudaError_t err = cudaGetLastError();
        if( cudaSuccess != err)
        {
            std::cerr << "Error in SetDevice:" << cudaGetErrorString(err) << "\n";
        }
    }
}
@@ -2121,29 +2154,29 @@ initcuda_ndarray(void)
//
//////////////////////////////////////
int
CudaNdarray_Check(const PyObject * ob)
{
    //TODO: doesn't work with inheritance
    return CudaNdarray_CheckExact(ob);
}
int
CudaNdarray_CheckExact(const PyObject * ob)
{
    return ((ob->ob_type == &CudaNdarrayType) ? 1 : 0);
}
PyObject *
CudaNdarray_New(int nd)
{
    CudaNdarray *self = (CudaNdarray *)CudaNdarrayType.tp_alloc(&CudaNdarrayType, 0);
    if (self == NULL)
    {
        PyErr_SetString(PyExc_RuntimeError, "CudaNdarray_new_null failed to allocate self");
        return NULL;
    }
    CudaNdarray_null_init(self);
    if (nd == 0)
    {
        self->nd = 0;
@@ -2168,8 +2201,8 @@ CudaNdarray_New(int nd)
//
//////////////////////////////
int
cublas_init()
{
    cublasInit();
    if (CUBLAS_STATUS_SUCCESS != cublasGetError())
@@ -2179,8 +2212,8 @@ cublas_init()
    }
    return 0;
}
int
cublas_shutdown()
{
    cublasShutdown();
    if (CUBLAS_STATUS_SUCCESS != cublasGetError())
@@ -2191,7 +2224,7 @@ cublas_shutdown()
    return 0;
}
int
CudaNdarray_CopyFromArray(CudaNdarray * self, PyArrayObject*obj)
{
    int err = CudaNdarray_alloc_contiguous(self, obj->nd, obj->dimensions);
@@ -2211,7 +2244,7 @@ CudaNdarray_CopyFromArray(CudaNdarray * self, PyArrayObject*obj)
        return -1;
    }
    cublasSetVector(PyArray_SIZE(py_src),
            sizeof(real),
            PyArray_DATA(py_src), 1,
            self->devdata, 1);
    CNDA_THREAD_SYNC;
@@ -2224,7 +2257,7 @@ CudaNdarray_CopyFromArray(CudaNdarray * self, PyArrayObject*obj)
    Py_DECREF(py_src);
    return 0;
}
bool
CudaNdarray_is_c_contiguous(const CudaNdarray * self)
{
    bool c_contiguous = true;
@@ -2313,15 +2346,15 @@ int CudaNdarray_CopyFromCudaNdarray(CudaNdarray * self, CudaNdarray * other, boo
    unsigned int size_source = 1;
    for (int i = 0; i < self->nd; ++i)
    {
        if ((CudaNdarray_HOST_DIMS(self)[i] != CudaNdarray_HOST_DIMS(other)[i])
            && (1!=CudaNdarray_HOST_DIMS(other)[i] || !unbroadcast) )
        {
            PyErr_Format(PyExc_ValueError, "need same dimensions for dim %d, destination=%d, source=%d",
                i, CudaNdarray_HOST_DIMS(self)[i], CudaNdarray_HOST_DIMS(other)[i]);
            return -1;
        }
        size *= (unsigned int) CudaNdarray_HOST_DIMS(self)[i];
        size_source *= (unsigned int) CudaNdarray_HOST_DIMS(other)[i];
    }
    if (0 == size)
    {
@@ -2365,11 +2398,11 @@ int CudaNdarray_CopyFromCudaNdarray(CudaNdarray * self, CudaNdarray * other, boo
                CudaNdarray_DEV_DATA(self), CudaNdarray_HOST_STRIDES(self)[0]);
            CNDA_THREAD_SYNC;
            cudaError_t err = cudaGetLastError();
            if( cudaSuccess != err)
            {
                PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s. (n_blocks=%i, n_threads_per_block=%i)\n", "k_copy_1d", cudaGetErrorString(err), n_blocks, n_threads);
                return -1;
            }
        }; break;
        default:
        {
@@ -2378,25 +2411,25 @@ int CudaNdarray_CopyFromCudaNdarray(CudaNdarray * self, CudaNdarray * other, boo
            // call worker routine
            unsigned int n_blocks = std::min(size, (unsigned int)NUM_VECTOR_OP_BLOCKS);
            unsigned int threads_per_block = std::min(ceil_intdiv(size, n_blocks), (unsigned int)NUM_VECTOR_OP_THREADS_PER_BLOCK);
            CudaNdarray * cuda_dims = other;
            if(unbroadcast)
                cuda_dims = self;
            //copy from other into self
            k_elemwise_unary_rowmajor_copy<<<n_blocks, threads_per_block>>>(
                size,
                (unsigned int)other->nd,
                (const int *)CudaNdarray_DEV_DIMS(cuda_dims),
                (const float*)CudaNdarray_DEV_DATA(other), (const int *)CudaNdarray_DEV_STRIDES(other),
                CudaNdarray_DEV_DATA(self), (const int *)CudaNdarray_DEV_STRIDES(self));
            CNDA_THREAD_SYNC;
            cudaError_t err = cudaGetLastError();
            if( cudaSuccess != err)
            {
                //fprint_CudaNdarray(stderr, self);
                //fprint_CudaNdarray(stderr, other);
                PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s. (n_blocks=%i, n_threads_per_block=%i)\n", "k_elemwise_unary_rowmajor_copy", cudaGetErrorString(err), n_blocks, threads_per_block);
                return -1;
            }
        }
    };
    return 0;
@@ -2411,7 +2444,7 @@ int CudaNdarray_gemm(float alpha, const CudaNdarray * A, const CudaNdarray * B,
    if ((CudaNdarray_HOST_DIMS(A)[1] != CudaNdarray_HOST_DIMS(B)[0])
        || (CudaNdarray_HOST_DIMS(A)[0] != CudaNdarray_HOST_DIMS(C)[0])
        || (CudaNdarray_HOST_DIMS(B)[1] != CudaNdarray_HOST_DIMS(C)[1]))
    {
        PyErr_Format(PyExc_ValueError, "dimension mismatch in args to gemm (%i,%i)x(%i,%i)->(%i,%i)",
            CudaNdarray_HOST_DIMS(A)[0],
            CudaNdarray_HOST_DIMS(A)[1],
@@ -2419,7 +2452,7 @@ int CudaNdarray_gemm(float alpha, const CudaNdarray * A, const CudaNdarray * B,
            CudaNdarray_HOST_DIMS(B)[1],
            CudaNdarray_HOST_DIMS(C)[0],
            CudaNdarray_HOST_DIMS(C)[1]);
        return -1;
    }
    // if a matrix has non-unit size and non-unit stride in both directions, we can't operate in-place
@@ -2444,21 +2477,21 @@ int CudaNdarray_gemm(float alpha, const CudaNdarray * A, const CudaNdarray * B,
    int unit = 0;
    if (CudaNdarray_HOST_STRIDES(A)[1] == 1 || CudaNdarray_HOST_STRIDES(A)[1] == 0) {
        unit |= (0x0 << 8);
    } else if (CudaNdarray_HOST_STRIDES(A)[0] == 1 || CudaNdarray_HOST_STRIDES(A)[0] == 0) {
        unit |= (0x1 << 8);
    } else {
        unit |= (0x2 << 8);
    }
    if (CudaNdarray_HOST_STRIDES(B)[1] == 1 || CudaNdarray_HOST_STRIDES(B)[1] == 0) {
        unit |= (0x0 << 4);
    } else if (CudaNdarray_HOST_STRIDES(B)[0] == 1 || CudaNdarray_HOST_STRIDES(B)[0] == 0) {
        unit |= (0x1 << 4);
    } else {
        unit |= (0x2 << 4);
    }
    if (CudaNdarray_HOST_STRIDES(C)[1] == 1 || CudaNdarray_HOST_STRIDES(C)[1] == 0) {
        unit |= (0x0 << 0);
    } else if (CudaNdarray_HOST_STRIDES(C)[0] == 1 || CudaNdarray_HOST_STRIDES(C)[0] == 0) {
        unit |= (0x1 << 0);
    } else {
        unit |= (0x2 << 0);
@@ -2496,7 +2529,7 @@ int CudaNdarray_gemm(float alpha, const CudaNdarray * A, const CudaNdarray * B,
    } else { \
        PyErr_SetString(PyExc_NotImplementedError, "negative stride to sGemm");\
        return -1; \
    }
    switch(unit)
    {
@@ -2588,7 +2621,7 @@ static __global__ void kernel_reduce_sum(const unsigned int size_z,
        unsigned int pos_d;
        if (log2_dims_a[d]==-1) //TODO: when things are working, use this switch
        {
            // this branch is not preferred,
            // because the manual said that integer mod and div operations are slow on gpu
            pos_d = (ii % dims_a[d]);
            ii = (ii / dims_a[d]);
@@ -2604,14 +2637,14 @@ static __global__ void kernel_reduce_sum(const unsigned int size_z,
    }
    // now we've got pointers a_data_i and z_data_i into element 0 of the slice over which we are reducing
    // do a similar loop
    float sum = 0.0f;
    switch(n_reduce_dims)
    {
        case 0:
        {
            sum = a_data_i[0];
        }
        break;
        case 1:
        {
@@ -2644,7 +2677,7 @@ static __global__ void kernel_reduce_sum(const unsigned int size_z,
                a_data_ri += stride0;
            }
        }
        };
        break;
        default:
        {
@@ -2662,7 +2695,7 @@ static __global__ void kernel_reduce_sum(const unsigned int size_z,
            {
                if (log2_dims_a[rd]==-1)
                {
                    // this branch is not preferred,
                    // because the manual said that integer mod and div operations are slow on gpu
                    pos_d = (reduce_ii % dims_a[rd]);
                    reduce_ii = (reduce_ii / dims_a[rd]);
@@ -2736,7 +2769,7 @@ static __global__ void kernel_reduce_sum_1011(
 * Dimensions in which self has size 1 and A has size > 1 are considered summing dimensions
 * Dimensions in which self has size > 1 and A has size > 1 are considered non-summing dimensions, and in this case their sizes must be equal.
 */
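The reduction rule in the comment above (sum over every dimension where the output has size 1 but the input does not, keep the rest) has the same semantics as a NumPy `keepdims` sum. A minimal sketch of that rule, as a host-side reference (not part of this commit):

```python
import numpy as np

def reduce_sum_like(a, out_shape):
    """Sum `a` over every axis where out_shape has size 1 but `a` does not,
    mimicking the CudaNdarray_reduce_sum output-shape convention."""
    axes = tuple(i for i, (s, d) in enumerate(zip(out_shape, a.shape))
                 if s == 1 and d > 1)
    return a.sum(axis=axes, keepdims=True)

a = np.arange(2 * 3 * 4, dtype='float32').reshape(2, 3, 4)
z = reduce_sum_like(a, (2, 1, 4))  # axis 1 is the summing dimension
assert z.shape == (2, 1, 4)
assert np.allclose(z, a.sum(axis=1, keepdims=True))
```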
int
CudaNdarray_reduce_sum(CudaNdarray * self, CudaNdarray * A)
{
    int verbose = 0;
@@ -2799,7 +2832,7 @@ CudaNdarray_reduce_sum(CudaNdarray * self, CudaNdarray * A)
    int n_threads_per_block = std::min(n_summations,
            NUM_VECTOR_OP_THREADS_PER_BLOCK);
    int n_blocks = std::min(ceil_intdiv(n_summations, n_threads_per_block),
            NUM_VECTOR_OP_BLOCKS);
    int n_structure_cache = self->nd * 4 * sizeof(int);
@@ -2820,26 +2853,26 @@ CudaNdarray_reduce_sum(CudaNdarray * self, CudaNdarray * A)
            CudaNdarray_DEV_DATA(self));
    CNDA_THREAD_SYNC;
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        PyErr_Format(PyExc_RuntimeError, "Cuda error: %s: %s.\n", "kernel_reduce_sum", cudaGetErrorString(err));
        return -1;
    }
    return 0;
}
int
CudaNdarray_reduce_prod(CudaNdarray * self, const CudaNdarray * A)
{
    PyErr_SetString(PyExc_NotImplementedError, "");
    return -1;
}
int
CudaNdarray_reduce_min(CudaNdarray * self, const CudaNdarray * A)
{
    PyErr_SetString(PyExc_NotImplementedError, "");
    return -1;
}
int
CudaNdarray_reduce_max(CudaNdarray * self, const CudaNdarray * A)
{
    PyErr_SetString(PyExc_NotImplementedError, "");
@@ -2855,7 +2888,7 @@ CudaNdarray_reduce_max(CudaNdarray * self, const CudaNdarray * A)
 * For example, if CudaNdarray_HOST_DIMS(self) == [4, 5, 1, 6], and pattern = [0, 3, -1, -1, 1], then CudaNdarray_HOST_DIMS(self) would be modified to become:
 * [4, 6, 1, 1, 5] (we dropped the original dim[2]==1, and inserted two singleton dimensions with the -1s).
 */
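The pattern semantics described above (indices pick existing dimensions, `-1` inserts a broadcastable singleton, and only size-1 dimensions may be dropped) can be reproduced with NumPy as a quick reference; this sketch is illustrative, not the actual C implementation:

```python
import numpy as np

def dimshuffle(a, pattern):
    """Reorder the dims of `a` by `pattern`; -1 inserts a new singleton axis.
    Dropped input dims must have size 1 (i.e. be broadcastable)."""
    kept = [p for p in pattern if p != -1]
    dropped = [d for d in range(a.ndim) if d not in kept]
    assert all(a.shape[d] == 1 for d in dropped)  # only size-1 dims may vanish
    out_shape = [1 if p == -1 else a.shape[p] for p in pattern]
    # Move the kept axes to the front in pattern order (size-1 leftovers trail),
    # then reshape to insert the singleton axes where the -1s sit.
    return a.transpose(kept + dropped).reshape(out_shape)

a = np.zeros((4, 5, 1, 6))
b = dimshuffle(a, [0, 3, -1, -1, 1])
assert b.shape == (4, 6, 1, 1, 5)  # matches the example in the comment
```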
int
CudaNdarray_dimshuffle(CudaNdarray * self, unsigned int len, const int * pattern)
{
    //TODO: pass a workspace pointer to avoid the internal malloc
...
@@ -56,7 +56,7 @@ class InputToGpuOptimizer(Optimizer):
                new_input = host_from_gpu(gpu_from_host(input))
                if new_input.type == input.type:
                    env.replace_validate(input, new_input, "InputToGpuOptimizer")
            except Exception, e:
                # As we currently only support float32, this can fail.
                # Using try/except means we won't need
@@ -113,9 +113,12 @@ def local_gpu_elemwise_0(node):
        else:
            return False
        gpu_elemwise = split_huge_add_or_mul(gpu_elemwise.owner)
        if not gpu_elemwise:
            return False
        if max_inputs_to_GpuElemwise(node) < len(gpu_elemwise.inputs):
            return False
        return [host_from_gpu(gpu_elemwise.outputs[0])]
@register_opt()
@local_optimizer([])
def local_gpu_elemwise_1(node):
@@ -130,8 +133,10 @@ def local_gpu_elemwise_1(node):
        new_op = GpuElemwise(elemwise_node.op.scalar_op)
        if all([i.dtype == 'float32' for i in elemwise_node.inputs]):
            gpu_elemwise = new_op(*[gpu_from_host(i) for i in elemwise_node.inputs])
            gpu_elemwise = split_huge_add_or_mul(gpu_elemwise.owner)
            if not gpu_elemwise:
                return False
            return [gpu_elemwise.outputs[0]]
    return False

@register_opt()
@@ -730,24 +735,35 @@ optdb.register('InplaceGpuBlasOpt',
                   max_use_ratio=5),
               70.0, 'fast_run', 'inplace')
gpu_ptr_size = 8
cpu_ptr_size = 8
int_size = 8
try:
    # Returns (gpu ptr size, cpu ptr size, int size)
    t = cuda_ndarray.cuda_ndarray.ptr_int_size()
    gpu_ptr_size, cpu_ptr_size, int_size = t
except Exception, e:
    _logger.warning(("OPTIMIZATION WARNING: "
                     "Got the following error, but we can ignore it. "
                     "This could cause fewer GpuElemwise to be fused together.\n"
                     "%s") % e)
def max_inputs_to_GpuElemwise(node):
    """
    Return the maximum number of inputs that this Apply node to a GpuElemwise
    can accept.

    This is needed because there is currently a limit of 256 bytes of
    parameters for a gpu function. This measures the number of bytes of
    parameters we put in our gpu function and computes the maximum number of
    inputs that respect the 256 bytes limit.
    """
    argument_limit = 232  # some bytes are used for block and thread coords etc.
    ndim = node.inputs[0].type.ndim
    size_param_mandatory = int_size  # for numels
    size_param_mandatory += int_size * ndim  # for the shape
    size_param_mandatory += sum((gpu_ptr_size + int_size * ndim) for i in node.outputs)

    nb_bytes_avail = argument_limit - size_param_mandatory
    nb_bytes_per_inputs = (ndim * int_size) + gpu_ptr_size
    max_nb_inputs = nb_bytes_avail // nb_bytes_per_inputs
    return max_nb_inputs
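The byte budget above can be checked in isolation. This standalone sketch repeats the same arithmetic without a Theano node (the sizes and the single-output assumption mirror the module-level defaults; it is an illustration, not the shipped function):

```python
def max_inputs(ndim, n_outputs=1, gpu_ptr_size=8, int_size=8, argument_limit=232):
    """How many inputs fit in the gpu kernel's parameter space.
    Mirrors max_inputs_to_GpuElemwise: numel and the shape are mandatory,
    and each output/input costs a pointer plus one stride per dimension."""
    mandatory = int_size                                   # numel
    mandatory += int_size * ndim                           # shape
    mandatory += n_outputs * (gpu_ptr_size + int_size * ndim)
    avail = argument_limit - mandatory
    per_input = gpu_ptr_size + int_size * ndim
    return avail // per_input

# The two shapes exercised by test_huge_elemwise_fusion:
assert max_inputs(5) == 2    # 5-d inputs: only 2 fit per GpuElemwise
assert max_inputs(15) <= 1   # 15-d inputs: can't even fuse 2, stays on cpu
```

This is also why `split_huge_add_or_mul` below bails out (returns False) when the maximum drops to 1 or less while the node has several inputs.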
def split_huge_add_or_mul(node):
@@ -762,6 +778,8 @@ def split_huge_add_or_mul(node):
    """
    if node.op.scalar_op in (scal.add, scal.mul):
        max_nb_inputs = max_inputs_to_GpuElemwise(node)
        if max_nb_inputs <= 1 and len(node.inputs) > 1:
            return False
        while len(node.inputs) > max_nb_inputs:
            inner_op = []
            for i in range(0, len(node.inputs), max_nb_inputs):
...
@@ -161,8 +161,9 @@ def test_huge_elemwise_fusion():
    in case there are too many inputs, which would make it bust the 256
    bytes limit.
    """
    shape = (2, 3, 4, 5, 6)
    ttype = tensor.tensor(dtype='float32', broadcastable=(False,)*len(shape))
    vars = [tensor.tanh(ttype) for x in range(10)]
    f = pfunc(vars, [vars[0]-vars[1]-vars[2]-vars[3]-vars[4]-vars[5]-vars[6]], mode=mode_with_gpu)
    topo = f.maker.env.toposort()
    #theano.printing.debugprint(f)
@@ -170,12 +171,29 @@ def test_huge_elemwise_fusion():
    #    print >> sys.stdout, i, node
    assert len(topo) == 10
    assert sum([isinstance(node.op, cuda.GpuElemwise) for node in topo]) == 2
    assert isinstance(topo[7].op.scalar_op, theano.scalar.basic.Sub)
    assert isinstance(topo[8].op.scalar_op, theano.scalar.basic.Composite)
    #let debugmode catch errors
    gen = lambda: theano._asarray(numpy.random.rand(*shape), dtype='float32')
    f(gen(),gen(),gen(),gen(),gen(),gen(),gen(),gen(),gen(),gen())

    # Test the case where we can't put the computation on the gpu: there are
    # too many dimensions in the input to allow even 2 inputs to the op.
    shape = (1, 2, 3, 4, 5, 6, 7, 2, 2, 3, 2, 1, 2, 2, 2,)
    ttype = tensor.tensor(dtype='float32', broadcastable=(False,)*len(shape))
    vars = [tensor.tanh(ttype) for x in range(10)]
    f = pfunc(vars, [vars[0]-vars[1]-vars[2]-vars[3]-vars[4]-vars[5]-vars[6]], mode=mode_with_gpu)
    topo = f.maker.env.toposort()
    #theano.printing.debugprint(f)
    assert len(topo) == 1
    assert sum([isinstance(node.op, cuda.GpuElemwise) for node in topo]) == 0
    assert sum([isinstance(node.op, tensor.Elemwise) for node in topo]) == 1
    #let debugmode catch errors
    gen = lambda: theano._asarray(numpy.random.rand(*shape), dtype='float32')
    f(gen(),gen(),gen(),gen(),gen(),gen(),gen(),gen(),gen(),gen())

def test_elemwise_fusion():
    """Test that the GpuElemwise fusion works correctly."""
    shape = (3, 4)
...
@@ -10,6 +10,7 @@ from theano import scalar as scal
try:
    # We must do those imports to be able to create the full doc when nvcc
    # is not available
    import cuda_ndarray.cuda_ndarray as cuda
    from theano.sandbox.cuda.nvcc_compiler import nvcc_module_compile_str
    import cuda_ndarray
...
...@@ -10,6 +10,7 @@ from theano.compile import SharedVariable ...@@ -10,6 +10,7 @@ from theano.compile import SharedVariable
from theano.sandbox.cuda.type import CudaNdarrayType from theano.sandbox.cuda.type import CudaNdarrayType
try: try:
# We must do those import to be able to create the full doc when nvcc # We must do those import to be able to create the full doc when nvcc
# is not available
from theano.sandbox.cuda import filter as type_support_filter from theano.sandbox.cuda import filter as type_support_filter
from theano.sandbox.cuda.basic_ops import HostFromGpu, GpuFromHost from theano.sandbox.cuda.basic_ops import HostFromGpu, GpuFromHost
except ImportError: except ImportError:
...
...@@ -74,7 +74,7 @@ class Scalar(Type):
        py_type = self.dtype_specs()[0]
        if strict and not isinstance(data, py_type):
            raise TypeError("%s expected a %s, got %s of type %s" % (self, py_type, data,
                type(data)),
                data)
        try:
            converted_data = py_type(data)
...@@ -160,11 +160,11 @@ class Scalar(Type):
        return """
        %(name)s = 0;
        """ % locals()
    def c_extract(self, name, sub):
        specs = self.dtype_specs()
        #TODO: This is the wrong code, but we don't know what to change it to.
        # For example, a numpy.uint8 is not a PyInt, so PyInt_Check
        # is simply the wrong function to
        # call.
        # Look at PyArrayScalar api for how to cast to/from PyArrayScalar objects.
...@@ -183,7 +183,7 @@ class Scalar(Type):
                       dtype = specs[1],
                       check = specs[2],
                       conv = specs[3])
    def c_sync(self, name, sub):
        specs = self.dtype_specs()
        return """
...@@ -371,7 +371,7 @@ class _scalar_py_operators:
    #def __complex__(self): return AsComplex(self).out
    #BITWISE
    def __invert__(self): return invert(self)
    def __and__(self,other): return and_(self, other)
    def __or__(self,other): return or_(self, other)
    def __xor__(self,other): return xor(self, other)
...@@ -432,6 +432,10 @@ complexs128 = _multi(complex128)
def upcast_out(*types):
    return Scalar(dtype = Scalar.upcast(*types)),
def upcast_out_no_complex(*types):
if any([type in complex_types for type in types]):
        raise TypeError('complex types are not supported')
return Scalar(dtype = Scalar.upcast(*types)),
def same_out(type):
    return type,
def same_out_float_only(type):
...@@ -449,10 +453,10 @@ class transfer_type(gof.utils.object2):
        upcast = upcast_out(*types)
        retval = []
        for i in self.transfer:
            if i is None:
                retval += [upcast]
            else:
                retval += [types[i]]
        return retval
        #return [upcast if i is None else types[i] for i in self.transfer]
    def __eq__(self, other):
...@@ -481,6 +485,14 @@ def upgrade_to_float(*types):
            int32: float64,
            int64: float64}
    return Scalar(Scalar.upcast(*[conv.get(type, type) for type in types])),
def upgrade_to_float_no_complex(*types):
"""
    Don't accept complex; otherwise, behave like upgrade_to_float().
"""
for type in types:
if type in complex_types:
raise TypeError('complex argument not supported')
return upgrade_to_float(*types)
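As a hedged illustration (NumPy, not Theano code), NumPy's own promotion rules behave like the `conv` table above: integer types that cannot be represented exactly in float32 are widened to float64.

```python
import numpy as np

# int32 can't be represented exactly in float32, so it upgrades to float64,
# mirroring the int32 -> float64 entry in the conv table above.
print(np.promote_types('int32', 'float32'))  # float64
# int8 fits in float32, so no widening past float32 is needed.
print(np.promote_types('int8', 'float32'))   # float32
```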
def same_out_nocomplex(type):
    if type in complex_types:
        raise TypeError('complex argument not supported')
...@@ -577,7 +589,7 @@ class ScalarOp(Op):
    def impl(self, *inputs):
        raise utils.MethodNotDefined("impl", type(self), self.__class__.__name__)
    def grad(self, inputs, output_gradients):
        raise utils.MethodNotDefined("grad", type(self), self.__class__.__name__)
...@@ -622,8 +634,11 @@ class LT(LogicalComparison):
    commutative = False
    associative = False
    def impl(self, x, y):
        # The built-in < doesn't support complex
        return numpy.less(x, y)
    def c_code(self, node, name, (x, y), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError()
        return "%(z)s = (%(x)s < %(y)s);" % locals()
lt = LT()
...@@ -632,8 +647,11 @@ class GT(LogicalComparison):
    commutative = False
    associative = False
    def impl(self, x, y):
        # The built-in > doesn't support complex
        return numpy.greater(x, y)
    def c_code(self, node, name, (x, y), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError()
        return "%(z)s = (%(x)s > %(y)s);" % locals()
gt = GT()
...@@ -642,8 +660,11 @@ class LE(LogicalComparison):
    commutative = False
    associative = False
    def impl(self, x, y):
        # The built-in <= doesn't support complex
        return numpy.less_equal(x, y)
    def c_code(self, node, name, (x, y), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError()
        return "%(z)s = (%(x)s <= %(y)s);" % locals()
le = LE()
...@@ -652,8 +673,11 @@ class GE(LogicalComparison):
    commutative = False
    associative = False
    def impl(self, x, y):
        # The built-in >= doesn't support complex
        return numpy.greater_equal(x, y)
    def c_code(self, node, name, (x, y), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError()
        return "%(z)s = (%(x)s >= %(y)s);" % locals()
ge = GE()
...@@ -664,6 +688,8 @@ class EQ(LogicalComparison):
    def impl(self, x, y):
        return x == y
    def c_code(self, node, name, (x, y), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError()
        return "%(z)s = (%(x)s == %(y)s);" % locals()
eq = EQ()
...@@ -674,6 +700,8 @@ class NEQ(LogicalComparison):
    def impl(self, x, y):
        return x != y
    def c_code(self, node, name, (x, y), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError()
        return "%(z)s = (%(x)s != %(y)s);" % locals()
neq = NEQ()
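The comments in these `impl` methods refer to the fact that Python's built-in ordering operators reject complex operands, which is why the ordered comparisons delegate to NumPy ufuncs. A quick standalone check (plain Python, not Theano):

```python
# Python's built-in ordering operators raise TypeError on complex numbers,
# so comparisons like LT.impl delegate to NumPy ufuncs instead.
try:
    (1 + 2j) < (3 + 4j)
    ordered = True
except TypeError:
    ordered = False
print(ordered)  # False
```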
...@@ -694,17 +722,17 @@ class InRange(LogicalComparison):
        return True
    def c_code(self, node, name, (x, low, hi), (z, ), sub):
        if self.openlow:
            cmp1 = '>'
        else:
            cmp1 = '>='
        #backport
        #cmp1 = '>' if self.openlow else '>='
        if self.openhi:
            cmp2 = '<'
        else:
            cmp2 = '<='
        #backport
        #cmp2 = '<' if self.openhi else '<='
...@@ -717,25 +745,25 @@ inclosedrange = InRange(False, False)
class Switch(ScalarOp):
    nin = 3
    def impl(self, cond, ift, iff):
        if cond:
            return ift
        else:
            return iff
        #backport
        #return ift if cond else iff
    def c_code(self, node, name, (cond, ift, iff), (z, ), sub):
        return "%(z)s = %(cond)s ? %(ift)s : %(iff)s;" % locals()
    def grad(self, (cond, ift, iff), (gz, )):
        if ift.type in continuous_types:
            first_part = switch(cond, gz, 0)
        else:
            first_part = None
        if iff.type in continuous_types:
            second_part = switch(cond, 0, gz)
        else:
            second_part = None
        return (None, first_part, second_part)
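A minimal sketch (using NumPy as a stand-in, not Theano itself) of the elementwise `cond ? ift : iff` semantics that `Switch` implements:

```python
import numpy as np

def switch_impl(cond, ift, iff):
    # Same cond ? ift : iff semantics as Switch, applied elementwise.
    return np.where(cond, ift, iff)

print(switch_impl(np.array([1, 0, 1]), 10, 20))  # [10 20 10]
```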
...@@ -751,7 +779,7 @@ class UnaryBitOp(UnaryScalarOp):
    def output_types(self, *input_types):
        for i in input_types[0]:
            if i not in (int8, int32, int64):
                raise TypeError('input to a BitOp must have type int8, int32 or int64... not %s' % i)
        return upcast_out(*input_types[0])
    def grad(self, inputs, output_gradients):
        return [None]
...@@ -761,7 +789,7 @@ class BinaryBitOp(BinaryScalarOp):
        t0, t1 = input_types[0]
        for i in input_types[0]:
            if i not in (int8, int32, int64):
                raise TypeError('input to a BitOp must have type int8, int32 or int64... not %s' % i)
        return upcast_out(*input_types[0])
    def grad(self, inputs, output_gradients):
        return [None, None]
...@@ -806,19 +834,22 @@ class Maximum(BinaryScalarOp):
    commutative = True
    associative = True
    def impl(self, *inputs):
        # The built-in max function doesn't support complex types
        return numpy.maximum(*inputs)
    def c_code(self, node, name, (x,y), (z, ), sub):
        if any([i.type in complex_types for i in node.inputs]):
            raise NotImplementedError()
        return "%(z)s = ((%(y)s)>(%(x)s)? (%(y)s):(%(x)s));" %locals()
    def grad(self, (x, y), (gz, )):
        assert gz.type not in complex_types
        # max is not defined for complex_types
        gx, gy = None, None
        if x.type in float_types:
            gx = eq(maximum(x,y), x)*gz
        if y.type in float_types:
            gy = eq(maximum(x,y), y)*gz
        return (gx,gy)
maximum = Maximum(upcast_out, name = 'maximum')
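The gradient above routes `gz` to whichever input achieved the maximum (to both, at ties). A NumPy sketch of the same rule, assuming `eq(maximum(x,y), x)` behaves like an elementwise equality mask:

```python
import numpy as np

x = np.array([1.0, 5.0, 2.0])
y = np.array([3.0, 2.0, 2.0])
gz = np.ones(3)

m = np.maximum(x, y)
gx = (m == x) * gz   # mirrors eq(maximum(x, y), x) * gz
gy = (m == y) * gz   # mirrors eq(maximum(x, y), y) * gz
print(gx)  # [0. 1. 1.]
print(gy)  # [1. 0. 1.]  (a tie sends gz to both inputs)
```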
...@@ -826,19 +857,22 @@ class Minimum(BinaryScalarOp):
    commutative = True
    associative = True
    def impl(self, *inputs):
        # The built-in min function doesn't support complex types
        return numpy.minimum(*inputs)
    def c_code(self, node, name, (x,y), (z, ), sub):
        if any([i.type in complex_types for i in node.inputs]):
            raise NotImplementedError()
        return "%(z)s = ((%(y)s)<(%(x)s)? (%(y)s):(%(x)s));" %locals()
    def grad(self, (x, y), (gz, )):
        assert gz.type not in complex_types
        # min is not defined for complex_types
        gx, gy = None, None
        if x.type in float_types:
            gx = eq(minimum(x,y), x)*gz
        if y.type in float_types:
            gy = eq(minimum(x,y), y)*gz
        return (gx,gy)
minimum = Minimum(upcast_out, name = 'minimum')
...@@ -886,24 +920,24 @@ class Mul(ScalarOp):
        else:
            return z + " = " + " * ".join(inputs) + ";"
    def grad(self, inputs, (gz, )):
        retval = []
        for input in inputs:
            if input.type in continuous_types:
                if gz.type in complex_types:
                    # zr+zi = (xr + xi)(yr + yi)
                    # zr+zi = (xr*yr - xi*yi) + (xr yi + xi yr )
                    otherprod = mul(*(utils.difference(inputs, [input])))
                    yr = real(otherprod)
                    yi = imag(otherprod)
                    if input.type in complex_types:
                        retval += [complex(yr*real(gz)+yi*imag(gz), yr*imag(gz)-yi*real(gz))]
                    else:
                        retval += [cast(yr*real(gz)+yi*imag(gz), input.type.dtype)]
                else:
                    retval += [cast(mul(*([gz] + utils.difference(inputs, [input]))), input.type.dtype)]
            else:
                retval += [None]
        return retval
mul = Mul(upcast_out, name = 'mul')
class Sub(BinaryScalarOp):
...@@ -949,6 +983,9 @@ class TrueDiv(BinaryScalarOp):
        else:
            return x / y
    def c_code(self, node, name, (x, y), (z, ), sub):
        # we generate good c code only when both inputs are complex or neither is
        if sum([node.inputs[0].type in complex_types, node.inputs[1].type in complex_types])==1:
            raise NotImplementedError('type not supported', type)
        if node.inputs[0].type in int_types and node.inputs[1].type in int_types:
            return "%(z)s = ((double)%(x)s) / %(y)s;" % locals()
        return "%(z)s = %(x)s / %(y)s;" % locals()
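The `(double)` cast above exists because `TrueDiv` must return a float even for integer inputs; without it, C's integer division truncates. Python 3's operators make the same distinction (an illustration, not Theano code):

```python
# TrueDiv must yield a float even for int inputs, which the (double) cast
# in the generated C code achieves; Python 3's two division operators
# show the same contrast:
print(7 / 2)    # 3.5  (true division)
print(7 // 2)   # 3    (C-style truncating division for positive ints)
```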
...@@ -956,14 +993,14 @@ class TrueDiv(BinaryScalarOp):
        if x.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            first_part = cast(gz / y, x.type.dtype)
        else:
            first_part = None
        if y.type in float_types:
            second_part = cast(-(gz * x) / (y * y), y.type.dtype)
        else:
            second_part = None
        return first_part, second_part
true_div = TrueDiv(upcast_out, name = 'true_div')
...@@ -1006,7 +1043,7 @@ class Mod(BinaryScalarOp):
            x_mod_ymp = "fmod(-%(x)s,%(y)s)"%locals()
        else:
            raise NotImplementedError('type not supported', type)
        return """
        if (%(x)s < 0){
            if (%(y)s < 0){
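The negative-operand handling in `Mod` exists because C's `fmod` takes the sign of the dividend while Python's `%` takes the sign of the divisor; a quick standalone check of the two conventions:

```python
import math

# Python's % follows the divisor's sign...
print((-7) % 3)          # 2
print(7 % -3)            # -2
# ...while C-style fmod follows the dividend's sign, hence the fix-ups
# for negative operands in the generated C code.
print(math.fmod(-7, 3))  # -1.0
```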
...@@ -1028,6 +1065,8 @@ class Pow(BinaryScalarOp):
    def impl(self, x, y):
        return x ** y
    def c_code(self, node, name, (x, y), (z, ), sub):
        if node.inputs[0].type in complex_types or node.inputs[1].type in complex_types:
            raise NotImplementedError('type not supported', type)
        return "%(z)s = pow(%(x)s, %(y)s);" % locals()
    def grad(self, (x, y), (gz, )):
        if gz.type in complex_types:
...@@ -1038,7 +1077,7 @@ class Pow(BinaryScalarOp):
            first_part = None
        if y.type in float_types:
            second_part = gz * log(x) * x**y
        else:
            second_part = None
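The two branches compute the standard derivatives of `x**y` (the first branch is truncated in this hunk): `d/dx x**y = y*x**(y-1)` and `d/dy x**y = log(x)*x**y`, each scaled by `gz`. A standalone finite-difference spot-check of those formulas (plain Python, not Theano):

```python
import math

x, y, eps = 2.0, 3.0, 1e-6
# Central differences for d/dx and d/dy of x**y.
num_dx = ((x + eps)**y - (x - eps)**y) / (2 * eps)
num_dy = (x**(y + eps) - x**(y - eps)) / (2 * eps)
assert abs(num_dx - y * x**(y - 1)) < 1e-4   # y*x**(y-1) = 12
assert abs(num_dy - math.log(x) * x**y) < 1e-4
print("pow gradient formulas check out")
```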
...@@ -1073,9 +1112,9 @@ class First(BinaryScalarOp):
        return "%(z)s = %(x)s;" % locals()
    def grad(self, (x, y), (gz, )):
        if x.type in continuous_types:
            return gz, None
        else:
            return None,None
first = First(transfer_type(0), name = 'first')
class Second(BinaryScalarOp):
...@@ -1085,9 +1124,9 @@ class Second(BinaryScalarOp):
        return "%(z)s = %(y)s;" % locals()
    def grad(self, (x, y), (gz, )):
        if y.type in continuous_types:
            return None, gz
        else:
            return None
second = Second(transfer_type(1), name = 'second')
...@@ -1121,9 +1160,9 @@ class Cast(UnaryScalarOp):
        return "%s = (%s)%s;" % (z, node.outputs[0].type.dtype_specs()[1], x)
    def grad(self, (x, ), (gz, )):
        if x.type in continuous_types:
            return [cast(gz, x.type.dtype)]
        else:
            return None,
    def c_code_cache_version(self):
        s = super(Cast, self).c_code_cache_version()
        if s:
...@@ -1158,7 +1197,7 @@ _cast_mapping = {
    'complex64': convert_to_complex64,
    'complex128': convert_to_complex128}
def cast(x, dtype):
    """Symbolically cast `x` to a Scalar of given `dtype`."""
    if dtype == 'floatX': dtype = config.floatX
    _x = as_scalar(x)
...@@ -1245,7 +1284,7 @@ class RoundHalfToEven(UnaryScalarOp):
    See http://en.wikipedia.org/wiki/Rounding for more detail
    """
    def impl(self, x):
        return numpy.round(x)
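`numpy.round` rounds halves to the nearest even integer, which is exactly the behaviour this op's name promises:

```python
import numpy as np

# Halfway cases go to the nearest even integer (banker's rounding).
print(np.round([0.5, 1.5, 2.5, 3.5]))  # [0. 2. 2. 4.]
```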
    def c_code___(self, node, name, (x, ), (z, ), sub):
        typ = node.outputs[0].type.dtype
        if not node.outputs[0].type.dtype in ['float32', 'float64']:
...@@ -1255,7 +1294,7 @@ class RoundHalfToEven(UnaryScalarOp):
        #ifndef ROUNDING_EPSILON
        #define ROUNDING_EPSILON 0.0000001
        #endif
        if (%(x)s < 0.0){
            // We implement the else part like that: -else( -%(x)s);
            %(typ)s i;
See http://en.wikipedia.org/wiki/Rounding for more detail See http://en.wikipedia.org/wiki/Rounding for more detail
""" """
def impl(self, x): def impl(self, x):
return round_half_away_from_zero_vec(x) return round_half_away_from_zero_vec(x)
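A minimal scalar sketch of round-half-away-from-zero, the behaviour the vectorized helper implements (and that C99 `round()` provides, which the `c_code` relies on):

```python
import math

def round_half_away_from_zero(x):
    # Halfway cases move away from zero, unlike numpy.round's half-to-even.
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

print(round_half_away_from_zero(0.5))   # 1
print(round_half_away_from_zero(-0.5))  # -1
print(round_half_away_from_zero(2.5))   # 3
```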
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.outputs[0].type.dtype in ['float32', 'float64']:
            return "%(z)s = round(%(x)s);" % locals()
...@@ -1337,9 +1376,9 @@ class Neg(UnaryScalarOp):
        return -x
    def grad(self, (x, ), (gz, )):
        if x.type in continuous_types:
            return -gz,
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        return "%(z)s = -%(x)s;" % locals()
neg = Neg(same_out, name = 'neg')
...@@ -1387,9 +1426,9 @@ class Log2(UnaryScalarOp):
        if x.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return gz / (x * math.log(2.0)),
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
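A finite-difference spot-check of the gradient used above, `d/dx log2(x) = 1/(x*ln 2)` (plain Python sketch):

```python
import math

x, eps = 3.0, 1e-6
num = (math.log(x + eps, 2) - math.log(x - eps, 2)) / (2 * eps)
analytic = 1.0 / (x * math.log(2.0))   # the gz / (x * math.log(2.0)) factor
assert abs(num - analytic) < 1e-6
print("log2 gradient matches")
```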
...@@ -1405,9 +1444,9 @@ class Log10(UnaryScalarOp):
        if x.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return gz / (x * numpy.log(10.0)),
        else:
            return None
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
...@@ -1451,12 +1490,12 @@ class Sqr(UnaryScalarOp):
    def impl(self, x):
        return x*x
    def grad(self, (x, ), (gz, )):
        if gz.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return gz * x * 2,
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        return "%(z)s = %(x)s * %(x)s;" % locals()
...@@ -1466,12 +1505,12 @@ class Sqrt(UnaryScalarOp):
    def impl(self, x):
        return numpy.sqrt(x)
    def grad(self, (x, ), (gz, )):
        if gz.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return (gz * 0.5) / sqrt(x),
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError('type not supported', type)
...@@ -1482,12 +1521,12 @@ class Cos(UnaryScalarOp):
    def impl(self, x):
        return numpy.cos(x)
    def grad(self, (x, ), (gz, )):
        if gz.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return -gz * sin(x),
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError('type not supported', type)
...@@ -1498,12 +1537,12 @@ class Sin(UnaryScalarOp):
    def impl(self, x):
        return numpy.sin(x)
    def grad(self, (x, ), (gz, )):
        if x.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return gz * cos(x),
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError('type not supported', type)
...@@ -1514,12 +1553,12 @@ class Tan(UnaryScalarOp):
    def impl(self, x):
        return numpy.tan(x)
    def grad(self, (x, ), (gz, )):
        if x.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return gz / sqr(cos(x)),
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError('type not supported', type)
...@@ -1533,12 +1572,12 @@ class Cosh(UnaryScalarOp):
    def impl(self, x):
        return numpy.cosh(x)
    def grad(self, (x, ), (gz, )):
        if x.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return gz * sinh(x),
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError('type not supported', type)
...@@ -1552,12 +1591,12 @@ class Sinh(UnaryScalarOp):
    def impl(self, x):
        return numpy.sinh(x)
    def grad(self, (x, ), (gz, )):
        if x.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return gz * cosh(x),
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError('type not supported', type)
...@@ -1572,12 +1611,12 @@ class Tanh(UnaryScalarOp):
    def impl(self, x):
        return numpy.tanh(x)
    def grad(self, (x, ), (gz, )):
        if x.type in complex_types:
            raise NotImplementedError()
        if x.type in float_types:
            return gz * (1 - sqr(tanh(x))),
        else:
            return None,
    def c_code(self, node, name, (x, ), (z, ), sub):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError('type not supported', type)
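The same kind of finite-difference spot-check for the tanh gradient, `d/dx tanh(x) = 1 - tanh(x)**2` (plain Python sketch):

```python
import math

x, eps = 0.3, 1e-6
num = (math.tanh(x + eps) - math.tanh(x - eps)) / (2 * eps)
assert abs(num - (1 - math.tanh(x)**2)) < 1e-4
print("tanh gradient matches")
```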
...@@ -1651,7 +1690,7 @@ class Complex(BinaryScalarOp):
    def impl(self, x, y):
        return numpy.complex(x, y)
    def grad(self, (x,y), (gz,)):
        return [cast(real(gz), x.type.dtype),
                cast(imag(gz), y.type.dtype)]
complex = Complex(name='complex')
...@@ -1678,7 +1717,7 @@ class ComplexFromPolar(BinaryScalarOp):
    def grad(self, (r,theta), (gz,)):
        gr = cos(theta) * real(gz) + sin(theta) * imag(gz)
        gtheta = -real(gz) * r * sin(theta) + imag(gz) * r * cos(theta)
        return [cast(gr, r.type.dtype),
                cast(gtheta, theta.type.dtype)]
complex_from_polar = ComplexFromPolar(name='complex_from_polar')
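`complex_from_polar` builds `r*(cos(theta) + 1j*sin(theta))`, and the gradient above is the chain rule through that definition. The forward map can be spot-checked with the standard library (an illustration, not Theano code):

```python
import cmath
import math

r, theta = 2.0, math.pi / 2
z = cmath.rect(r, theta)   # r*cos(theta) + 1j*r*sin(theta)
assert abs(z - 2j) < 1e-12
print(z)
```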
...@@ -1702,7 +1741,7 @@ class Composite(ScalarOp):
    def make_new_inplace(self, output_types_preference = None, name = None):
        """
        This op's __init__ doesn't take the same parameters as other scalar
        ops, which breaks the insert_inplace_optimizer optimization.
        This method allows working around that.
        """
        out = self.__class__(self.inputs,self.outputs)
@@ -1823,12 +1862,12 @@ class Composite(ScalarOp):
        # The use of a dummy id is safe as the code is in a separate block.
        # It won't generate conflicting variable names.
        d['id'] = '_DUMMY_ID_'
        return self._c_code % d
    def c_code_cache_version(self):
        return (1,) + tuple([x.op.c_code_cache_version() for x in self.env.toposort()])
    def c_support_code(self):
        str = ""
        for node in self.env.toposort():
@@ -1859,11 +1898,9 @@ class Composite(ScalarOp):
        d.pop('env')
        d.pop('_impls')
        return d
    def __setstate__(self, d):
        self.__dict__.update(d)
        # We must call __init__ to set env and _impls again,
        # otherwise self.perform won't work.
        self.__init__(self.inputs, self.outputs)
# Definitions of theano.scalar ops that have their Python implementations taken
# from SciPy. As SciPy is not always available, we treat them separately.
-from theano.scalar.basic import UnaryScalarOp,exp,sqrt,upgrade_to_float,complex_types,float_types,upcast
import numpy
+from theano.scalar.basic import UnaryScalarOp,exp,upgrade_to_float,float_types
+from theano.scalar.basic import upgrade_to_float_no_complex,complex_types,upcast
imported_scipy_special = False
try:
    import scipy.special
@@ -49,4 +50,6 @@ class Erfc(UnaryScalarOp):
        if node.inputs[0].type in complex_types:
            raise NotImplementedError('type not supported', type)
        return "%(z)s = erfc(%(x)s);" % locals()
-erfc = Erfc(upgrade_to_float, name = 'erfc')
+# scipy.special.erfc doesn't support complex. Why?
+erfc = Erfc(upgrade_to_float_no_complex, name = 'erfc')
@@ -4414,6 +4414,7 @@ class numeric_grad:
            x[i] += eps
            f_eps = f(*apt)
            gx[i] = numpy.asarray((f_eps - f_x)/eps)
        if packed_pt:
@@ -4594,6 +4595,7 @@ def verify_grad(fun, pt, n_tests=2, rng=None, eps=None, abs_tol=None, rel_tol=No
    for test_num in xrange(n_tests):
        num_grad = numeric_grad(cost_fn, [p.copy() for p in pt], eps)
        analytic_grad = grad_fn(*[p.copy() for p in pt])
        if not isinstance(analytic_grad, (list, tuple)):
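``numeric_grad`` above uses one-sided finite differences: perturb one coordinate at a time and divide the change in the cost by ``eps``. A self-contained sketch of that scheme (not the Theano class itself, which also handles packed points and multiple arguments):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-7):
    # One-sided finite differences, element by element, mirroring the loop
    # in the class above: gx[i] = (f(x + eps*e_i) - f(x)) / eps.
    x = np.asarray(x, dtype='float64')
    f_x = f(x)
    gx = np.empty_like(x)
    for i in range(x.size):
        x_eps = x.copy()
        x_eps.flat[i] += eps
        gx.flat[i] = (f(x_eps) - f_x) / eps
    return gx

g = numeric_grad(lambda v: float(np.sum(v ** 2)), [1.0, -2.0, 3.0])
# The analytic gradient of sum(v**2) is 2*v.
```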
...@@ -4621,6 +4623,7 @@ class GradientError(Exception): ...@@ -4621,6 +4623,7 @@ class GradientError(Exception):
self.abs_tol = abs_tol self.abs_tol = abs_tol
self.rel_tol = rel_tol self.rel_tol = rel_tol
def __str__(self): def __str__(self):
return """GradientError: numeric gradient and analytic gradient exceed tolerance: return """GradientError: numeric gradient and analytic gradient exceed tolerance:
At position %i of argument %i, At position %i of argument %i,
......
@@ -49,12 +49,16 @@ class Conv3D(theano.Op):
    def __str__(self):
        return "Conv3D"
+    def c_code_cache_version(self):
+        return (1,)
    def make_node(self, V, W, b, d):
        """
-        :param V: Visible unit, input
-        :param W: Weights, filter
+        :param V: Visible unit, input(batch,row,column,time,in channel)
+        :param W: Weights, filter(out channel,row,column,time,in channel)
        :param b: bias, shape == (W.shape[0],)
-        :param d: strides when moving the filter over the input
+        :param d: strides when moving the filter over the input(dx,dy,dt)
        """
        V_ = T.as_tensor_variable(V)
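Given the shape conventions documented above, a valid 3D convolution with strides (dx, dy, dt) produces floor((input - filter) / stride) + 1 positions along each of the row, column, and time axes. This is the standard valid-convolution rule inferred from the docstring, not code taken from this diff; a hedged sketch:

```python
def conv3d_output_shape(in_shape, filter_shape, strides):
    # in_shape, filter_shape: (rows, cols, time); strides: (dx, dy, dt).
    # Valid convolution: out = (in - filter) // stride + 1 on each axis.
    return tuple((i - f) // s + 1
                 for i, f, s in zip(in_shape, filter_shape, strides))

print(conv3d_output_shape((10, 10, 8), (3, 3, 2), (1, 1, 2)))
```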
@@ -82,22 +86,22 @@ class Conv3D(theano.Op):
        dCdb = T.sum(dCdH, axis=(0,1,2,3))
        dCdd = None #not differentiable, since d is not continuous
-        if 'name' in dir(dCdH) and dCdH.name != None:
+        if 'name' in dir(dCdH) and dCdH.name is not None:
            dCdH_name = dCdH.name
        else:
            dCdH_name = 'anon'
-        if 'name' in dir(V) and V.name != None:
+        if 'name' in dir(V) and V.name is not None:
            V_name = V.name
        else:
            V_name = 'anon'
-        if 'name' in dir(W) and W.name != None:
+        if 'name' in dir(W) and W.name is not None:
            W_name = W.name
        else:
            W_name = 'anon'
-        if 'name' in dir(b) and b.name != None:
+        if 'name' in dir(b) and b.name is not None:
            b_name = b.name
        else:
            b_name = 'anon'
...
@@ -3,6 +3,10 @@ from theano.tensor import basic as T
from theano.misc import strutil
import numpy as N
+#TODO: speed up by reordering loops. Should pass through the videos once, incrementing all weight gradients, rather
+# than visiting each weight gradient element once and passing through whole video
class ConvGrad3D(theano.Op):
    """ Gradient of Conv3D with respect to W """
    def __eq__(self,other):
@@ -11,6 +15,9 @@ class ConvGrad3D(theano.Op):
    def __hash__(self):
        return hash(type(self))
+    def c_code_cache_version(self):
+        return (1,)
    def make_node(self, V, d, WShape, dCdH):
        V_ = T.as_tensor_variable(V)
        d_ = T.as_tensor_variable(d)
...
@@ -11,6 +11,9 @@ class ConvTransp3D(theano.Op):
    def __hash__(self):
        return hash(type(self))
+    def c_code_cache_version(self):
+        return (1,)
    def make_node(self, W, b, d, H, RShape = None):
        """
        :param W: Weights, filter
@@ -50,22 +53,22 @@ class ConvTransp3D(theano.Op):
        dCdRShape = None #not differentiable, since RShape is not continuous
-        if 'name' in dir(dCdR) and dCdR.name != None:
+        if 'name' in dir(dCdR) and dCdR.name is not None:
            dCdR_name = dCdR.name
        else:
            dCdR_name = 'anon'
-        if 'name' in dir(H) and H.name != None:
+        if 'name' in dir(H) and H.name is not None:
            H_name = H.name
        else:
            H_name = 'anon'
-        if 'name' in dir(W) and W.name != None:
+        if 'name' in dir(W) and W.name is not None:
            W_name = W.name
        else:
            W_name = 'anon'
-        if 'name' in dir(b) and b.name != None:
+        if 'name' in dir(b) and b.name is not None:
            b_name = b.name
        else:
            b_name = 'anon'
@@ -79,9 +82,9 @@ class ConvTransp3D(theano.Op):
    def perform(self, node, inputs, output_storage):
        W, b, d, H, RShape = inputs
        print "\t\t\t\tConvTransp3D python code"
        output_storage[0][0] = computeR(W,b,d,H,RShape)
    def c_code(self, node, nodename, (W, b, d, H, RShape), outputs, sub):
        fail = sub['fail']
@@ -360,7 +363,7 @@ def computeR(W,b,d,H,Rshape = None):
        videoWidth = (outputWidth-1) * dc + filterWidth
        videoDur = (outputDur-1) * dt + filterDur
-        if Rshape != None and Rshape[0] != -1:
+        if Rshape is not None and Rshape[0] != -1:
            if Rshape[0] < videoHeight:
                print (Rshape[0], videoHeight)
                assert False
...
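``computeR`` above reconstructs the input extent from the output extent by inverting the valid-convolution size rule, e.g. ``videoWidth = (outputWidth-1) * dc + filterWidth``. A small sketch of the relationship:

```python
def transp_conv3d_size(out_size, filter_size, stride):
    # Inverse of the valid-convolution rule out = (in - filter) // stride + 1,
    # as used in computeR: video = (output - 1) * stride + filter.
    return (out_size - 1) * stride + filter_size

print(transp_conv3d_size(4, 2, 2))
```

When the stride divides evenly, this exactly round-trips the forward size computation.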
@@ -290,7 +290,7 @@ class ConvOp(Op):
        :type dx: int
        :param dx: patch stride rows
        :type dy: int
-        :param dx: patch stride cols
+        :param dy: patch stride cols
        Params which select the version of code used:
...
@@ -283,7 +283,6 @@ def local_dimshuffle_lift(node):
    else:
        return DimShuffle(iinput.type.broadcastable, new_order, inplace).make_node(iinput).outputs
-@register_specialize
@gof.local_optimizer([])
def dimshuffle_as_view(node):
    op = node.op
@@ -293,6 +292,7 @@ def dimshuffle_as_view(node):
    return [new_op(*node.inputs)]
+register_specialize(dimshuffle_as_view, 'inplace')
register_canonicalize(local_dimshuffle_lift)
register_specialize(local_dimshuffle_lift)
@@ -2313,15 +2313,21 @@ def local_add_specialize(node):
            y = get_constant_value(input)
        except TypeError:
            y = input
-        if N.all(y == 0.0):
+        if numpy.all(y == 0.0):
            continue
        new_inputs.append(input)
    if len(new_inputs) < len(node.inputs):
        if len(new_inputs) == 0:
            #we got rid of the entire expression!
-            return fill_chain(T.TensorConstant(T.TensorType(dtype=node.outputs[0].type.dtype,
-                broadcastable = [True] * node.outputs[0].ndim), N.asarray(0)))
+            ndim = node.outputs[0].type.ndim
+            dtype = node.outputs[0].type.dtype
+            return fill_chain(
+                    T.TensorConstant(
+                        T.TensorType(
+                            dtype=dtype,
+                            broadcastable = [True] * ndim),
+                        numpy.zeros((1,)*ndim, dtype=dtype)))
        if len(new_inputs) == 1:
            return fill_chain(new_inputs[0])
...
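The rewritten constant above matters because ``N.asarray(0)`` is a 0-d array, while the ``TensorType`` declares ``ndim`` broadcastable dimensions; the replacement builds a zero whose shape has one length-1 axis per dimension, matching ``broadcastable = [True] * ndim``. In plain NumPy:

```python
import numpy as np

ndim, dtype = 3, 'float64'
old_zero = np.asarray(0)                       # 0-d: ndim == 0, no axes at all
new_zero = np.zeros((1,) * ndim, dtype=dtype)  # one length-1 axis per dimension
print(old_zero.ndim, new_zero.shape)
```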
@@ -197,6 +197,7 @@ def makeTester(name, op, expected, checks = {}, good = {}, bad_build = {}, bad_r
rand = lambda *shape: 2 * numpy.asarray(numpy.random.rand(*shape), dtype=config.floatX) - 1
randint = lambda *shape: numpy.random.random_integers(-5, 5, shape)
+randcomplex = lambda *shape: numpy.complex128(2 * numpy.asarray(numpy.random.rand(*shape), dtype=config.floatX) - 1)
def randint_nonzero(*shape):
    r = numpy.random.random_integers(-5, 4, shape)
@@ -208,6 +209,8 @@ def rand_ranged(min, max, shape):
def randint_ranged(min, max, shape):
    return numpy.random.random_integers(min, max, shape)
+def randc128_ranged(min, max, shape):
+    return numpy.asarray(numpy.random.rand(*shape) * (max - min) + min, dtype='complex128')
def makeBroadcastTester(op, expected, checks = {}, **kwargs):
    name = str(op) + "Tester"
@@ -234,6 +237,10 @@ _good_broadcast_binary_normal = dict(same_shapes = (rand(2, 3), rand(2, 3)),
                                     integers = (randint(2, 3), randint(2, 3)),
                                     dtype_mixup_1 = (rand(2, 3), randint(2, 3)),
                                     dtype_mixup_2 = (randint(2, 3), rand(2, 3)),
+                                     complex1 = (randcomplex(2,3),randcomplex(2,3)),
+                                     complex2 = (randcomplex(2,3),rand(2,3)),
+                                     # Disabled as we test the case where we reuse the same output as the first inputs.
+                                     # complex3 = (rand(2,3),randcomplex(2,3)),
                                     empty = (numpy.asarray([]),numpy.asarray([1])),
                                     )
@@ -248,6 +255,10 @@ _grad_broadcast_binary_normal = dict(same_shapes = (rand(2, 3), rand(2, 3)),
                                     column = (rand(2, 3), rand(2, 1)),
                                     #This don't work as verify grad don't support that
                                     #empty = (numpy.asarray([]), numpy.asarray([1]))
+                                     #complex1 = (randcomplex(2,3),randcomplex(2,3)),
+                                     #complex2 = (randcomplex(2,3),rand(2,3)),
+                                     # Disabled as we test the case where we reuse the same output as the first inputs.
+                                     #complex3 = (rand(2,3),randcomplex(2,3)),
                                     )
@@ -338,6 +349,9 @@ _good_broadcast_div_mod_normal_float_inplace = dict(same_shapes = (rand(2, 3), r
                                     dtype_mixup_2 = (randint_nonzero(2, 3), rand(2, 3)),
                                     #integers_positive = (randint_ranged(4, 10, (2, 3)), randint_ranged(1, 6, (2, 3))),
                                     #integers_known_to_fail = (numpy.array(-1), numpy.array(5))
+                                     complex1 = (randcomplex(2,3),randcomplex(2,3)),
+                                     complex2 = (randcomplex(2,3),rand(2,3)),
+                                     #complex3 = (rand(2,3),randcomplex(2,3)),# Inplace on the first element. Must have the same type.
                                     empty1 = (numpy.asarray([]), numpy.asarray([1])),
                                     #empty2 = (numpy.asarray([0]), numpy.asarray([])),
                                     )
@@ -345,11 +359,16 @@ _good_broadcast_div_mod_normal_float = dict(empty2 = (numpy.asarray([0]), numpy.
                                     **_good_broadcast_div_mod_normal_float_inplace
                                     )
_grad_broadcast_div_mod_normal = dict(same_shapes = (rand(2, 3), rand(2, 3)),
                                      scalar = (rand(2, 3), rand(1, 1)),
                                      row = (rand(2, 3), rand(1, 3)),
                                      column = (rand(2, 3), rand(2, 1)),
-                                      #empty1 = (numpy.asarray([]), numpy.asarray([1.])),
-                                      #empty2 = (numpy.asarray([0]), numpy.asarray([])),
+                                      #complex1 = (randcomplex(2,3),randcomplex(2,3)),
+                                      #complex2 = (randcomplex(2,3),rand(2,3)),
+                                      #complex3 = (rand(2,3),randcomplex(2,3)),
+                                      #dtype_mixup_1 = (rand(2, 3), randint_nonzero(2, 3)),
+                                      #dtype_mixup_2 = (randint_nonzero(2, 3), rand(2, 3)),
+                                      #empty1 = (numpy.asarray([]), numpy.asarray([1.])),
+                                      #empty2 = (numpy.asarray([0]), numpy.asarray([])),
                                      )
div_grad_rtol=None
@@ -374,14 +393,14 @@ DivInplaceTester = makeBroadcastTester(op = inplace.true_div_inplace,
                                       inplace = True)
ModTester = makeBroadcastTester(op = mod,
-                                expected = lambda x, y: x % y,
+                                expected = lambda x, y: numpy.asarray(x % y, dtype=theano.scalar.basic.upcast(x.dtype, y.dtype)),
                                good = _good_broadcast_div_mod_normal_float,
                                # integers = (randint(2, 3), randint_nonzero(2, 3)),
                                # dtype_mixup_1 = (rand(2, 3), randint_nonzero(2, 3)),
                                # dtype_mixup_2 = (randint_nonzero(2, 3), rand(2, 3))),
                                )
ModInplaceTester = makeBroadcastTester(op = inplace.mod_inplace,
-                                       expected = lambda x, y: x % y,
+                                       expected = lambda x, y: numpy.asarray(x % y, dtype=theano.scalar.basic.upcast(x.dtype, y.dtype)),
                                       good = _good_broadcast_div_mod_normal_float_inplace,
                                       inplace = True)
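The new expected value casts ``x % y`` to the upcast of both input dtypes, mirroring how the op promotes mixed inputs. NumPy's own promotion illustrates the idea (``numpy.promote_types`` stands in here for ``theano.scalar.basic.upcast``; the two resemble each other but need not agree in every case):

```python
import numpy as np

x = np.array([5, -3], dtype='int8')
y = np.array([2.5, 2.0], dtype='float32')
out_dtype = np.promote_types(x.dtype, y.dtype)  # int8 % float32 -> float32
expected = np.asarray(x % y, dtype=out_dtype)
print(out_dtype, expected)
```

Note that NumPy's ``%`` gives a result with the sign of the divisor, so ``-3 % 2.0`` is ``1.0``.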
@@ -390,12 +409,18 @@ _good_broadcast_pow_normal_float = dict(same_shapes = (rand_ranged(1, 5, (2, 3))
                                        row = (rand_ranged(1, 5, (2, 3)), rand_ranged(-3, 3, (1, 3))),
                                        column = (rand_ranged(1, 5, (2, 3)), rand_ranged(-3, 3, (2, 1))),
                                        dtype_mixup = (rand_ranged(-3, 3, (2, 3)), randint_ranged(-3, 3, (2, 3))),
+                                        complex1 = (randcomplex(2,3),randcomplex(2,3)),
+                                        complex2 = (randcomplex(2,3),rand(2,3)),
+                                        #complex3 = (rand(2,3),randcomplex(2,3)), # Inplace on the first element.
                                        empty1 = (numpy.asarray([]), numpy.asarray([1])),
                                        empty2 = (numpy.asarray([0]), numpy.asarray([])),)
_grad_broadcast_pow_normal = dict(same_shapes = (rand_ranged(1, 5, (2, 3)), rand_ranged(-3, 3, (2, 3))),
                                  scalar = (rand_ranged(1, 5, (2, 3)), rand_ranged(-3, 3, (1, 1))),
                                  row = (rand_ranged(1, 5, (2, 3)), rand_ranged(-3, 3, (1, 3))),
                                  column = (rand_ranged(1, 5, (2, 3)), rand_ranged(-3, 3, (2, 1))),
+                                  #complex1 = (randcomplex(2,3),randcomplex(2,3)),
+                                  #complex2 = (randcomplex(2,3),rand(2,3)),
+                                  #complex3 = (rand(2,3),randcomplex(2,3)),
                                  #empty1 = (numpy.asarray([]), numpy.asarray([1])),
                                  #empty2 = (numpy.asarray([0]), numpy.asarray([])),
                                  )
@@ -420,18 +445,31 @@ corner_case = numpy.asarray([-2.5, -2., -1.5, -1., -0.5, -.51, -.49, 0, 0.49, 0.
corner_case_grad = numpy.asarray([-2.5, -2., -1.5, -1., -0.5, -.51, -.49, 0.49, 0.5, 0.9, 1, 1.5, 2, 2.5], dtype=config.floatX)
_good_broadcast_unary_normal_float = dict(normal = (rand_ranged(-5, 5, (2, 3)),),
                                          corner_case = (corner_case,),
+                                          complex = (randcomplex(2,3),),
                                          empty = (numpy.asarray([]),))
_good_broadcast_unary_normal_float_no_empty = copy(_good_broadcast_unary_normal_float)
del _good_broadcast_unary_normal_float_no_empty['empty']
+_good_broadcast_unary_normal_float_no_empty_no_complex = copy(_good_broadcast_unary_normal_float_no_empty)
+del _good_broadcast_unary_normal_float_no_empty_no_complex['complex']
+_good_broadcast_unary_normal_float_no_complex = copy(_good_broadcast_unary_normal_float)
+del _good_broadcast_unary_normal_float_no_complex['complex']
_good_broadcast_unary_normal = dict(normal = (numpy.asarray(rand_ranged(-5, 5, (2, 3)),dtype=config.floatX),),
                                    integers = (randint_ranged(-5, 5, (2, 3)),),
                                    corner_case = (corner_case,),
+                                    complex = (randcomplex(2,3),),
                                    empty = (numpy.asarray([]),))
+_good_broadcast_unary_normal_no_complex = dict(normal = (numpy.asarray(rand_ranged(-5, 5, (2, 3)),dtype=config.floatX),),
+                                               integers = (randint_ranged(-5, 5, (2, 3)),),
+                                               corner_case = (corner_case,),
+                                               #complex = (randcomplex(2,3),),
+                                               empty = (numpy.asarray([]),))
_grad_broadcast_unary_normal = dict(normal = (numpy.asarray(rand_ranged(-5, 5, (2, 3)),dtype=config.floatX),),
                                    corner_case = (corner_case_grad,),
+                                    #complex = (randcomplex(2,3),),
                                    #empty = (numpy.asarray([]),)
                                    )
@@ -441,9 +479,12 @@ AbsTester = makeBroadcastTester(op = tensor.abs_,
                                expected = lambda x: abs(x),
                                good = _good_broadcast_unary_normal,
                                grad = _grad_broadcast_unary_normal)
+_good_broadcast_unary_normal_abs = copy(_good_broadcast_unary_normal)
+# Can't do inplace on Abs as the input/output are not of the same type!
+del _good_broadcast_unary_normal_abs['complex']
AbsInplaceTester = makeBroadcastTester(op = inplace.abs__inplace,
                                       expected = lambda x: numpy.abs(x),
-                                       good = _good_broadcast_unary_normal,
+                                       good = _good_broadcast_unary_normal_abs,
                                       grad = _grad_broadcast_unary_normal,
                                       inplace = True)
@@ -458,33 +499,38 @@ NegInplaceTester = makeBroadcastTester(op = inplace.neg_inplace,
                                       inplace = True)
SgnTester = makeBroadcastTester(op = sgn,
                                expected = numpy.sign,
-                                good = _good_broadcast_unary_normal)
+                                good = _good_broadcast_unary_normal_no_complex,
+                                grad = _grad_broadcast_unary_normal,)
SgnInplaceTester = makeBroadcastTester(op = inplace.sgn_inplace,
                                       expected = numpy.sign,
-                                       good = _good_broadcast_unary_normal,
+                                       good = _good_broadcast_unary_normal_no_complex,
+                                       grad = _grad_broadcast_unary_normal,
                                       inplace = True)
CeilTester = makeBroadcastTester(op = ceil,
                                 expected = lambda a: numpy.asarray(numpy.ceil(a),a.dtype),
-                                 good = _good_broadcast_unary_normal)
+                                 good = _good_broadcast_unary_normal_no_complex,
+                                 grad = _grad_broadcast_unary_normal)
CeilInplaceTester = makeBroadcastTester(op = inplace.ceil_inplace,
                                        expected = lambda a: numpy.asarray(numpy.ceil(a),a.dtype),
-                                        good = _good_broadcast_unary_normal,
+                                        good = _good_broadcast_unary_normal_no_complex,
+                                        grad = _grad_broadcast_unary_normal,
                                        inplace = True)
FloorTester = makeBroadcastTester(op = floor,
                                  expected = lambda a: numpy.asarray(numpy.floor(a),a.dtype),
-                                  good = _good_broadcast_unary_normal,
+                                  good = _good_broadcast_unary_normal_no_complex,
                                  grad = _grad_broadcast_unary_normal)
FloorInplaceTester = makeBroadcastTester(op = inplace.floor_inplace,
                                         expected = lambda a: numpy.asarray(numpy.floor(a),a.dtype),
-                                         good = _good_broadcast_unary_normal,
+                                         good = _good_broadcast_unary_normal_no_complex,
                                         grad = _grad_broadcast_unary_normal,
                                         inplace = True)
RoundHalfToEvenTester = makeBroadcastTester(op = round_half_to_even,
                                            expected = numpy.round,
-                                            good = _good_broadcast_unary_normal_float)
+                                            good = _good_broadcast_unary_normal_float_no_complex)
+# TODO: why are complex values accepted in the next one?
RoundHalfToEvenInplaceTester = makeBroadcastTester(op = inplace.round_half_to_even_inplace,
                                                   expected = numpy.round,
                                                   good = _good_broadcast_unary_normal_float,
@@ -495,10 +541,10 @@ RoundHalfToEvenInplaceTester = makeBroadcastTester(op = inplace.round_half_to_ev
#This happen in float32 mode.
RoundHalfAwayFromZeroTester = makeBroadcastTester(op = round_half_away_from_zero,
                                                  expected = theano.scalar.basic.round_half_away_from_zero_vec,
-                                                  good = _good_broadcast_unary_normal_float_no_empty)
+                                                  good = _good_broadcast_unary_normal_float_no_empty_no_complex)#_good_broadcast_unary_normal_float)
RoundHalfAwayFromZeroInplaceTester = makeBroadcastTester(op = inplace.round_half_away_from_zero_inplace,
                                                         expected = theano.scalar.basic.round_half_away_from_zero_vec,
-                                                         good = _good_broadcast_unary_normal_float_no_empty,
+                                                         good = _good_broadcast_unary_normal_float_no_empty_no_complex,
                                                         inplace = True)
SqrTester = makeBroadcastTester(op = sqr,
@@ -524,10 +570,12 @@ ExpInplaceTester = makeBroadcastTester(op = inplace.exp_inplace,
_good_broadcast_unary_positive = dict(normal = (rand_ranged(0.001, 5, (2, 3)),),
                                      integers = (randint_ranged(1, 5, (2, 3)),),
+                                      complex = (randc128_ranged(1, 5, (2,3)),),
                                      empty = (numpy.asarray([]),),
                                      )
_grad_broadcast_unary_positive = dict(normal = (rand_ranged(0.001, 5, (2, 3)),),
+                                      #complex = (randc128_ranged(1, 5, (2,3)),),
                                      #empty = (numpy.asarray([]),),
                                      )
@@ -586,9 +634,11 @@ SqrtInplaceTester = makeBroadcastTester(op = inplace.sqrt_inplace,
_good_broadcast_unary_wide = dict(normal = (rand_ranged(-1000, 1000, (2, 3)),),
                                  integers = (randint_ranged(-1000, 1000, (2, 3)),),
+                                  complex = (randc128_ranged(-1000, 1000, (2, 3)),),
                                  empty = (numpy.asarray([]),),)
_grad_broadcast_unary_wide = dict(normal = (rand_ranged(-1000, 1000, (2, 3)),),
+                                  #complex = (randc128_ranged(-1000, 1000, (2, 3)),),
                                  #empty = (numpy.asarray([]),),
                                  )
@@ -617,7 +667,7 @@ tan_grad_rtol = None
if config.floatX=='float32':
    # We raise the relative tolerance for the grad as there is error in float32.
    # This is probably caused by our way of computing the gradient error.
-    tan_grad_rtol = 0.047
+    tan_grad_rtol = 0.052
TanTester = makeBroadcastTester(op = tan,
                                expected = numpy.tan,
                                good = dict(normal = (rand_ranged(-3.14, 3.14, (2, 3)),),
@@ -667,6 +717,8 @@ TanhInplaceTester = makeBroadcastTester(op = inplace.tanh_inplace,
# Inplace ops when the input is integer and the output is float*
# don't have a well defined behavior. We don't test that case.
+_good_broadcast_unary_normal_no_int_no_complex = _good_broadcast_unary_normal_no_complex.copy()
+del _good_broadcast_unary_normal_no_int_no_complex['integers']
_good_broadcast_unary_normal_no_int = _good_broadcast_unary_normal.copy()
del _good_broadcast_unary_normal_no_int['integers']
...@@ -712,13 +764,13 @@ else: ...@@ -712,13 +764,13 @@ else:
empty = (numpy.asarray([]),),) empty = (numpy.asarray([]),),)
ErfcTester = makeBroadcastTester(op = erfc, ErfcTester = makeBroadcastTester(op = erfc,
expected = expected, expected = expected,
good = _good_broadcast_unary_normal, good = _good_broadcast_unary_normal_no_int_no_complex,
grad = _grad_broadcast_unary_normal, grad = _grad_broadcast_unary_normal,
eps = 2e-10, eps = 2e-10,
mode = mode_no_scipy) mode = mode_no_scipy)
ErfcInplaceTester = makeBroadcastTester(op = inplace.erfc_inplace, ErfcInplaceTester = makeBroadcastTester(op = inplace.erfc_inplace,
expected = expected, expected = expected,
good = _good_broadcast_unary_normal_no_int, good = _good_broadcast_unary_normal_no_int_no_complex,
grad = _grad_broadcast_unary_normal, grad = _grad_broadcast_unary_normal,
eps = 2e-10, eps = 2e-10,
mode = mode_no_scipy, mode = mode_no_scipy,
@@ -732,6 +784,9 @@ DotTester = makeTester(name = 'DotTester',
                        good = dict(correct1 = (rand(5, 7), rand(7, 5)),
                                    correct2 = (rand(5, 7), rand(7, 9)),
                                    correct3 = (rand(5, 7), rand(7)),
+                                   complex1 = (randcomplex(5, 7), randcomplex(7)),
+                                   complex2 = (rand(5, 7), randcomplex(7)),
+                                   complex3 = (randcomplex(5, 7), rand(7)),
                                    empty = (numpy.asarray([]), numpy.asarray([])),),
                        bad_build = dict(),
                        bad_runtime = dict(bad1 = (rand(5, 7), rand(5, 7)),
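The three `complex*` cases added to `DotTester` pair real and complex operands in a matrix–vector product. Stripped of the Theano test harness, what they exercise reduces to the following NumPy-only sketch (the `rand`/`randcomplex` helpers are reimplemented here for illustration, matching only the shapes used in the test data):

```python
import numpy

rng = numpy.random.RandomState(42)

def rand(*shape):
    # Uniform random real array, as in the test helpers.
    return rng.rand(*shape)

def randcomplex(*shape):
    # Complex array with random real and imaginary parts.
    return rng.rand(*shape) + 1j * rng.rand(*shape)

# Mirror the complex1..complex3 operand combinations from DotTester.
for a, b in [(randcomplex(5, 7), randcomplex(7)),
             (rand(5, 7), randcomplex(7)),
             (randcomplex(5, 7), rand(7))]:
    out = numpy.dot(a, b)
    # A (5, 7) by (7,) product yields a length-5 vector.
    assert out.shape == (5,)
    # The result is complex whenever either operand is complex.
    assert numpy.iscomplexobj(out)
```

The point of the added cases is exactly this mixed-dtype coverage: `dot` must upcast correctly when only one of its two inputs is complex.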
@@ -1754,52 +1809,58 @@ class T_Join_and_Split(unittest.TestCase):
 class test_comparison(unittest.TestCase):
     def test_gt(self):
-        x, y = fvector(), fvector()
-        fn = inplace_func([x,y], x > y)
-        l = numpy.asarray([0.,-1.,1.], dtype='float32')
-        r = numpy.asarray([0.,1.,-1.], dtype='float32')
-        v = fn(l, r)
-        self.failUnless(numpy.all(v == (l > r)), (v, (l>r)))
+        for dtype in ['float64', 'float32', 'complex64', 'complex128']:
+            x, y = vector(dtype=dtype), vector(dtype=dtype)
+            fn = inplace_func([x,y], x > y)
+            l = numpy.asarray([0.,-1.,1.], dtype=dtype)
+            r = numpy.asarray([0.,1.,-1.], dtype=dtype)
+            v = fn(l, r)
+            self.failUnless(numpy.all(v == (l > r)), (v, (l>r)))
 
     def test_lt(self):
-        x, y = fvector(), fvector()
-        fn = inplace_func([x,y], x < y)
-        l = numpy.asarray([0.,-1.,1.], dtype='float32')
-        r = numpy.asarray([0.,1.,-1.], dtype='float32')
-        v = fn(l, r)
-        self.failUnless(numpy.all(v == (l < r)), (v, (l<r)))
+        for dtype in ['float64', 'float32', 'complex64', 'complex128']:
+            x, y = vector(dtype=dtype), vector(dtype=dtype)
+            fn = inplace_func([x,y], x < y)
+            l = numpy.asarray([0.,-1.,1.], dtype=dtype)
+            r = numpy.asarray([0.,1.,-1.], dtype=dtype)
+            v = fn(l, r)
+            self.failUnless(numpy.all(v == (l < r)), (v, (l<r)))
 
     def test_le(self):
-        x, y = fvector(), fvector()
-        fn = inplace_func([x,y], x <= y)
-        l = numpy.asarray([0.,-1.,1.], dtype='float32')
-        r = numpy.asarray([0.,1.,-1.], dtype='float32')
-        v = fn(l, r)
-        self.failUnless(numpy.all(v == (l <= r)), (v, (l<=r)))
+        for dtype in ['float64', 'float32', 'complex64', 'complex128']:
+            x, y = vector(dtype=dtype), vector(dtype=dtype)
+            fn = inplace_func([x,y], x <= y)
+            l = numpy.asarray([0.,-1.,1.], dtype=dtype)
+            r = numpy.asarray([0.,1.,-1.], dtype=dtype)
+            v = fn(l, r)
+            self.failUnless(numpy.all(v == (l <= r)), (v, (l<=r)))
 
     def test_ge(self):
-        x, y = fvector(), fvector()
-        fn = inplace_func([x,y], x >= y)
-        l = numpy.asarray([0.,-1.,1.], dtype='float32')
-        r = numpy.asarray([0.,1.,-1.], dtype='float32')
-        v = fn(l, r)
-        self.failUnless(numpy.all(v == (l >= r)), (v, (l>=r)))
+        for dtype in ['float64', 'float32', 'complex64', 'complex128']:
+            x, y = vector(dtype=dtype), vector(dtype=dtype)
+            fn = inplace_func([x,y], x >= y)
+            l = numpy.asarray([0.,-1.,1.], dtype=dtype)
+            r = numpy.asarray([0.,1.,-1.], dtype=dtype)
+            v = fn(l, r)
+            self.failUnless(numpy.all(v == (l >= r)), (v, (l>=r)))
 
     def test_eq(self):
-        x, y = fvector(), fvector()
-        fn = inplace_func([x,y], eq(x,y))
-        l = numpy.asarray([0.,-1.,1.], dtype='float32')
-        r = numpy.asarray([0.,1.,-1.], dtype='float32')
-        v = fn(l, r)
-        self.failUnless(numpy.all(v == (l == r)), (v, (l==r)))
+        for dtype in ['float64', 'float32', 'complex64', 'complex128']:
+            x, y = vector(dtype=dtype), vector(dtype=dtype)
+            fn = inplace_func([x,y], eq(x,y))
+            l = numpy.asarray([0.,-1.,1.], dtype=dtype)
+            r = numpy.asarray([0.,1.,-1.], dtype=dtype)
+            v = fn(l, r)
+            self.failUnless(numpy.all(v == (l == r)), (v, (l==r)))
 
     def test_neq(self):
-        x, y = fvector(), fvector()
-        fn = inplace_func([x,y], neq(x, y))
-        l = numpy.asarray([0.,-1.,1.], dtype='float32')
-        r = numpy.asarray([0.,1.,-1.], dtype='float32')
-        v = fn(l, r)
-        self.failUnless(numpy.all(v == (l != r)), (v, (l!=r)))
+        for dtype in ['float64', 'float32', 'complex64', 'complex128']:
+            x, y = vector(dtype=dtype), vector(dtype=dtype)
+            fn = inplace_func([x,y], neq(x, y))
+            l = numpy.asarray([0.,-1.,1.], dtype=dtype)
+            r = numpy.asarray([0.,1.,-1.], dtype=dtype)
+            v = fn(l, r)
+            self.failUnless(numpy.all(v == (l != r)), (v, (l!=r)))
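Every rewritten comparison test follows the same loop-over-dtypes pattern on the same two input arrays. Stripped of the Theano graph machinery, one iteration reduces to the following NumPy-only check (restricted to the float dtypes here for simplicity, with the expected truth values written out explicitly):

```python
import numpy

# The two arrays used throughout test_comparison, and the elementwise
# result each operator should produce on them.
expected = {
    numpy.greater:       [False, False, True],   # l > r
    numpy.less:          [False, True,  False],  # l < r
    numpy.less_equal:    [True,  True,  False],  # l <= r
    numpy.greater_equal: [True,  False, True],   # l >= r
    numpy.equal:         [True,  False, False],  # l == r
    numpy.not_equal:     [False, True,  True],   # l != r
}

for dtype in ['float64', 'float32']:
    l = numpy.asarray([0., -1., 1.], dtype=dtype)
    r = numpy.asarray([0., 1., -1.], dtype=dtype)
    for op, want in expected.items():
        # The compiled Theano function must agree elementwise with
        # NumPy's own comparison; here NumPy stands in for both sides.
        assert numpy.all(op(l, r) == numpy.asarray(want))
```

The reference values on each line are what `self.failUnless(numpy.all(v == (l > r)), ...)` and its siblings verify against the compiled function's output.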
class test_bitwise(unittest.TestCase): class test_bitwise(unittest.TestCase):
def test_or(self): def test_or(self):
......
@@ -876,8 +876,7 @@ class test_fusion(unittest.TestCase):
         self.do(mode, cuda.float32_shared_constructor, shp, gpu=True)
 
-    def test_gpu_fusion_3d(self):
-        shp=(5,5,5)
+    def test_gpu_fusion_Xd(self):
         #we need the optimisation enabled, debug do this.
         if theano.config.mode == "FAST_COMPILE":
             mode = theano.compile.mode.get_mode("FAST_RUN").including('local_elemwise_fusion','canonicalize','gpu')
@@ -886,7 +885,10 @@ class test_fusion(unittest.TestCase):
         import theano.sandbox.cuda as cuda
         if not cuda.cuda_available:
             raise SkipTest("cuda not available")
+        if cuda.opt.int_size == 4:
+            shp=(5,5,5,5)
+        else:
+            shp=(5,5,5)
         self.do(mode, cuda.float32_shared_constructor, shp, gpu=True)
 
     def speed_fusion(self, shared_fn = shared, gpu = False, s=None):
@@ -2174,3 +2176,15 @@ def test_local_mul_to_neg():
     aval = numpy.random.randint(0,10,(2,2)).astype('int32')
     assert f1(aval).dtype == a.dtype
     assert f2(aval).dtype == 'float64'
+
+def test_local_add_specialize():
+    # test of non-zero dimension
+    a = TT.vector()
+    s = TT.add(TT.zeros_like(a))
+    assert local_add_specialize.transform(s.owner)
+
+    # test of 0-d
+    a = TT.scalar()
+    s = TT.add(TT.zeros_like(a))
+    assert local_add_specialize.transform(s.owner)
@@ -231,6 +231,7 @@ def makeSharedTester(shared_constructor_,
         total = self.theano_fct(x_shared)
 
         total_func = theano.function([],total)
+        total_func()
 
         values_to_div = .5
         if self.op_by_matrix:
@@ -418,6 +419,7 @@ def makeSharedTester(shared_constructor_,
         #Test that we can replace with values of the different shape
         # but that will raise an error in some case, but not all
+        specify_shape_fct()
         x1_shared.set_value(x2)
         self.assertRaises(AssertionError, specify_shape_fct)
@@ -450,6 +452,7 @@ def makeSharedTester(shared_constructor_,
         assert numpy.allclose(self.ref_fct(x1_shared.value), self.ref_fct(x1_2))
 
         shape_op_fct = theano.function([],x1_shared.shape)
         topo = shape_op_fct.maker.env.toposort()
+        shape_op_fct()
         if theano.config.mode!='FAST_COMPILE':
             assert len(topo)==3
             assert isinstance(topo[0].op,tensor.opt.Shape_i)
@@ -458,6 +461,7 @@ def makeSharedTester(shared_constructor_,
         #Test that we forward the input
         specify_shape_fct = theano.function([],x1_specify_shape)
+        specify_shape_fct()
         #theano.printing.debugprint(specify_shape_fct)
         assert numpy.all(self.ref_fct(specify_shape_fct())
                          == self.ref_fct(x1_2))