Commit 6026b300 authored by Olivier Delalleau

Merged

@@ -2,7 +2,7 @@ Trunk since last release
------
* Sparse types are now supported by the Shape op, and the ShapeFeature optimizer works correctly with them.
* Fuse GpuElemwise more often (previously, when there were too many inputs, fusing all of them would exceed the 256-byte limit on parameters passed to a GPU function).
* Speed up gemv by working around scipy's gemv slowness when the matrix is in C order (the default).

Theano 0.3 (2010-11-23)
-----------------------
......
@@ -42,7 +42,8 @@ to be installed:

A `BLAS`_ installation (with Level 3 functionality)
   Including the development headers (``-dev``, ``-devel``, depending on
   your Linux distribution). Mac OS X comes with the `Accelerate
   framework`_ built in, and various options exist for Windows (see
   below).

.. _BLAS: http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
.. _Accelerate framework: http://developer.apple.com/performance/accelerateframework.html
@@ -380,8 +381,8 @@ that fail on your platform (use the ``theano-users@googlegroups.com`` mailing li
but note that you must first register to it, by going to `theano-users`_).

Windows V1 (Installing from Scratch)
------------------------------------

- Install `Python(x,y) <http://www.pythonxy.com>`_ in a directory without blank
  spaces in the name (in particular not into ``C:\Program Files``).
@@ -437,25 +438,25 @@ Windows V1 (bigger install, but simpler instructions + tentative GPU instruction

      print theano.config.blas.ldflags

  This should print the same content as in your config file, i.e. nothing
  (if your config file was not read properly, it would print ``-lblas``, and
  trying to compile any Theano function would result in a compilation error
  due to the system being unable to find ``blas.dll``).
Windows: Using a Faster BLAS
----------------------------

If you want a faster and/or multithreaded BLAS library, you can
compile GotoBLAS2 (ATLAS may work too, but was not tested, and is
usually reported to be slower and more difficult to compile, especially
on Windows).

GotoBLAS2 can be downloaded
`here <http://www.tacc.utexas.edu/tacc-projects/gotoblas2/downloads>`_
after registering on the website (we tested v1.13).
To compile it, you will also need to install MSYS and Perl,
as described below.
The GotoBLAS makefiles actually expect a full UNIX environment (like
Cygwin), but the BLAS compilation seems to work with only MSYS and Perl
(the LAPACK compilation fails, but Theano does not need it).

a) Download the mingw-get command-line installer from the
   `MinGW files <http://sourceforge.net/projects/mingw/>`_ (click
@@ -479,12 +480,12 @@ Windows V1.5 (optional follow-up to V1 instructions)

      /postinstall/pi.sh

   It will ask for your MinGW installation directory (e.g.
   ``c:/pythonxy/mingw``).

e) Download `ActivePerl <http://www.activestate.com/activeperl/downloads>`_ and
   install it (other Perl interpreters should also work).

f) Unpack GotoBLAS2, either using `7-zip <http://www.7-zip.org/>`_ or in
   MSYS with:

   .. code-block:: bash
@@ -500,47 +501,61 @@ Windows V1.5 (optional follow-up to V1 instructions)

      quickbuild.win32 1>log.txt 2>err.txt
   Compilation should take a few minutes. Afterwards, you will probably
   find many error messages in err.txt, but there should be an ``exports``
   folder containing, in particular, ``libgoto2.dll``.

i) Copy ``libgoto2.dll`` from the ``exports`` folder to ``pythonxy\mingw\bin``
   and ``pythonxy\mingw\lib``.

j) Modify your .theanorc (or .theanorc.txt) with ``ldflags = -lgoto2``.
   This setting can also be changed in Python for testing purposes (in which
   case it will only remain in effect for the duration of your Python session):

   .. code-block:: python

      theano.config.blas.ldflags = "-lgoto2"
k) To test the BLAS performance, you can run the script
   ``theano/misc/check_blas.py``.
   Note that you may control the number of threads used by GotoBLAS2 with
   the ``GOTO_NUM_THREADS`` environment variable (the default behavior is
   to use all available cores).

   Here are some performance results on an Intel Core2 Duo 1.86 GHz,
   compared to using NumPy's BLAS or the un-optimized standard BLAS
   (compiled manually from its source code):

   * GotoBLAS2 (2 threads): 16s
   * NumPy (1 thread): 48s
   * Standard BLAS (un-optimized, 1 thread): 166s

   Conclusions:

   * The un-optimized standard BLAS is very slow and should not be used.
   * The Windows binaries of NumPy were compiled with ATLAS and are surprisingly fast.
   * GotoBLAS2 is even faster, in particular if you can use multiple cores.
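A rough sketch of this kind of timing from Python (this is not the ``check_blas.py`` script itself; the matrix size and loop count are arbitrary choices for illustration — the real script uses its own, larger settings):

```python
import time

import numpy

# Time repeated matrix-matrix products through whatever BLAS
# NumPy is linked against. Size and repetition count are arbitrary.
n = 500
A = numpy.random.rand(n, n)
B = numpy.random.rand(n, n)

start = time.time()
for _ in range(10):
    C = A.dot(B)
elapsed = time.time() - start
print("10 matrix products of size %d took %.3f seconds" % (n, elapsed))
```

Comparing this number before and after switching the BLAS library gives a quick, if crude, indication of the speedup.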
Windows: Using the GPU
----------------------

Please note that these are tentative instructions (we have not yet been able to
get the GPU to work under Windows with Theano).
Please report your own successes / failures on the
`theano-users <http://groups.google.com/group/theano-users>`_ mailing list.

These are instructions for the 32-bit version of Python (the one that comes
with Python(x,y) is 32-bit).

Blanks and non-ASCII characters are not always supported in paths. Python
supports them, but nvcc (at least version 3.1) does not.
If your ``USERPROFILE`` directory (the one you get into when you run ``cmd``)
contains such characters, you must edit your Theano configuration file to
use a compilation directory located somewhere else:

.. code-block:: cfg

    [global]
    base_compiledir=path_to_a_directory_without_such_characters

You also need to add the following lines to the configuration file:

.. code-block:: cfg
@@ -578,8 +593,10 @@ Windows V1.5 (optional follow-up to V1 instructions)

run the program nosetests inside the Theano repository.
nosetests is installed by Python(x,y).

Windows V2: Installing Python Components Individually
-----------------------------------------------------

DISCLAIMER: These are old installation instructions (to be revised).

Running Theano under Windows is currently achieved by using the `MinGW
<http://www.mingw.org>`__ port of the GCC compiler.
......
@@ -46,6 +46,9 @@ class Images2Neibs(Op):

        return Apply(self, [ten4, neib_shape, neib_step], [T.matrix(dtype=ten4.type.dtype)])

    def grad(self, (x, neib_shape, neib_step), (gz,)):
        return [neibs2images(gz, neib_shape, x.shape), None, None]

    def c_code_cache_version(self):
        return (3,)
@@ -211,36 +214,16 @@ class Images2Neibs(Op):

def images2neibs(ten4, neib_shape, neib_step=None, mode='valid'):
    return Images2Neibs(mode)(ten4, neib_shape, neib_step)

def neibs2images(neibs, neib_shape, original_shape):
    """
    Inverse of images2neibs.

    neibs : matrix like the one obtained by images2neibs
    neib_shape : neib_shape that was used in images2neibs
    original_shape : original shape of the 4d tensor given to images2neibs

    Return a 4d tensor of shape `original_shape`.
    """
    neibs = T.as_tensor_variable(neibs)
    neib_shape = T.as_tensor_variable(neib_shape)
    original_shape = T.as_tensor_variable(original_shape)
......
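For context, here is a plain-NumPy sketch of what ``images2neibs`` computes in 'valid' mode with non-overlapping 2x2 patches (shapes chosen arbitrarily; this mirrors, but is not, the Op's implementation):

```python
import numpy

# A 4d "images" tensor (batch, channels, rows, cols).
images = numpy.arange(16).reshape(1, 1, 4, 4)

# Split the 4x4 grid into 2x2 blocks, then flatten each block into one
# row, mimicking images2neibs(images, (2, 2)) in 'valid' mode.
patches = (images.reshape(1, 1, 2, 2, 2, 2)
                 .transpose(0, 1, 2, 4, 3, 5)
                 .reshape(-1, 4))

print(patches[0])  # the top-left 2x2 block, flattened
```

``neibs2images`` performs the inverse mapping, rebuilding the 4d tensor from the matrix of patches.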
@@ -7,6 +7,8 @@ from neighbours import images2neibs, neibs2images, Images2Neibs, GpuImages2Neibs

from nose.plugins.skip import SkipTest
import theano.sandbox.cuda as cuda
from theano.tests import unittest_tools

if theano.config.mode=='FAST_COMPILE':
    mode_with_gpu = theano.compile.mode.get_mode('FAST_RUN').including('gpu')
    mode_without_gpu = theano.compile.mode.get_mode('FAST_RUN').excluding('gpu')

@@ -328,8 +330,65 @@ def speed_neibs_wrap_centered():

    for i in range(1000):
        f()
def test_neibs_grad():
    shape = (2,3,4,4)
    images = T.shared(numpy.arange(numpy.prod(shape), dtype='float32').reshape(shape))
    cost = T.sum(T.sqr(images2neibs(images, (2,2))), axis=[0,1])
    grad = T.grad(cost, images)
    f = theano.function([], [cost, grad], mode=mode_without_gpu)
    got = f()
    should_get = [numpy.asarray(290320.0, dtype=numpy.float32),
                  numpy.asarray([[[[  0.,   2.,   4.,   6.],
                                   [  8.,  10.,  12.,  14.],
                                   [ 16.,  18.,  20.,  22.],
                                   [ 24.,  26.,  28.,  30.]],
                                  [[ 32.,  34.,  36.,  38.],
                                   [ 40.,  42.,  44.,  46.],
                                   [ 48.,  50.,  52.,  54.],
                                   [ 56.,  58.,  60.,  62.]],
                                  [[ 64.,  66.,  68.,  70.],
                                   [ 72.,  74.,  76.,  78.],
                                   [ 80.,  82.,  84.,  86.],
                                   [ 88.,  90.,  92.,  94.]]],
                                 [[[ 96.,  98., 100., 102.],
                                   [104., 106., 108., 110.],
                                   [112., 114., 116., 118.],
                                   [120., 122., 124., 126.]],
                                  [[128., 130., 132., 134.],
                                   [136., 138., 140., 142.],
                                   [144., 146., 148., 150.],
                                   [152., 154., 156., 158.]],
                                  [[160., 162., 164., 166.],
                                   [168., 170., 172., 174.],
                                   [176., 178., 180., 182.],
                                   [184., 186., 188., 190.]]]], dtype=numpy.float32)]
    assert numpy.allclose(got[0], should_get[0])
    assert numpy.allclose(got[1], should_get[1])
def test_neibs_grad_verify_grad():
    shape = (2,3,4,4)
    images = T.dtensor4()
    images_val = numpy.arange(numpy.prod(shape), dtype='float32').reshape(shape)

    def fn(images):
        return T.sum(T.sqr(images2neibs(images, (2,2))), axis=[0,1])
    unittest_tools.verify_grad(fn, [images_val])

if __name__ == '__main__':
    #test_neibs_gpu()
    #test_neibs()
    test_neibs_grad_verify_grad()
@@ -85,10 +85,15 @@ class Gemv(Op):

    def perform(self, node, inputs, out_storage):
        y, alpha, A, x, beta = inputs
        if _have_fblas:
            if not self.inplace:
                y = y.copy()
            gemv = _blas_gemv_fns[y.dtype]
            # Here we assume that A is in C order. If we do not explicitly
            # pass it in Fortran order, scipy 0.7.2 seems to create a copy
            # in Fortran order instead of just reshaping it and using the
            # trans flag.
            # If A is already in Fortran order, making it C order and using
            # the trans flag does not seem to cause a slowdown.
            #out_storage[0][0] = gemv(alpha, A, x, beta, y, overwrite_y=self.inplace)
            out_storage[0][0] = gemv(alpha, A.T, x, beta, y, overwrite_y=self.inplace, trans=True)
        else:
            out_storage[0][0] = numpy.asarray(
                beta * y + alpha * numpy.dot(A, x)
......
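The workaround above hinges on memory-layout facts that can be checked directly in plain NumPy (a sketch only, not the scipy fblas call itself): transposing a C-ordered matrix yields a Fortran-ordered view with no copy, and the trans flag then undoes the transpose mathematically.

```python
import numpy

# A C-contiguous matrix (NumPy's default layout).
A = numpy.arange(6, dtype=numpy.float64).reshape(2, 3)
assert A.flags['C_CONTIGUOUS']

# A.T is a Fortran-ordered *view*: no data is copied, which is what
# lets the gemv call above avoid the expensive layout-conversion copy.
assert A.T.flags['F_CONTIGUOUS']

# Asking BLAS for trans(A.T) . x is mathematically the same as A . x.
x = numpy.ones(3)
assert numpy.allclose(A.T.T.dot(x), numpy.dot(A, x))
```

So passing ``A.T`` with ``trans=True`` computes the same product while handing scipy a Fortran-ordered array, which is the fast path.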
@@ -1155,8 +1155,40 @@ class Prod(CAReduce):

    def grad(self, (x, ), (gz, )):
        if x.dtype[0:3] in ('int','uin'):
            return [None]

        prod_out = self(x)
        gz = as_tensor_variable(gz)
        axis = self.axis
        if axis is None:
            axis = range(x.type.ndim)
        if axis == ():
            return gz,
        new_dims = []
        i = 0
        for j, _ in enumerate(x.type.broadcastable):
            if j in axis:
                new_dims.append('x')
            else:
                new_dims.append(i)
                i += 1
        # fill a matrix with the same shape as x by broadcasting
        # values taken from gz, which has the same shape as the output
        # of prod().
        gz_filled_x = Elemwise(scalar.second)(x,
                DimShuffle(gz.type.broadcastable, new_dims)(gz))
        # do the same with the output of prod, by broadcasting along
        # the axes where the product was taken
        prod_out_filled_x = Elemwise(scalar.second)(x,
                DimShuffle(prod_out.type.broadcastable,
                           new_dims)(prod_out))
        return [theano.tensor.mul(gz_filled_x,
                theano.tensor.true_div(prod_out_filled_x, x))]
        #else:
        #    raise NotImplementedError('Will be implemented shortly')
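The graph built above implements the identity d prod(x)/dx_i = prod(x) / x_i, broadcast back to the shape of x (valid when no entry of x is zero). A quick NumPy sanity check of that identity, taking the incoming gradient gz to be 1, as for the gradient of ``prod(x).sum()``:

```python
import numpy

x = numpy.array([[1., 2., 3.],
                 [4., 5., 6.]])
p = x.prod(axis=0)

# Same recipe as the Op: broadcast prod(x) back over the reduced axis,
# then divide elementwise by x (assumes no zeros in x).
grad = p[None, :] / x

# Compare one entry against a finite-difference estimate.
eps = 1e-6
x_pert = x.copy()
x_pert[1, 0] += eps
numeric = (x_pert.prod(axis=0)[0] - p[0]) / eps
assert abs(numeric - grad[1, 0]) < 1e-3
```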
    def __str__(self):
        if self.axis is None:
......
@@ -459,9 +459,32 @@ class ShapeFeature(object):

    to promise that inputs will have a certain shape (or even to have certain shapes in
    certain dimensions). We can't automatically infer the shape of a shared variable,
    as it can change shape during execution by default.
    (NOT IMPLEMENTED YET, BUT IS IN TRAC)

    Using Shape information in Optimizations
    ========================================

    To use this shape information in OPTIMIZATIONS, use the ``shape_of`` dictionary.
    For example:

    .. code-block:: python

        try:
            shape_of = node.env.shape_feature.shape_of
        except AttributeError:
            # This can happen when the compilation mode doesn't include the ShapeFeature.
            return

        shape_of_output_zero = shape_of[node.outputs[0]]

    The ``shape_of_output_zero`` variable will contain a tuple, whose elements are
    either integers or symbolic integers.

    TODO: check whether the symbols are necessarily non-constant, or whether integer
    literals are sometimes Theano constants. That would be confusing.
    """
    def shape_i(self, i):

    def op_deco(r):
......
@@ -687,27 +687,55 @@ def test_dot_mv():

def test_gemv1():
    ''' test vector1+dot(matrix,vector2) '''
    v1 = theano.shared(numpy.array(numpy.random.rand(2), dtype='float32'))
    v2_orig = numpy.array(numpy.random.rand(2), dtype='float32')
    v2 = theano.shared(v2_orig)
    m = theano.shared(numpy.array(numpy.random.rand(2,2), dtype='float32'))

    f = theano.function([], v2+theano.dot(m,v1), mode=mode_blas_opt)

    # Assert they produce the same output
    assert numpy.allclose(f(), numpy.dot(m.value,v1.value)+v2_orig)
    topo = f.maker.env.toposort()
    assert len(topo)==1
    assert isinstance(topo[0].op, Gemv)
    assert topo[0].op.inplace==False

    # test the inplace version
    f = theano.function([], [], updates={v2:v2+theano.dot(m,v1)},
                        mode=mode_blas_opt)

    # Assert they produce the same output
    f()
    assert numpy.allclose(v2.value, numpy.dot(m.value,v1.value)+v2_orig)
    topo = f.maker.env.toposort()
    assert len(topo)==1
    assert isinstance(topo[0].op, Gemv)
    assert topo[0].op.inplace==True
def test_gemv2():
    ''' test vector1+dot(vector2,matrix) '''
    v1 = theano.shared(numpy.array(numpy.random.rand(2), dtype='float32'))
    v2_orig = numpy.array(numpy.random.rand(2), dtype='float32')
    v2 = theano.shared(v2_orig)
    m = theano.shared(numpy.array(numpy.random.rand(2,2), dtype='float32'))

    f = theano.function([], v2+theano.dot(v1,m), mode=mode_blas_opt)

    # Assert they produce the same output
    assert numpy.allclose(f(), numpy.dot(v1.value,m.value)+v2.value)
    topo = f.maker.env.toposort()
    assert sum(isinstance(node.op, Gemv) for node in topo)==1
    assert topo[-1].op.inplace==False

    # test the inplace version
    f = theano.function([], [], updates={v2:v2+theano.dot(v1,m)},
                        mode=mode_blas_opt)

    # Assert they produce the same output
    f()
    assert numpy.allclose(v2.value, numpy.dot(v1.value, m.value)+v2_orig)
    topo = f.maker.env.toposort()
    assert sum(isinstance(node.op, Gemv) for node in topo)==1
    assert topo[0].op.inplace==True
@@ -254,5 +254,52 @@ class test_CAReduce(unittest.TestCase):

        #self.with_linker(gof.CLinker(), and_)

class test_Prod(unittest.TestCase):
    def setUp(self):
        unittest_tools.seed_rng()

    def test_prod_grad(self):
        x_val = numpy.asarray([[1,2,3],[4,5,6],[7,8,9]], dtype='float32')
        x = theano.tensor.dmatrix()
        p = Prod(axis=0)(x)

        # sanity check
        fn = theano.function([x], [p])
        assert numpy.allclose(fn(x_val), numpy.array([28., 80., 162.]))

        # very basic case for the product; no broadcasting in x
        g = theano.tensor.grad(p.sum(), x)
        g_fn = theano.function([x], g)
        assert numpy.allclose(g_fn(x_val),
                numpy.asarray([[28.,40.,54.],[7.,16.,27.],[4.,10.,18.]]))

        # now with some transposition of the input
        x_bc = x.dimshuffle(1, 0)
        p_bc = Prod(axis=0)(x_bc)
        p_bc_sum = p_bc.sum()
        g_bc = theano.tensor.grad(p_bc_sum, x)
        g_fn_bc = theano.function([x], [p_bc, g_bc])
        p_bc_ret, g_bc_ret = g_fn_bc(x_val)
        assert numpy.allclose(p_bc_ret, numpy.array([6., 120., 504.]))
        assert numpy.allclose(g_bc_ret,
                numpy.asarray([[6.,3.,2.],[30.,24.,20.],[72.,63.,56.]]))

    def test_verify_grad(self):
        x_val = numpy.asarray([[1,2,3],[4,5,6],[7,8,9]], dtype='float32')
        x = theano.tensor.dmatrix()

        # now with verify_grad
        unittest_tools.verify_grad(Prod(axis=0), [x_val])

        # second time, with some added complexity
        # (verify_grad takes the sum of the matrix anyway)
        def fn(x2):
            return theano.tensor.sqr(Prod(axis=0)(x2))
        unittest_tools.verify_grad(fn, [x_val])

if __name__ == '__main__':
    unittest.main()
    #suite = unittest.TestSuite([test_Prod('test_prod_grad')])
    #unittest.TextTestRunner().run(suite)