merge

b4c881d1 · Dumitru Erhan · a25706e8 · 86ec00c0 · b4c881d1 · b4c881d1
--- a/doc/library/config.txt
+++ b/doc/library/config.txt
@@ -43,8 +43,10 @@ Environment Variables

 .. envvar:: THEANO_FLAGS

-    This is a list of comma-delimited key[=value] pairs that control Theano's behavior. A key that appears without an '=value' must be for a boolean value, and it acts as setting it to True.
-    
+    This is a list of comma-delimited key[=value] pairs that control
+    Theano's behavior. A key that appears without an '=value' must be
+    for a boolean value, and it acts as setting it to True.
+
    For example, in bash, you can override your :envvar:`THEANORC` defaults
    for <myscript>.py by typing this:

@@ -52,11 +54,15 @@ Environment Variables

        THEANO_FLAGS='floatX=float32,device=gpu0,nvcc.fastmath'  python <myscript>.py

+    If a value is defined several times in ``THEANO_FLAGS``,
+    the right-most definition is used. So, for instance, if
+    ``THEANO_FLAGS='device=cpu,device=gpu0'``, then gpu0 will be used.
+
 .. envvar:: THEANORC

    The location[s] of the .theanorc file[s] in ConfigParser format.
-    It defaults to ``$HOME/.theanorc``. 
-    
+    It defaults to ``$HOME/.theanorc``.
+
    Here is the .theanorc equivalent to the THEANO_FLAGS in the example above:

    .. code-block:: text
@@ -70,10 +76,10 @@ Environment Variables

    Multiple configuration files can be specified by separating them with ':'
    characters (as in $PATH).  Multiple configuration files will be merged,
-    with earlier (left-most) files taking priority over later files in the
+    with later (right-most) files taking priority over earlier files in the
    case that multiple files specify values for a common configuration option.
-    For example, to override system-wide settings with personal ones, 
-    set ``THEANORC=~/.theanorc:/etc/theanorc``
+    For example, to override system-wide settings with personal ones,
+    set ``THEANORC=/etc/theanorc:~/.theanorc``.

 The rest of this page describes some of the more common and important flags
 that you might want to use.  For the complete list (including documentation),

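Reviewer sketch of the right-most-wins rule documented above for ``THEANO_FLAGS``. This is a minimal illustration, not Theano's actual parser: the helper name ``parse_flags`` is hypothetical, and storing a bare key as the string ``'True'`` is an assumption based on the documented "acts as setting it to True" behaviour.

```python
def parse_flags(flags):
    """Parse a comma-delimited list of key[=value] pairs into a dict.

    Hypothetical helper, not part of Theano.
    """
    parsed = {}
    for item in flags.split(','):
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        key, sep, value = item.partition('=')
        # Assumption: a bare key stands for a boolean flag set to True.
        # Assigning into the dict means a later (right-most) definition
        # silently replaces an earlier one.
        parsed[key] = value if sep else 'True'
    return parsed


flags = parse_flags('floatX=float32,device=cpu,nvcc.fastmath,device=gpu0')
print(flags['device'])
```

Because later assignments overwrite earlier ones, appending a flag to an existing ``THEANO_FLAGS`` string is enough to override it.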
--- a/doc/tutorial/using_gpu.txt
+++ b/doc/tutorial/using_gpu.txt
@@ -58,7 +58,7 @@ file and run it.
    import numpy
    import time

-    vlen = 100000
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
@@ -74,28 +74,31 @@ The program just computes the exp() of a bunch of random numbers.
 Note that we use the `shared` function to
 make sure that the input `x` are stored on the graphics device.

-If I run this program (in thing.py) with device=cpu, my computer takes a little over 3 seconds, whereas on the GPU it takes just over 0.2 seconds.  Note that the results are close but not identical!  The GPU will not always produce the exact same floating-point numbers as the CPU.
+If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds,
+whereas on the GPU it takes just over 0.4 seconds.  Note that the results are close but not
+identical!  The GPU will not always produce the exact same floating-point numbers as the CPU.
+As a point of reference, a loop that calls ``numpy.exp(x.value)`` also takes about 7 seconds.

 .. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=cpu python thing.py 
-    Looping 100 times took 3.12647008896 seconds
-    Result is [ 1.23178032  1.61879341  1.52278065 ...,  1.74085572  2.55530456 1.88906098]
+    Looping 100 times took 7.17374897003 seconds
+    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753 1.62323285]

    bergstra@tikuanyin:~/tmp$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0 python thing.py 
    Using gpu device 0: GeForce GTX 285
-    Looping 100 times took 0.217401981354 seconds
-    Result is [ 1.23178029  1.61879349  1.52278066 ...,  1.74085569 2.55530477 1.88906097]
+    Looping 100 times took 0.418929815292 seconds
+    Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

 Returning a handle to device-allocated data
 -------------------------------------------

 The speedup is not greater in the example above because the function is
-returning its result as a numpy ndarray (which has already copied from the
-device to the host).  This is what makes it so easy to swap in device=gpu0, but
-if you want to be less portable, you can see a bigger speedup by changing
+returning its result as a numpy ndarray which has already been copied from the
+device to the host for your convenience.  This is what makes it so easy to swap in device=gpu0, but
+if you don't mind being less portable, you might prefer to see a bigger speedup by changing
 the graph to express a computation with a GPU-stored result.  The gpu_from_host
-op means "copy the input from the host to the gpu" and it is optimized away
+Op means "copy the input from the host to the gpu" and it is optimized away
 after the T.exp(x) is replaced by a GPU version of exp().

 .. code-block:: python
@@ -105,7 +108,7 @@ after the T.exp(x) is replaced by a GPU version of exp().
    import numpy
    import time

-    vlen = 100000
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
@@ -123,17 +126,71 @@ The output from this program is
 .. code-block:: text

    Using gpu device 0: GeForce GTX 285
-    Looping 100 times took 0.173671007156 seconds
+    Looping 100 times took 0.185714006424 seconds
    Result is <CudaNdarray object at 0x3e9e970>
-    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  1.74085569 2.55530477 1.88906097]
+    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

-Here we've shaved off about 20% of the run-time by simply not copying the
+Here we've shaved off about 50% of the run-time by simply not copying the
 resulting array back to the host.
 The object returned by each function call is now not a numpy array but a
 "CudaNdarray" which can be converted to a numpy ndarray by the normal
 numpy casting mechanism.


+Running the GPU at Full Speed
+------------------------------
+
+To really get maximum performance in this simple example, we need to use an :class:`Out`
+instance to tell Theano not to copy the output it returns to us.  Theano allocates memory for
+internal use like a working buffer, but by default it will never return a result that is
+allocated in the working buffer.  This is normally what you want, but our example is so simple
+that it has the unwanted side effect of really slowing things down.
+
+.. 
+    TODO:
+    The story here about copying and working buffers is misleading and potentially not correct
+    ... why exactly does borrow=True cut 75% of the runtime ???
+
+.. code-block:: python
+
+    from theano import function, config, shared, sandbox, Out
+    import theano.tensor as T
+    import numpy
+    import time
+
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
+    iters = 1000
+
+    rng = numpy.random.RandomState(22)
+    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
+    f = function([], 
+            Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
+                borrow=True))
+    t0 = time.time()
+    for i in xrange(iters):
+        r = f()
+    print 'Looping 100 times took', time.time() - t0, 'seconds'
+    print 'Result is', r
+    print 'Numpy result is', numpy.asarray(r)
+
+Running this version of the code takes just under 0.05 seconds, over 140x faster than
+the CPU implementation!
+
+.. code-block:: text
+
+    Using gpu device 0: GeForce GTX 285
+    Looping 100 times took 0.0497219562531 seconds
+    Result is <CudaNdarray object at 0x31eeaf0>
+    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]
+
+This version of the code, using ``borrow=True``, is slightly less safe: if we had saved
+the `r` returned from one function call, we would have to remember that its value might
+be overwritten by a subsequent function call.  Although ``borrow=True`` makes a dramatic
+difference in this example, be careful!  The advantage of ``borrow=True`` is much weaker
+in larger graphs, and there is a lot of potential for making a mistake by failing to
+account for the resulting memory aliasing.
+
+
 What can be accelerated on the GPU?
 ------------------------------------


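Reviewer sketch of the aliasing hazard that the new tutorial text warns about. It does not use Theano or a GPU: ``ExpFunction`` is a hypothetical toy class whose ``borrow`` flag only mimics the idea of returning the internal working buffer instead of a copy.

```python
import math


class ExpFunction:
    """Toy stand-in for a compiled function that owns a working buffer."""

    def __init__(self, size, borrow=False):
        self._buffer = [0.0] * size
        self.borrow = borrow

    def __call__(self, values):
        # Every call reuses the same working buffer.
        for i, v in enumerate(values):
            self._buffer[i] = math.exp(v)
        if self.borrow:
            return self._buffer      # no copy: the caller sees the buffer itself
        return list(self._buffer)    # default: the caller gets a private copy


fast = ExpFunction(2, borrow=True)
first = fast([0.0, 0.0])
second = fast([1.0, 1.0])
# 'first' and 'second' are the same object, so the first result is gone.
print(first is second, first == second)
```

With ``borrow=False`` the first result would survive the second call, at the cost of one copy per call.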
--- a/theano/compile/function_module.py
+++ b/theano/compile/function_module.py
@@ -428,9 +428,20 @@ class Function(object):
        # Reinitialize each container's 'provided' counter
        for c in self.input_storage:
            c.provided = 0
+
        # Set positional arguments
-        for i, arg in enumerate(args):
-            self[i] = arg
+        i = 0
+        for arg in args:
+            #TODO: provide a Param option for skipping the filter if we
+            #      really want speed.
+            s = self.input_storage[i]
+            if arg is None:
+                s.storage[0] = arg
+            else:
+                s.storage[0] = s.type.filter(arg, strict=s.strict)
+            s.provided += 1
+            i+=1
+
        # Set keyword arguments
        for k, arg in kwargs.iteritems():
            self[k] = arg
@@ -448,7 +459,9 @@ class Function(object):
                            self.inv_finder[c]))

        # Do the actual work
+        t0_fn = time.time()
        self.fn()
+        dt_fn = time.time() - t0_fn

        # Retrieve the values that were computed
        outputs = [x.data for x in self.output_storage]
@@ -486,6 +499,9 @@ class Function(object):
          self.maker.mode.fct_call_time[self.name] += dt_call
          self.maker.mode.fct_call[self.name] += 1

+        self.maker.mode.call_time += dt_call
+        self.maker.mode.fn_time += dt_fn
+        
        if self.return_none:
            return None
        elif self.unpack_single and len(outputs) == 1:

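Reviewer sketch of the two timers this hunk adds: ``call_time`` covers the whole call, including argument handling, while ``fn_time`` covers only the inner computation. ``TimedFunction`` is a hypothetical wrapper, and the ``float`` conversion is only a stand-in for the real input filtering.

```python
import time


class TimedFunction:
    """Accumulate total call time and inner-function time separately."""

    def __init__(self, fn):
        self.fn = fn
        self.call_time = 0.0  # whole call, including argument handling
        self.fn_time = 0.0    # inner computation only

    def __call__(self, *args):
        t0_call = time.perf_counter()
        checked = [float(a) for a in args]  # stand-in for input filtering
        t0_fn = time.perf_counter()
        result = self.fn(*checked)
        self.fn_time += time.perf_counter() - t0_fn
        self.call_time += time.perf_counter() - t0_call
        return result

    @property
    def overhead(self):
        """Time spent outside the inner function."""
        return self.call_time - self.fn_time


add = TimedFunction(lambda a, b: a + b)
for _ in range(1000):
    add(1, 2)
print('overhead fraction:', add.overhead / add.call_time)
```

For very small graphs the overhead fraction can dominate, which is the motivation for bypassing ``__setitem__`` for positional arguments.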
--- a/theano/compile/mode.py
+++ b/theano/compile/mode.py
@@ -172,6 +172,8 @@ class Mode(object):
        if isinstance(optimizer, gof.Query):
            self.provided_optimizer = optimizer
        self._optimizer = optimizer
+        self.call_time = 0
+        self.fn_time = 0

    def __str__(self):
        return "Mode(linker = %s, optimizer = %s)" % (self.provided_linker, self.provided_optimizer)

--- a/theano/compile/profilemode.py
+++ b/theano/compile/profilemode.py
 import time, atexit, copy

-from theano.gof.link import WrapLinkerMany
+from theano.gof.link import WrapLinker
 from theano.gof.cutils import run_cthunk
 from theano.compile.mode import Mode, register_mode, predefined_modes, predefined_linkers, predefined_optimizers, default_linker, default_optimizer
 from theano.gof.cc import OpWiseCLinker
 from theano.gof.python25 import any
 from theano import gof
 from theano.configparser import config, AddConfigVar, IntParam
+from theano.compile.function_module import FunctionMaker

 import_time = time.time()

@@ -18,44 +19,57 @@ AddConfigVar('ProfileMode.n_ops_to_print',
        "Number of ops to print by default",
        IntParam(20, lambda i: i > 0))

+class Profile_Maker(FunctionMaker):
+    def create(self, input_storage=None, trustme=False):
+        ret = super(Profile_Maker,self).create(input_storage, trustme)
+        for i, node in enumerate(ret.maker.env.toposort()):
+            self.mode.apply_time[(i,node.op)]=0.0
+            self.mode.apply_call[(i,node.op)]=0
+#            self.mode.op_cimpl[node.op] = 
+
+        return ret

 class ProfileMode(Mode):
    def __init__(self, linker=default_linker, optimizer=default_optimizer):
        local_time = [0.0]
        apply_time = {}
        apply_call = {}
-        op_time = {}
        op_cimpl = {}
-        op_call = {}
        compile_time = 0 #time passed in theano.function()
        fct_call_time = {}#time passed inside theano fct call including op time.
        fct_call = {}

        self.__setstate__((linker, optimizer, local_time,
                           apply_time, apply_call,
-                           op_time, op_cimpl, op_call, 
+                           op_cimpl,
                           compile_time, fct_call_time, fct_call))

+    def function_maker(self, i,o,m, *args, **kwargs):
+        """Return an instance of `Profile_Maker`, which initializes the counters."""
+
+        assert m is self
+        return Profile_Maker(i, o, self, *args, **kwargs)
+
    def __getstate__(self):
        #print "__getstate__",self.provided_linker,self.provided_optimizer
        return (self.provided_linker, self.provided_optimizer, self.local_time,
                self.apply_time, self.apply_call,
-                self.op_time, self.op_cimpl, self.op_call, self.compile_time, self.fct_call_time, self.fct_call)
+                self.op_cimpl, self.compile_time, self.fct_call_time, self.fct_call)

    def __setstate__(self, (linker, optimizer, local_time,
                            apply_time, apply_call,
-                            op_time, op_cimpl, op_call, 
+                            op_cimpl,
                            compile_time, fct_call_time, fct_call)):
        
        self.local_time = local_time
        self.apply_time = apply_time
        self.apply_call = apply_call
-        self.op_time = op_time
        self.op_cimpl = op_cimpl
-        self.op_call = op_call
        self.compile_time = compile_time
        self.fct_call_time = fct_call_time
        self.fct_call = fct_call
+        self.call_time = 0
+        self.fn_time = 0

        def blah(i, node, th):
            if hasattr(th, 'cthunk'):
@@ -63,7 +77,7 @@ class ProfileMode(Mode):
                failure = run_cthunk(th.cthunk)
                dt = time.time() - t0
                if failure:
-                    raise RuntimeError(('A C Op raised an exception.  PerformLinker cannot' 
+                    raise RuntimeError(('A C Op raised an exception.  PROFILE_MODE cannot' 
                        ' tell you what it was though.  Use a standard mode such as'
                        ' FAST_RUN_NOGC to correct the problem.'))
            else:
@@ -72,11 +86,9 @@ class ProfileMode(Mode):
                dt = time.time() - t0

            local_time[0] += dt
-            apply_time[(i,node.op)] = apply_time.get((i,node.op), 0.0) + dt
-            apply_call[(i,node.op)] = apply_call.get((i,node.op), 0) + 1
-            op_time[node.op] = op_time.get(node.op, 0.0) + dt
+            apply_time[(i,node.op)] += dt
+            apply_call[(i,node.op)] += 1
            op_cimpl[node.op] = hasattr(th, 'cthunk')
-            op_call[node.op] = op_call.get(node.op,0) + 1

        
        self.provided_linker = linker
@@ -84,7 +96,7 @@ class ProfileMode(Mode):
        if isinstance(linker, str) or linker is None:
            linker = predefined_linkers[linker]

-        linker = WrapLinkerMany([linker], [blah])
+        linker = WrapLinker([linker], blah)
            
        self.linker = linker
        if isinstance(optimizer, str) or optimizer is None:
@@ -113,18 +125,11 @@ class ProfileMode(Mode):
        fct_call = self.fct_call
        apply_time = self.apply_time
        apply_call = self.apply_call
-        op_time = self.op_time
-        op_call = self.op_call
        op_cimpl = self.op_cimpl

-        op_flops = {}
-        for a,t in op_time.items():
-            if hasattr(a,'flops'):
-                op_flops[a]=a.flops*op_call[a]/t/1e6
-
        self.print_summary_("print_summary",local_time, compile_time, fct_call_time, fct_call,
-                            apply_time, apply_call, op_time, op_call, op_cimpl,
-                            op_flops, n_apply_to_print, n_ops_to_print)
+                            apply_time, apply_call, op_cimpl,
+                            n_apply_to_print, n_ops_to_print)


    def print_diff_summary(self, other, n_apply_to_print=15, n_ops_to_print=20):
@@ -153,42 +158,23 @@ class ProfileMode(Mode):
                r[a]+=t
            return r
        
-        def diff_dict_flops(a_time,b_time_,a_call,b_call):
-            flops = {}
-            b_time = copy.copy(b_time_)
-            for a,ta in a_time.items():
-                tb = b_time.pop(a,0)
-                if hasattr(a,'flops'):
-                    flops[a]=a.flops*a_call[a]/ta - a.flops*b_call[a]/tb/1e6
-                
-            #they are missing in a
-            for b,tb in b_time.items():
-                if hasattr(b,'flops'):
-                    flops[b]=b.flops*b_call[b]/tb/1e6
-
-            return flops
-
        local_time = self.local_time[0]-other.local_time[0]
        compile_time = self.compile_time-other.compile_time
        fct_call_time = diff_dict(self.fct_call_time,other.fct_call_time)
        fct_call = diff_dict(self.fct_call,other.fct_call)
        apply_time = diff_dict(self.apply_time, other.apply_time)
        apply_call = diff_dict(self.apply_call, other.apply_call)
-        op_time = diff_dict(self.op_time, other.op_time)
-        op_call = diff_dict(self.op_call, other.op_call)
        op_cimpl = self.op_cimpl and other.op_cimpl

-        op_flops = diff_dict_flops(self.op_time, other.op_time, self.op_call, other.op_call)
-        
        self.print_summary_("print_diff_summary",local_time, compile_time, fct_call_time, fct_call,
-                            apply_time, apply_call, op_time, op_call, op_cimpl,
-                            op_flops, n_apply_to_print=n_apply_to_print,
+                            apply_time, apply_call, op_cimpl,
+                            n_apply_to_print=n_apply_to_print,
                            n_ops_to_print=n_ops_to_print, print_apply=False)

    @staticmethod
    def print_summary_(fct_name, local_time, compile_time, fct_call_time, fct_call,
-                       apply_time, apply_call, op_time, op_call, op_cimpl,
-                       op_flops=None, n_apply_to_print=15, n_ops_to_print=20, print_apply=True):
+                       apply_time, apply_call, op_cimpl,
+                       n_apply_to_print=15, n_ops_to_print=20, print_apply=True):
        """
        do the actual printing of print_summary and print_diff_summary.

@@ -218,6 +204,19 @@ class ProfileMode(Mode):
                      sum(f for f, t, a, nb_call in atimes[n_apply_to_print:])*100,
                      sum(t for f, t, a, nb_call in atimes[n_apply_to_print:]))

+        op_time = {}
+        op_call = {}
+        for (i,a),t in apply_time.items():
+            op_time.setdefault(a,0)
+            op_call.setdefault(a,0)
+            op_time[a]+=t
+            op_call[a]+=apply_call[(i,a)]
+
+        op_flops = {}
+        for a,t in op_time.items():
+            if hasattr(a,'flops'):
+                op_flops[a]=a.flops*op_call[a]/t/1e6
+
        flops_msg=''
        if op_flops:
            flops_msg=' <MFlops/s>'

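Reviewer sketch of the aggregation that ``print_summary_`` now performs: per-apply timings keyed by ``(position, op)`` are folded into per-op totals, and an MFlops/s rate is derived from them. The function names and the string ops are hypothetical stand-ins. The ``seconds > 0`` guard is an addition of this sketch, since counters that start at zero could otherwise cause a division by zero for ops that were never run.

```python
def aggregate_by_op(apply_time, apply_call):
    """Fold {(position, op): value} dicts into per-op totals."""
    op_time = {}
    op_call = {}
    for (position, op), seconds in apply_time.items():
        op_time[op] = op_time.get(op, 0.0) + seconds
        op_call[op] = op_call.get(op, 0) + apply_call[(position, op)]
    return op_time, op_call


def mflops_per_second(op_time, op_call, flops_per_call):
    """Compute flops * calls / seconds / 1e6 for ops with a known flop count."""
    rates = {}
    for op, seconds in op_time.items():
        # Skip ops without a flop count and ops that were never timed.
        if op in flops_per_call and seconds > 0:
            rates[op] = flops_per_call[op] * op_call[op] / seconds / 1e6
    return rates


apply_time = {(0, 'dot'): 0.5, (1, 'exp'): 0.25, (2, 'dot'): 1.5}
apply_call = {(0, 'dot'): 10, (1, 'exp'): 10, (2, 'dot'): 30}
op_time, op_call = aggregate_by_op(apply_time, apply_call)
rates = mflops_per_second(op_time, op_call, {'dot': 2e6})
print(op_time, op_call, rates)
```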
--- a/theano/compile/tests/test_debugmode.py
+++ b/theano/compile/tests/test_debugmode.py
@@ -544,35 +544,20 @@ class Test_check_isfinite(unittest.TestCase):
        theano.tensor.TensorType.filter_checks_isfinite = self.old_val

    def test_check_isfinite(self):
-        x = theano.tensor.dvector()
+        x = theano.tensor.vector()
        f = theano.function([x], (x+2) * 5, mode='DEBUG_MODE')
+        g = theano.function([x], theano.tensor.log(x), mode='DEBUG_MODE')

        # this should work
        f(numpy.log([3, 4, 5]))

-        # this should raise InvalidValueError
-        try:
-            # insert a NaN
-            f(numpy.log([3, -4, 5]))
-            assert False
-        except debugmode.InvalidValueError:
-            pass
-
-        # this should raise InvalidValueError
-        try:
-            # insert an Nan and Inf
-            f(numpy.asarray([0, 1.0, 0])/0)
-            assert False
-        except debugmode.InvalidValueError:
-            pass
+        # passing an invalid value as an input should trigger ValueError
+        self.failUnlessRaises(ValueError, f, numpy.log([3, -4, 5]))
+        self.failUnlessRaises(ValueError, f, numpy.asarray([0, 1.0, 0])/0)
+        self.failUnlessRaises(ValueError, f, numpy.asarray([1.0, 1.0, 1.0])/0)

-        # this should raise InvalidValueError
-        try:
-            # insert several Inf
-            f(numpy.asarray([1.0, 1.0, 1.0])/0)
-            assert False
-        except debugmode.InvalidValueError:
-            pass
+        # generating an invalid value internally should trigger InvalidValueError
+        self.failUnlessRaises(debugmode.InvalidValueError, g, [3,-4,5])

        # this should disable the exception
        theano.tensor.TensorType.filter_checks_isfinite = False

--- a/theano/configparser.py
+++ b/theano/configparser.py
@@ -14,11 +14,12 @@ THEANO_FLAGS=os.getenv("THEANO_FLAGS","")
 # [section.]option[=value] entries. If the section part is omitted, there should be only one
 # section that contains the given option.

-# THEANORC=~/.theanorc:~lisa/.theanorc
+# THEANORC can contain a colon-delimited list of config files, like
+# THEANORC=~lisa/.theanorc:~/.theanorc
+# In that case, definitions in files on the right (here, ~/.theanorc) have
+# precedence over those in files on the left.
 def config_files_from_theanorc():
    rval = [os.path.expanduser(s) for s in os.getenv('THEANORC', '~/.theanorc').split(':')]
-    rval.reverse()
-    print "THEANORC", rval
    return rval
 theano_cfg = ConfigParser.SafeConfigParser()
 theano_cfg.read(config_files_from_theanorc())
@@ -42,14 +43,15 @@ def fetch_val_for_key(key):
    """Return the overriding config value for a key.
    A successful search returns a string value.
    An unsuccessful search raises a KeyError
-    
-    The priority order is:
+
+    The (decreasing) priority order is:
    - THEANO_FLAGS
    - ~/.theanorc
-    
+
    """

    # first try to find it in the FLAGS
+    rval = None
    for name_val in THEANO_FLAGS.split(','):
        if not name_val:
            continue
@@ -60,7 +62,12 @@ def fetch_val_for_key(key):
            name, val = name_val_tuple

        if name == key:
-            return val
+            # rval might be overridden by a later definition in THEANO_FLAGS
+            rval = val
+
+    # If an rval is found, it should be a string
+    if rval is not None:
+        return rval

    # next try to find it in the config file

@@ -77,7 +84,7 @@ def fetch_val_for_key(key):
        return theano_cfg.get(section, option)
    except (ConfigParser.NoOptionError, ConfigParser.NoSectionError):
        raise KeyError(key)
-    
+
 class TheanoConfigParser(object):
    #properties are installed by AddConfigVar

@@ -143,7 +150,7 @@ class ConfigParam(object):
            self.val = val

    deleter=None
-    
+
 class EnumStr(ConfigParam):
    def __init__(self, default, *options):
        self.default = default

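Reviewer sketch of why dropping ``rval.reverse()`` gives right-most precedence: a config parser's ``read`` processes files in list order, so values from later files replace earlier ones. The example uses Python 3's ``configparser`` rather than the Python 2 ``ConfigParser`` module in the patch, and the file names, the ``[global]`` section and the option values are illustrative assumptions.

```python
import configparser
import os
import tempfile


def read_merged(paths):
    """Read config files in order; later files override earlier ones."""
    parser = configparser.ConfigParser()
    parser.read(paths)
    return parser


with tempfile.TemporaryDirectory() as tmp:
    system_rc = os.path.join(tmp, 'system_theanorc')
    personal_rc = os.path.join(tmp, 'personal_theanorc')
    with open(system_rc, 'w') as f:
        f.write('[global]\ndevice = cpu\nfloatX = float64\n')
    with open(personal_rc, 'w') as f:
        f.write('[global]\ndevice = gpu0\n')

    # Same order as THEANORC=<system file>:<personal file>
    merged = read_merged([system_rc, personal_rc])
    device = merged.get('global', 'device')
    floatx = merged.get('global', 'floatX')

print(device, floatx)
```

Options that only the left-most file defines (here ``floatX``) still survive the merge.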
--- a/theano/gof/type.py
+++ b/theano/gof/type.py
@@ -222,7 +222,7 @@ class PureType(object):
        try:
            self.filter(a, True)
            return True
-        except TypeError:
+        except (TypeError, ValueError):
            return False
    
    def make_variable(self, name = None):

--- a/theano/misc/safe_asarray.py
+++ b/theano/misc/safe_asarray.py
@@ -18,18 +18,22 @@ def _asarray(a, dtype=None, order=None):

    Currently, this issue has only been causing trouble when the target
    data type is 'int32', on some computers. As a result, this is the only
-    situation where we do more than a simple call to ``numpy.asarray``. If it
-    turns out that a similar problem can occur for more data type, this
+    situation where we may do more than a simple call to ``numpy.asarray``. If
+    it turns out that a similar problem can occur for more data types, this
    function should be updated accordingly.

    This function's name starts with a '_' to indicate that it is meant to be
    used internally. It is imported so as to be available directly through
        theano._asarray
    """
+    dtype = numpy.dtype(dtype)  # Convert into dtype object.
    rval = numpy.asarray(a, dtype=dtype, order=order)
-    if dtype is numpy.int32 or dtype == 'int32':
-        # Make sure the type is properly set to the correct type.
-        return rval.view(dtype=numpy.int32)
+    numpy_int32 = numpy.dtype(numpy.int32)
+    if (dtype is numpy_int32 and rval.dtype is not numpy_int32):
+        # Enforce the numpy.int32 dtype.
+        return rval.view(dtype=numpy_int32)
    else:
        # Using ``numpy.asarray`` should work just fine.
+        # Debug assert if we want to detect other failure cases (untested):
+        # assert rval.dtype is dtype
        return rval
--- a/theano/sandbox/conv.py
+++ b/theano/sandbox/conv.py
@@ -5,7 +5,17 @@ from theano import gof, Op, tensor, config
 from theano.printing import Print

 def getFilterOutShp(inshp, kshp, (dx,dy)=(1,1), mode='valid'):
-    """Returns numpy ndarray of len 2
+    """
+    Computes the shape (nb_rows, nb_col) of each output image.
+
+    :type inshp: tuple, list or 1D ndarray of length 2
+    :param inshp: shape of each (2D) input image
+    :type kshp: tuple, list or 1D ndarray of length 2
+    :param kshp: shape of each (2D) kernel filter
+    :type mode: string
+    :param mode: 'valid' or 'full' (see 'border_mode' in conv2d's doc)
+    :rtype: numpy 1D ndarray of len 2
+    :return: shape of each output "image" (or feature map)
    """
    if mode=='valid': s = -1
    else: s = 1
@@ -28,10 +38,12 @@ def conv2d(input, filters, border_mode='valid', subsample=(1,1),
    :param filters: tensor containing filters for convolutional neural net.
    Indexing is: (filter, filter input feature map, filter row, filter col).
    :type border_mode: string
-    :param border_mode:'valid'(only apply kernel over complete patch of the image)
-                       or 'full'(padd the image with 0 and apply the kernel over all full patch and partial patch of the image
+    :param border_mode: 'valid' (only apply the kernel over complete patches of the image) or
+    'full' (pad the image with 0 and apply the kernel over all full and partial patches of
+    the image)
    :type subsample: tuple of len 2
-    :param subsample: how many pixel we move in the (row,col) direction of the image when we change of patch
+    :param subsample: how many pixels we move in the (row, col) direction of the image when
+    we move to the next patch
    :type image_shape: tuple of len 4
    :param image_shape: (batch size, stack size, nb row, nb col)
    :type filter_shape: tuple of len 4
@@ -60,18 +72,18 @@ def conv2d(input, filters, border_mode='valid', subsample=(1,1),

 class ConvOp(Op):
    """
-    A convolution op that should extend scipy.signal.convolve2d, but much faster!
+    A convolution op that should behave like scipy.signal.convolve2d,
+    but much faster!
    """

-
-    
    __attrnames = ['imshp', 'kshp', 'nkern', 'bsize', 'dx', 'dy', 'out_mode', 
            'unroll_batch', 'unroll_kern', 'unroll_patch',
            'imshp_logical', 'kshp_logical', 'kshp_logical_top_aligned']
    """These attributes uniquely identify the behaviour of this op for given inputs"""

-    def __init__(self, imshp=None, kshp=None, nkern=None, bsize=None, dx=None, dy=None, output_mode='valid',
-            unroll_batch=0,
+    def __init__(self, imshp=None, kshp=None, nkern=None, bsize=None, 
+            dx=None, dy=None,
+            output_mode='valid', unroll_batch=0,
            unroll_kern=0,
            unroll_patch=True,
            imshp_logical=None,
@@ -80,7 +92,12 @@ class ConvOp(Op):
            verbose=0,
            version=-1):
        """
-        This Op implement the convolution of a kernel(tensor 4d,(nkern, stacksize, nb row, nb col)) on an image(tensor 4d, (batchsize, stacksize, nb row, nb col). The batch size is multiple image that we want to apply the same kernel over. The nkern is numtiple kernel that we want to apply to each image. The stack size is mostly used when their is multiple layer in the network. It is the sum of the convolution of multiple 2d image and kernel.
+        This Op implements the convolution of a kernel (4D tensor, (nkern, stacksize, nb row,
+        nb col)) on an image (4D tensor, (batchsize, stacksize, nb row, nb col)). The batch
+        size is the number of images that we want to apply the same kernels to. nkern is the
+        number of kernels that we want to apply to each image. The stack size is mostly used
+        when there are multiple layers in the network. The result is the sum, over the stack,
+        of the convolutions of the 2D images and kernels.

        The reason that this op does the summation over convolutions within the 'stack' is that
        it allows us to be memory-efficient about how gradients are calculated.  If, for
@@ -89,14 +106,22 @@ class ConvOp(Op):
        point) then we would have to sum over a potentially very large tensor to get the
        gradient on the filters.

-
-        If the imshp, kshp, nkern and bsize are provided, we can generate more optimal code. This make a significant difference for the full mode with unroll_patch version.
-        The most frequent faster code currently available on 64_x86 computer is unroll_batch=4, unroll_kern=4, unroll_patch=False and this request that all the optional shape information are gived. Those number are empirically tested and backed up by the article: Anatomy of High-Performance Matrix Multiplication by Kazushige Goto and Robert A. Van De Geijn, ACM Transactions on Mathematical Software, vol 34, No. 3, article 12, May 2008. It is in figure 12, it give the value mr x nr, those value are the optimum to use for unroll_batch and unroll_kern. For x86_64 bits computer it is 4x4. Other architecture can have different value.(2x4 for x86, 8x8 for itanium,...)
+        If imshp, kshp, nkern and bsize are provided, we can generate more optimal code.
+        This makes a significant difference for the full mode with the unroll_patch version.
+        The code that is most often fastest on x86_64 computers currently uses
+        unroll_batch=4, unroll_kern=4, unroll_patch=False, and this requires that all the
+        optional shape information is given. Those numbers were tested empirically and are
+        backed up by the article: Anatomy of High-Performance Matrix Multiplication by
+        Kazushige Goto and Robert A. Van De Geijn, ACM Transactions on Mathematical Software,
+        vol 34, No. 3, article 12, May 2008. Figure 12 of that article gives the values
+        mr x nr, which are the optimal values to use for unroll_batch and unroll_kern. For
+        x86_64 computers it is 4x4. Other architectures can have different values (2x4 for
+        x86, 8x8 for Itanium, ...).

        :type out_mode: string
-        :param out_mode: 'valid'(give an output smaller then the image, 'full'(give an output bigger then the image)
+        :param out_mode: 'valid' (gives an output smaller than the image) or 'full' (gives an
+        output bigger than the image)

-        optional parameter(if provided will be used to generate more optinal c code):
+        optional parameters: (will generate more optimal c code)
        
        :type imshp: tuple of len 2 or 3: 2 for 2d image, 3 for a stack of 2d images.
        :param imshp: (stacksize, nb image row, nb image col)
@@ -113,13 +138,17 @@ class ConvOp(Op):

        param to select the version of code used:
        :type unroll_patch: bool
-        :param unroll_patch: use a version of c_code that unroll the patch loop that don't request all shape information to work, but if all shape information are present, will use it to hardcode the value in the code for faster code.
-
+        :param unroll_patch: use a version of c_code that unrolls the patch loop. It does not
+        require all shape information to work, but if all shape information is present, it
+        is hardcoded into the generated code to make it faster.
        :type unroll_batch:int
-        :param unroll_batch: use a version of c_code that unroll the batch(by unroll_batch) and the nkern(by unroll_kern) loop. The size must by a multiple of bsize or nkern respectively.
+        :param unroll_batch: use a version of c_code that unrolls the batch loop (by
+        unroll_batch) and the nkern loop (by unroll_kern). unroll_batch must be 0 or a
+        divisor of bsize.
        :type unroll_kern:int
-        :param unroll_kern: use a version of c_code that unroll the batch(by unroll_batch) and the nkern(by unroll_kern) loop. The size must by a multiple of bsize or nkern respectively.
-
+        :param unroll_kern: use a version of c_code that unrolls the batch loop (by
+        unroll_batch) and the nkern loop (by unroll_kern). unroll_kern must be 0 or a
+        divisor of nkern.
        :type verbose: int
        :param verbose: passed to GpuConv
        :type version: int
@@ -130,26 +159,34 @@ class ConvOp(Op):
        :param kshp_logical_top_aligned: idem
        
        """
-        all_shape = imshp is not None and kshp is not None and nkern is not None and bsize is not None
+        all_shape = imshp is not None and kshp is not None and \
+                    nkern is not None and bsize is not None
+
        if (unroll_batch>0 or unroll_kern>0) and not all_shape:
            raise Exception("In ConvOp, when using unroll_batch and unroll_nkern, all shape are needed")
-        if not all_shape and (imshp is not None or kshp is not None or nkern is not None or bsize is not None):
-            print "OPTIMISATION WARNING: passing only a few shape to ConvOp for faster code is useless. We use all of them or none."
+
+        if not all_shape and (imshp is not None or kshp is not None \
+                or nkern is not None or bsize is not None):
+            print "OPTIMISATION WARNING: passing only some of the shapes to ConvOp "\
+                  "for faster code is useless. We use all of them or none."
        
        if not all_shape:
            unroll_patch = True

        if imshp is not None:
            imshp = tuple(imshp)
+
            if len(imshp)==2:
                imshp = (1,)+imshp
            elif len(imshp)==3:
                imshp = imshp
            else:
                raise Exception("bad len for imshp")
+
        self.imshp = imshp
        if kshp is not None:
            kshp = tuple(kshp)
+
        self.kshp = kshp
        self.nkern = nkern
        self.bsize=bsize
@@ -157,10 +194,12 @@ class ConvOp(Op):
        self.dy=dy
        self.verbose=verbose
        self.version=version
+
        # a triple
        self.imshp_logical = self.imshp
        if imshp_logical is not None: self.imshp_logical = tuple(imshp_logical)
-        assert (self.imshp is None and self.imshp_logical is None) or (len(self.imshp) == len(self.imshp_logical))
+        assert (self.imshp is None and self.imshp_logical is None) or \
+               (len(self.imshp) == len(self.imshp_logical))

        # a pair
        self.kshp_logical = self.kshp
@@ -172,6 +211,7 @@ class ConvOp(Op):
        self.unroll_patch=unroll_patch

        if self.unroll_batch>0 and self.bsize % self.unroll_batch!=0:
+
            if self.bsize<=self.unroll_batch:
                self.unroll_batch = self.bsize
            else:
@@ -181,9 +221,15 @@ class ConvOp(Op):
                while self.bsize % new!=0:
                    new-=1

-                print "OPTIMISATION WARNING: in ConvOp.__init__() unroll_batch(%s) must be 0 or a divisor of bsize(%s). We revert it to %d. This won't change the result, but may make it slower."%(str(self.unroll_batch),str(self.bsize),new)
+                print "OPTIMISATION WARNING: in ConvOp.__init__() unroll_batch(%s) "\
+                      "must be 0 or a divisor of bsize(%s). We revert it to %d. This "\
+                      "won't change the result, but may make it slower."%\
+                      (str(self.unroll_batch),str(self.bsize),new)
+
                self.unroll_batch=new
+
        if self.unroll_kern>0 and self.nkern % unroll_kern!=0:
+
            if self.nkern<=self.unroll_kern:
                self.unroll_kern = self.nkern
            else:
@@ -192,22 +238,29 @@ class ConvOp(Op):
                assert(new>=1)
                while self.nkern % new!=0:
                    new-=1
-                print "OPTIMISATION WARNING: in ConvOp.__init__() unroll_kern(%s) should be 0 or a divisor of nkern(%s)We revert it to %d. This won't change the result, but may make it slower."%(str(self.unroll_kern),str(self.nkern),new)
+                print "OPTIMISATION WARNING: in ConvOp.__init__() unroll_kern(%s) "\
+                      "should be 0 or a divisor of nkern(%s). We revert it to %d. "\
+                      "This won't change the result, but may make it slower."\
+                      %(str(self.unroll_kern),str(self.nkern),new)
                self.unroll_kern=new
+
        if all_shape:
            self.outshp = getFilterOutShp(self.imshp_logical, self.kshp_logical, (dx,dy), output_mode)
            self.fulloutshp = getFilterOutShp(self.imshp_logical, self.kshp_logical, (1,1), output_mode)
        else:
            self.outshp = None
            self.fulloutshp = None
+
        self.out_mode = output_mode
+
        if not self.out_mode in ["valid", "full"]:
            raise Exception("Mode %s not implemented"%self.out_mode)
       
        if all_shape and not (self.outshp > 0).all():
-            raise Exception(("Bad size for the output shape. Verify that [post-supersampling] input shape (%s)"
-                "and kern shape(%s) are ok. (hint: kerns must fit inside image in"
-                "'valid' mode)")%(self.imshp_logical,self.kshp_logical))
+            raise Exception(("Bad size for the output shape. Verify that [post-"\
+                    "supersampling] input shape (%s) and kern shape(%s) are ok. "\
+                    "(Hint: kerns must fit inside image in valid mode)")%
+                    (self.imshp_logical,self.kshp_logical))

        self._rehash()
        if config.op.set_flops:
@@ -244,11 +297,16 @@ class ConvOp(Op):
            self.flops*=self.outshp[0]*self.outshp[1]#nb flops by output image
            self.flops*=self.imshp[0]*self.nkern*self.bsize#for all outputs images#n_stack==self.imshp[0]
        else: #full mode not implemented
+
            self.flops=0
            for out_row in range(self.outshp[0]):#loop over output row
                for out_col in range(self.outshp[0]):#loop over output col
                    for row in range(self.kshp[0]):#loop over kern row
-                        if row+out_row-self.kshp[0]+1<0 or row+out_row-self.kshp[0]+1>=self.imshp[1]: continue
+
+                        if (row+out_row-self.kshp[0]+1<0 or 
+                            row+out_row-self.kshp[0]+1>=self.imshp[1]): 
+                            continue
+
                        col=0
                        max_col=self.kshp[1]
                        img_col=out_col-self.kshp[1]+1
@@ -263,7 +321,8 @@ class ConvOp(Op):
            
            self.flops*=self.imshp[0]*self.nkern*self.bsize#for all outputs images#n_stack==self.imshp[0]
            
-            assert self.flops==self.bsize * self.nkern * self.imshp[0] * self.kshp[0] * self.kshp[1] * self.imshp[1] * self.imshp[2] * 2
+            assert self.flops == self.bsize * self.nkern * self.imshp[0] * \
+                    self.kshp[0] * self.kshp[1] * self.imshp[1] * self.imshp[2] * 2

    def make_node(self, inputs, kerns):
        # TODO: find a way to make ConvOp work for N-D (after NIPS09)
@@ -375,21 +434,25 @@ class ConvOp(Op):
    def grad(self, (inputs, kerns), (gz,)):
        """
        In development. Works for test cases in test_sp.py
-        A few known issues:
-        * doesn't work for rectangular images or filters
-        * inputs needs to be a 4D tensor. Couldn't get 3D to work
-        * will crash if filter the same size as input image
+        WARNING: a few known issues:
+            * doesn't work for rectangular images or filters
+            * inputs needs to be a 4D tensor. Couldn't get 3D to work
+            * will crash if the filter is the same size as the input image
        """
        if self.imshp != self.imshp_logical or self.kshp != self.kshp_logical:
            raise NotImplementedError('todo')

        if self.dx!=1 or self.dy!=1:
-            raise Exception("ERROR: We disable ConvOp.grad now when dx!=1 or dy!=1 as we think their is a high probability of bug in it. We need to raise the error on the gradient to .1!")
+            raise Exception("ERROR: We disable ConvOp.grad now when dx!=1 or "\
+                    "dy!=1 as we think there is a high probability of a bug in it. "\
+                    "We need to raise the error on the gradient to .1!")

-        all_shape = self.imshp is not None and self.kshp is not None and self.nkern is not None and self.bsize is not None
+        all_shape = self.imshp is not None and self.kshp is not None and \
+                    self.nkern is not None and self.bsize is not None

        if not all_shape and (self.dx!=1 or self.dy!=1):
-            raise Exception("ConvOp.grad when dx!=1 or dy!=1 we must have all the optional shape information")
+            raise Exception("ConvOp.grad when dx!=1 or dy!=1 we must have all "\
+                            "the optional shape information")
        
        grad_hack_necessary = False
        if grad_hack_necessary:
@@ -411,6 +474,7 @@ class ConvOp(Op):
        kshp = None
        un_p = self.unroll_patch
        imshp_logical = None
+
        if self.out_mode == 'valid':
            (img, filters) = (newin, newgz)
            kshp_logical = self.fulloutshp
@@ -445,13 +509,17 @@ class ConvOp(Op):
                un_b = bsize
            else:
                un_b = 1
-                print "OPTIMISATION WARNING: in ConvOp.grad() we can't determine a good unroll value for the batch. Maybe you can optimize this!", bsize, un_b, self.unroll_batch, self.unroll_kern
+                print "OPTIMISATION WARNING: in ConvOp.grad() we can't determine "\
+                      "a good unroll value for the batch. Maybe you can optimize this!",\
+                      bsize, un_b, self.unroll_batch, self.unroll_kern
+
        if un_k!=0 and nkern%un_k!=0:
            if nkern<un_k:
                un_k = nkern
            else:
                un_k = 1
-                print "OPTIMISATION WARNING: in ConvOp.grad() we can't determine a good unroll value for the kernel. Maybe you can optimize this!"
+                print "OPTIMISATION WARNING: in ConvOp.grad() we can't determine "\
+                      "a good unroll value for the kernel. Maybe you can optimize this!"

        dw = ConvOp(imshp, kshp, nkern, bsize, 1,1, output_mode='valid',
                    unroll_batch=un_b, unroll_kern=un_k, unroll_patch=un_p,
@@ -460,9 +528,12 @@ class ConvOp(Op):
                    kshp_logical_top_aligned=kshp_logical_top_aligned,
                    version=self.version,
                    verbose=self.verbose)
+
        if hasattr(self,'flops'):
            dw.set_flops()
+
        dw = dw(img,filters)
+
        if all_shape:
            assert (dw.owner.op.outshp==self.kshp).all()
        if self.out_mode == 'valid':
@@ -472,18 +543,21 @@ class ConvOp(Op):

        ####### Determine gradient on inputs ########
        mode = 'valid'
-        if not self.out_mode == 'full': mode = 'full'
+        if self.out_mode != 'full':
+            mode = 'full'
+
        filters = kerns.dimshuffle((1,0,2,3))
        filters = filters[:,:,::-1,::-1]
        nkern = None
        imshp = None
        imshp_logical = None
        kshp = None
+
        if all_shape:
            nkern = self.imshp[0]
            imshp = (self.nkern, self.outshp[0], self.outshp[1])
            imshp_logical=(self.nkern, self.fulloutshp[0], self.fulloutshp[1])
-        #print 'din', imshp, self.kshp, nkern
+
        din = ConvOp(imshp, self.kshp, nkern, self.bsize, 
                     1,1, output_mode=mode,
                     unroll_batch=un_b, unroll_kern=un_k, unroll_patch=un_p,
@@ -491,10 +565,14 @@ class ConvOp(Op):
                     kshp_logical=None,
                     version=-1,#we we change the mode, we don't forward the version.
                     verbose=self.verbose)
+
        if hasattr(self,'flops'):
            din.set_flops()
+
        din = din(gz,filters)
-        assert (din.owner.op.outshp is None and self.imshp is None) or (din.owner.op.outshp==self.imshp[1:]).all()
+
+        assert (din.owner.op.outshp is None and self.imshp is None) or \
+               (din.owner.op.outshp==self.imshp[1:]).all()
        return [din, dw]

    def c_headers(self):
@@ -512,8 +590,10 @@ class ConvOp(Op):
 #define MOD %
 using namespace std;
 """ + tensor.blas.blas_header_text()
+
    def c_libraries(self):
        return tensor.blas.ldflags()
+
    def c_code(self, node, name, (img2d, filtersflipped), (z, ), sub):
        if node.inputs[0].type.dtype != node.inputs[1].type.dtype:
            raise NotImplementedError()
@@ -521,7 +601,8 @@ using namespace std;
        d=locals()
        d.update(sub)

-        all_shape = self.imshp is not None and self.kshp is not None and self.nkern is not None and self.bsize is not None
+        all_shape = self.imshp is not None and self.kshp is not None and \
+                    self.nkern is not None and self.bsize is not None

        d["self_out_mode"]=self.out_mode
        d["self_dx"]=self.dx
@@ -587,7 +668,7 @@ using namespace std;

        if self.unroll_patch:
            if self.verbose:
-                print "return unroll patch version",self.dx,self.dy
+                print "return unroll patch version. all_shape=", all_shape
            return _conv_op_code_unroll_patch%d
        if self.unroll_batch>0 or self.unroll_kern>0:
            if self.unroll_batch<=0: self.unroll_batch=1
@@ -607,44 +688,6 @@ using namespace std;
                print "return no gemm version"
            return _conv_op_code_a % d

-def convolve2(kerns, kshp, nkern, images, imshp, bsize, step=(1,1),
-              bias=None, mode='valid', **d):
-    """
-    param kerns: kernel tensor
-    param kshp:  tuple(kern row, kern wid)
-    param nkern: int the number of kernel
-    param images:image tensor
-    param imshp: tuple([stack size,] image row, image wid)
-    param bsize: batch size
-    param step:  subsampling to apply to the output tuple(row, wid)
-    param bias:  if True, will add a bias
-    param mode:  'valid' or 'full'
-    return:      tuple(theano graph with the output of ConvOp flattened to 2 dimensions, ?)
-    """
-    #TODO: remove the bias argument from this function because convolution has nothing to do with a bias
-
-    # if imshp, is a tuple, images contains one input dimension
-    if len(imshp)!=3:
-        nvis_dim = 1
-    else: nvis_dim = imshp[0]
-
-    # all these reshapes should happen in place
-    imrshp   = tensor.as_tensor([bsize] + list(imshp))
-    imtensor = tensor.reshape(images, imrshp)
-
-    kernrshp   = tensor.as_tensor([nkern, nvis_dim] + list(kshp))
-    kerntensor = tensor.reshape(kerns, kernrshp)
- 
-    convop = ConvOp(imshp, kshp, nkern, bsize, step[0], step[1],
-                    output_mode=mode, **d)
-    convout = convop(imtensor, kerntensor)
-   
-    if bias:
-        biastensor = tensor.DimShuffle((False,), ('x',0,'x','x'), inplace=True)(bias)
-        convout = convout + biastensor
-        
-    rval = tensor.flatten(convout, 2)
-    return rval, N.hstack((nkern, convop.outshp))

 _conv_op_code_a = """
 const int mode=%(mode)s;

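Note on the unroll fallback in ConvOp.__init__: when unroll_batch (or unroll_kern) does not divide bsize (or nkern), the constructor reverts to a smaller value that does. The sketch below is a standalone illustration of that fallback, not code from conv.py. `largest_divisor_at_most` is a hypothetical helper name, and the starting point of the countdown is an assumption, because the hunk boundary hides the line that initialises `new`.

```python
def largest_divisor_at_most(size, requested):
    """Return the unroll factor to use for a loop over `size` items.

    Hypothetical helper mirroring the fallback in ConvOp.__init__:
    0 means "no unrolling"; any other result always divides `size`.
    """
    if requested <= 0:
        return 0
    if size % requested == 0:
        # The requested factor already divides the loop size.
        return requested
    if size <= requested:
        # Cannot unroll by more than the loop size itself.
        return size
    new = requested  # assumption: the countdown starts at the requested value
    while size % new != 0:
        new -= 1
    return new
```

With bsize=6 and unroll_batch=4 this sketch yields 3, which is the kind of case in which the OPTIMISATION WARNING is printed.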
--- a/theano/sandbox/cuda/__init__.py
+++ b/theano/sandbox/cuda/__init__.py
@@ -5,14 +5,15 @@ from theano import config
 import logging, copy
 _logger_name = 'theano.sandbox.cuda'
 _logger = logging.getLogger(_logger_name)
-_logger.setLevel(logging.INFO)
-_logger.addHandler(logging.StreamHandler())
+_logger.setLevel(logging.WARNING)
+def error(*msg):
+    _logger.warning('ERROR (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def warning(*msg):
-    _logger.warning(_logger_name+'WARNING: '+' '.join(str(m) for m in msg))
+    _logger.warning('WARNING (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def info(*msg):
-    _logger.info(_logger_name+'INFO: '+' '.join(str(m) for m in msg))
+    _logger.warning('INFO (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def debug(*msg):
-    _logger.debug(_logger_name+'DEBUG: '+' '.join(str(m) for m in msg))
+    _logger.warning('DEBUG (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))


 # Compile cuda_ndarray.cu
@@ -63,23 +64,32 @@ if not compile_cuda_ndarray:
    except ImportError:
        compile_cuda_ndarray = True

-if compile_cuda_ndarray:
-    import nvcc_compiler
-    if not nvcc_compiler.is_nvcc_available():
-        set_cuda_disabled()
+try:
+    if compile_cuda_ndarray:
+        import nvcc_compiler
+        if not nvcc_compiler.is_nvcc_available():
+            set_cuda_disabled()

-    if enable_cuda:
-        code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()
+        if enable_cuda:
+            code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()

-        if not os.path.exists(cuda_ndarray_loc):
-            os.makedirs(cuda_ndarray_loc)
+            if not os.path.exists(cuda_ndarray_loc):
+                os.makedirs(cuda_ndarray_loc)

-        nvcc_compiler.nvcc_module_compile_str('cuda_ndarray', code, location = cuda_ndarray_loc,
-                                              include_dirs=[cuda_path], libs=['cublas'])
+            nvcc_compiler.nvcc_module_compile_str('cuda_ndarray', code, location = cuda_ndarray_loc,
+                                                  include_dirs=[cuda_path], libs=['cublas'])

-        from cuda_ndarray.cuda_ndarray import *
+            from cuda_ndarray.cuda_ndarray import *
+except Exception, e:
+    error( "Failed to compile cuda_ndarray.cu: %s" % str(e))
+    set_cuda_disabled()

 if enable_cuda:
+    # Check whether an old cuda_ndarray was loaded instead of the one we compiled.
+    import cuda_ndarray.cuda_ndarray
+    if os.path.join(config.compiledir,'cuda_ndarray','cuda_ndarray.so')!=cuda_ndarray.cuda_ndarray.__file__:
+        _logger.warning("WARNING: cuda_ndarray was loaded from %s. This is not expected as theano should compile it automatically for you. Do you have a directory called cuda_ndarray in your LD_LIBRARY_PATH environment variable? If so, please remove it as it is outdated!", cuda_ndarray.cuda_ndarray.__file__)
+
    from theano.sandbox.cuda.type import CudaNdarrayType
    from theano.sandbox.cuda.var import (CudaNdarrayVariable,
            CudaNdarrayConstant,
@@ -103,7 +113,7 @@ def use(device=config.device):
        raise ValueError("Invalid device identifier", device)
    if use.device_number is None:
        # No successful call to use() has been made yet
-        if device=="-1" or device=="CPU":
+        if device<0:
            return
        if device in [None,""]:
            device=0
@@ -134,6 +144,5 @@ def handle_shared_float32(tf):
    else:
        raise NotImplementedError('removing our handler')

-
 if enable_cuda and config.device.startswith('gpu'):
    use()
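Note on the logging helpers in theano/sandbox/cuda/__init__.py: they build their message with the `%` operator, which needs exactly one placeholder per element of the argument tuple. The sketch below illustrates that rule in isolation. `format_log` and `one_placeholder_two_args` are hypothetical names used only for this illustration.

```python
_logger_name = 'theano.sandbox.cuda'


def format_log(level_name, *msg):
    # One %s each for the level, the logger name, and the joined message parts.
    return '%s (%s): %s' % (level_name, _logger_name,
                            ' '.join(str(m) for m in msg))


def one_placeholder_two_args(*msg):
    # A single %s with a two-element tuple: the % operator raises TypeError
    # ("not all arguments converted during string formatting").
    return 'ERROR (%s): ' % (_logger_name, ' '.join(str(m) for m in msg))
```

The same rule applies when extra positional arguments are passed to a `logging` call: the message needs a matching `%s` for each of them.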
--- a/theano/sandbox/cuda/nvcc_compiler.py
+++ b/theano/sandbox/cuda/nvcc_compiler.py
@@ -6,6 +6,13 @@ from theano import config
 _logger=logging.getLogger("theano.sandbox.cuda.nvcc_compiler")
 _logger.setLevel(logging.WARN)

+from theano.configparser import config, AddConfigVar, StrParam
+
+AddConfigVar('nvcc.compiler_bindir',
+        "If defined, the nvcc compiler driver will look for g++ and gcc in this directory.",
+        StrParam(""))
+
+
 def error(*args):
    #sys.stderr.write('ERROR:'+ ' '.join(str(a) for a in args)+'\n')
    _logger.error("ERROR: "+' '.join(str(a) for a in args))
@@ -68,6 +75,8 @@ def nvcc_module_compile_str(module_name, src_code, location=None, include_dirs=[
    debug('Generating shared lib', lib_filename)
    # TODO: Why do these args cause failure on gtx285 that has 1.3 compute capability? '--gpu-architecture=compute_13', '--gpu-code=compute_13', 
    cmd = ['nvcc', '-shared', '-g'] + [pa for pa in preargs if pa.startswith('-O')]
+    if config.nvcc.compiler_bindir:
+        cmd.extend(['--compiler-bindir', config.nvcc.compiler_bindir])
    cmd.extend(['-Xcompiler', ','.join(pa for pa in preargs if not pa.startswith('-O'))])
    cmd.extend('-I%s'%idir for idir in include_dirs)
    cmd.extend(['-o',lib_filename]) 

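Note on nvcc.compiler_bindir: the hunk above adds `--compiler-bindir` to the nvcc command line only when the option is non-empty. The sketch below reproduces just the command-building steps visible in the hunk as a standalone function. `build_nvcc_cmd` and its parameters are hypothetical, and the real nvcc_module_compile_str appends further arguments that are not shown in this diff.

```python
def build_nvcc_cmd(preargs, include_dirs, lib_filename, compiler_bindir=""):
    """Assemble the part of the nvcc command line shown in the hunk.

    Hypothetical helper for illustration; it does not run nvcc.
    """
    # Optimisation flags (-O*) go directly to nvcc.
    cmd = ['nvcc', '-shared', '-g'] + [pa for pa in preargs
                                       if pa.startswith('-O')]
    # Only point nvcc at a specific g++/gcc directory when one is configured.
    if compiler_bindir:
        cmd.extend(['--compiler-bindir', compiler_bindir])
    # Every other flag is forwarded to the host compiler via -Xcompiler.
    cmd.extend(['-Xcompiler',
                ','.join(pa for pa in preargs if not pa.startswith('-O'))])
    cmd.extend('-I%s' % idir for idir in include_dirs)
    cmd.extend(['-o', lib_filename])
    return cmd
```

Leaving `compiler_bindir` empty produces the same command as before the change, so existing setups are unaffected.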
--- a/theano/sandbox/cuda/tests/test_basic_ops.py
+++ b/theano/sandbox/cuda/tests/test_basic_ops.py
@@ -140,20 +140,20 @@ def test_elemwise1():
    b = tensor.fmatrix()

    #let debugmode catch any mistakes
-    print >> sys.stderr, "STARTING FUNCTION 1"
+    print >> sys.stdout, "STARTING FUNCTION 1"
    f = pfunc([b], [], updates=[(a, b**a)], mode=mode_with_gpu)
    for i, node in enumerate(f.maker.env.toposort()):
        print i, node
    f(numpy.random.rand(*shape)+0.3)

-    print >> sys.stderr, "STARTING FUNCTION 2"
+    print >> sys.stdout, "STARTING FUNCTION 2"
    #let debugmode catch any mistakes
    f = pfunc([b], [], updates=[(a, tensor.exp(b**a))], mode=mode_with_gpu)
    for i, node in enumerate(f.maker.env.toposort()):
        print i, node
    f(numpy.random.rand(*shape)+0.3)

-    print >> sys.stderr, "STARTING FUNCTION 3"
+    print >> sys.stdout, "STARTING FUNCTION 3"
    #let debugmode catch any mistakes
    f = pfunc([b], [], updates=[(a, a+b * tensor.exp(b**a))], mode=mode_with_gpu)
    f(numpy.random.rand(*shape)+0.3)
@@ -169,11 +169,11 @@ def test_elemwise2():
        f = pfunc([b], [], updates=[(a, (a+b).dimshuffle(pattern))], mode=mode_with_gpu)
        has_elemwise = False
        for i, node in enumerate(f.maker.env.toposort()):
-            print >> sys.stderr, i, node
+            print >> sys.stdout, i, node
            has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
        assert not has_elemwise
        #let debugmode catch errors
-        print >> sys.stderr, 'pattern', pattern
+        print >> sys.stdout, 'pattern', pattern
        f(rng.rand(*shape)*.3)
    
    shape = (3,4,5,6)
@@ -204,7 +204,7 @@ def test_elemwise3():
        b**a).dimshuffle([2,0,3,1]))], mode=mode_with_gpu)
    has_elemwise = False
    for i, node in enumerate(f.maker.env.toposort()):
-        print >> sys.stderr, i, node
+        print >> sys.stdout, i, node
        has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
    assert not has_elemwise
    #let debugmode catch errors
@@ -220,7 +220,7 @@ def test_elemwise4():
    f = pfunc([b,c], [], updates=[(a, (a+b.dimshuffle('x', 0)*c.dimshuffle(0, 'x')))], mode=mode_with_gpu)
    has_elemwise = False
    for i, node in enumerate(f.maker.env.toposort()):
-        print >> sys.stderr, i, node
+        print >> sys.stdout, i, node
        has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
    assert not has_elemwise
    #let debugmode catch errors

--- a/theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py
+++ b/theano/sandbox/cuda/tests/test_conv_cuda_ndarray.py
@@ -360,7 +360,7 @@ def test_subsample():

 def test_logical_shapes():
    # implement when
-    print >> sys.stderr, "INFO: test_logical_shapes not implemented (i.e. imshp_logical, kshp_logical, kshp_logical_top_aligned)"
+    print >> sys.stderr, "WARNING TODO: test_logical_shapes not implemented (i.e. imshp_logical, kshp_logical, kshp_logical_top_aligned)"


 def _test_dummy():

--- a/theano/sandbox/cuda/tests/test_cuda_ndarray.py
+++ b/theano/sandbox/cuda/tests/test_cuda_ndarray.py
@@ -8,7 +8,7 @@ if cuda_ndarray.enable_cuda == False:
 import numpy

 def test_host_to_device():
-    print >>sys.stderr, 'starting test_host_to_dev'
+    print >>sys.stdout, 'starting test_host_to_dev'
    for shape in ((), (3,), (2,3), (3,4,5,6)):
        a = theano._asarray(numpy.random.rand(*shape), dtype='float32')
        b = cuda_ndarray.CudaNdarray(a)
@@ -53,7 +53,7 @@ def test_add():


 def test_exp():
-    print >>sys.stderr, 'starting test_exp'
+    print >>sys.stdout, 'starting test_exp'
    for shape in ((), (3,), (2,3), (1,10000000),(10,1000000), (100,100000),(1000,10000),(10000,1000)):
        a0 = theano._asarray(numpy.random.rand(*shape), dtype='float32')
        a1 = a0.copy()
@@ -74,25 +74,25 @@ def test_exp():


 def test_copy():
-    print >>sys.stderr, 'starting test_copy'
+    print >>sys.stdout, 'starting test_copy'
    shape = (5,)
    a = theano._asarray(numpy.random.rand(*shape), dtype='float32')

-    print >>sys.stderr, '.. creating device object'
+    print >>sys.stdout, '.. creating device object'
    b = cuda_ndarray.CudaNdarray(a)

-    print >>sys.stderr, '.. copy'
+    print >>sys.stdout, '.. copy'
    c = copy.copy(b)
-    print >>sys.stderr, '.. deepcopy'
+    print >>sys.stdout, '.. deepcopy'
    d = copy.deepcopy(b)

-    print >>sys.stderr, '.. comparisons'
+    print >>sys.stdout, '.. comparisons'
    assert numpy.allclose(a, numpy.asarray(b))
    assert numpy.allclose(a, numpy.asarray(c))
    assert numpy.allclose(a, numpy.asarray(d))

 def test_dot():
-    print >>sys.stderr, 'starting test_dot'
+    print >>sys.stdout, 'starting test_dot'
    a0 = theano._asarray(numpy.random.rand(4, 7), dtype='float32')
    a1 = theano._asarray(numpy.random.rand(7, 6), dtype='float32')

@@ -101,7 +101,7 @@ def test_dot():

    assert numpy.allclose(numpy.dot(a0, a1), cuda_ndarray.dot(b0, b1))

-    print >> sys.stderr, 'WARNING test_dot: not testing all 8 transpose cases of dot'
+    print >> sys.stderr, 'WARNING TODO test_dot: not testing all 8 transpose cases of dot'

 def test_sum():
    shape = (2,3)
@@ -147,7 +147,7 @@ def test_reshape():
             ]

    def subtest(shape_1, shape_2):
-        #print >> sys.stderr, "INFO: shapes", shape_1, shape_2
+        #print >> sys.stdout, "INFO: shapes", shape_1, shape_2
        a = theano._asarray(numpy.random.rand(*shape_1), dtype='float32')
        b = cuda_ndarray.CudaNdarray(a)


--- a/theano/sandbox/downsample.py
+++ b/theano/sandbox/downsample.py
@@ -147,7 +147,7 @@ class DownsampleFactorMaxGrad(Op):
    def c_code_cache_version(self):
        return ()

-                
+
 def max_pool2D(input, ds, ignore_border=False):
    """
    Takes as input a N-D tensor, where N >= 2. It downscales the input image by
@@ -166,7 +166,7 @@ def max_pool2D(input, ds, ignore_border=False):

    # extract image dimensions
    img_shape = input.shape[-2:]
-    
+
    # count the number of "leading" dimensions, store as dmatrix
    batch_size = tensor.prod(input.shape[:-2])
    batch_size = tensor.shape_padright(batch_size,1)

--- a/theano/sandbox/test_conv.py
+++ b/theano/sandbox/test_conv.py
@@ -7,7 +7,7 @@ from theano.tests import unittest_tools as utt

 from theano import function, Mode
 import theano.tensor as T
-from conv import ConvOp, convolve2, getFilterOutShp
+from conv import ConvOp, getFilterOutShp

 def flip(kern, kshp):
    "flip the kernel as scipy.convolv2d do it flipped."
@@ -41,7 +41,7 @@ def flip(kern, kshp):
 global_rng = N.random.RandomState(3423489)

 dmatrix4=T.TensorType('float64', (False, False, False, False))
-def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll_batch=0, unroll_kern=0, img=T.dmatrix(), validate=True, conv_op_py=False, do_convolve2=False, do_print=True, repeat=1, unroll_patch=0):
+def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll_batch=0, unroll_kern=0, img=T.dmatrix(), validate=True, conv_op_py=False, do_print=True, repeat=1, unroll_patch=False, unroll_patch_size=False, verbose=0):

        # build actual input images
        imgval = global_rng.rand(bsize, imshp[0], imshp[1], imshp[2])
@@ -92,41 +92,13 @@ def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll
                                imgval[b,i,...], w_flip[n,i,...],1,val, bval, 0)[0::ss[0],0::ss[1]]
                ntot += time.time() - time1

-            if do_convolve2:
-                ####### test with new sp.convolve2 function ######
-                time1 = time.time()
-                hid, outshp2 = convolve2(kern, kshp, nkern, img, imshp,  
-                                         bsize, (ss[0],ss[1]), mode=conv_mode)
-                propup = function([kern, img], hid)
-                propup1 = function([kern, img], hid,mode=Mode(linker="py"))
-
-                hidval  = propup(w_flip.reshape(nkern,-1), imgval.reshape(bsize,-1))
-                hidval  = hidval.reshape(bsize,nkern,outshp2[-2],outshp2[-1])
-#                hidval = hidval[:,:,::ss[0],::ss[1]]
-                hidval = hidval.reshape(bsize, -1)
-                for i in range(repeat):
-                    hidval1 = propup1(w_flip.reshape(nkern,-1), imgval.reshape(bsize,-1))
-                hidval1  = hidval1.reshape(bsize,nkern,outshp2[-2],outshp2[-1])
-#                hidval1  = hidval1[:,:,::ss[0],::ss[1]]
-                hidval1 = hidval1.reshape(bsize, -1)
-
-                assert (N.abs(hidval-hidval1)<1e-5).all()
-                temp = N.abs(outval.reshape(bsize,-1) - hidval)
-                if validate:
-                    assert (temp < 1e-5).all()
-
-            else:
-                hid = img #we don't need it, but it make the flow easier flow
-                hidval=outval.copy()#to keep the same memory
-                hidval1=outval.copy()
-
            # ConvOp
-            if unroll_patch:
+            if unroll_patch and not unroll_patch_size:
                conv_op = ConvOp(dx=ss[0],dy=ss[1], output_mode=conv_mode,
-                                 unroll_patch=unroll_patch)(inputs4, kerns4)
+                                 unroll_patch=unroll_patch, verbose=verbose)(inputs4, kerns4)
            else:
                conv_op = ConvOp(imshp, kshp, nkern, bsize, ss[0],ss[1], conv_mode,
-                                 unroll_batch=unroll_batch, unroll_kern=unroll_kern, unroll_patch=unroll_patch)(inputs4, kerns4)
+                                 unroll_batch=unroll_batch, unroll_kern=unroll_kern, unroll_patch=unroll_patch, verbose=verbose)(inputs4, kerns4)
            l1shp=N.hstack((nkern,
                            getFilterOutShp(imshp, kshp, ss, conv_mode)))
            propup2 = function([inputs4, kerns4], conv_op)
@@ -155,7 +127,7 @@ def exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp, kshps, nkerns, unroll
                temp = N.abs(outval - hidval3)
                assert (temp < 1e-5).all()

-            img, imshp = hid, tuple(outshp)
+            imshp = tuple(outshp)
            imgval = outval.reshape(bsize,outshp[0],outshp[1],outshp[2])

        return tctot, tpytot, ntot
@@ -246,23 +218,9 @@ class TestConvOp(unittest.TestCase):
 #                    print 'img2d', img2d
                    img1d = img2d.reshape(bsize,-1)

-                    # create filters (need to be flipped to use convolve2d)
+                    # create filters
                    filtersflipped = flip(filters.reshape((nkern,)+kshp), kshp)

-                    # compute with new convolve2 (no timing info)
-                    output4, outshp4  = convolve2(kerns, kshp, nkern, input,\
-                            imshp, bsize, (ss[0],ss[1]), bias=bias, mode=conv_mode)
-#                    print 'output4', output4
-
-                    ttime1 = time.time()
-                    f = function([kerns, bias, input], output4)
-                    out4 = f(filtersflipped.reshape(nkern,-1), biasvals, img1d)
-#                    print 'out4', out4, img1d, filtersflipped
-                    tconv2 += [time.time() - ttime1]
-                    out4 = out4.reshape(bsize, nkern, outshp4[1], outshp4[2])
-                    out4 = out4#[:,:,0::ss[0],0::ss[1]]
-                    out4 = out4.reshape(bsize, -1)
-
                    # compute with ConvOp
                    dmatrix3=T.TensorType('float64', (False, False, False))
                    inputs4=dmatrix4()
@@ -307,9 +265,6 @@ class TestConvOp(unittest.TestCase):
                    # compare benchmark with ConvOp
                    temp = bench1.flatten() - out2.flatten()
                    assert (temp < 1e-5).all()
-                    # compare benchmark with convolve2
-                    temp = bench1.flatten() - out4.flatten()
-                    assert (temp < 1e-5).all()
                    
        print '**** Convolution Profiling Results ****'
        print 'Scipy convolve2d processing time: %.3fs'%sum(tscipy),tscipy
@@ -319,55 +274,17 @@ class TestConvOp(unittest.TestCase):
        d=N.asarray(tscipy)/tconvop
        print 'speed up ConvOp vs convolve2d: %.3f'%d.mean(),d

-    def test_multilayer_conv(self):
-        print '\n\n*************************************************'
-        print '           TEST MULTILAYER CONVOLUTION' 
-        print '*************************************************'
-
-        # fixed parameters
-        # test multiple configuration at the same time
-        bsizes = [6,6] # batch size
-        imshp_starts = [(1,13,14),(1,4,5)]
-        kshpss = ([[5,6],[7,4]],[[2,2],[2,2]])
-        nkernss = [[20,40],[2,2]] # per output pixel
-        ssizess = [[(1,1),(1,2)],[(1,1),(2,2)]]
-        convmodes = ['valid','full']
-        do_convolve2=True
-        unroll = [(0,0,True),(0,0,False),(1,1,False),(2,2,False),(3,2,False)]#(batch,kern,patch)
-        do_speed_test = False
-
-        # TODO: this version show a bug that was fixed
-        # the test is included in the upper test.
-#        imshp_start = (1,4,4)
-#        kshps = ([2,2],[2,2])#,[7,4])
-#        nkerns = [2,2] # per output pixel
-#        ssizes = [(1,1),(2,2)]#2,2)]
-
-#        bsizes = [1,1] # batch size
-#        imshp_starts = [(1,10,10),(1,5,6)]
-#        kshpss = ([[2,3],[3,2]],[[2,2],[2,2]])
-#        nkernss = [[1,1],[1,1]] # per output pixel
-
-        N.set_printoptions(threshold=N.nan)
-
-        # symbolic stuff
-        kerns = [T.matrix(),T.dmatrix()]
-        img = T.dmatrix()
-        rng = N.random.RandomState(3423489)
-        tctot, tpytot, ntot = [], [], []
-        for i in range(len(kshpss)):
-            assert len(kshpss[i])==len(nkernss[i])==len(kerns)
-
-        if do_speed_test:
+    def speed_multilayer_conv(self):
            # calculate the speed up of different combination of unroll
            # put the paramter to the same you will try. 
            
            validate=False# we don't validate the result to have it much faster!
-
+            verbose=1
            unroll_batch = [1,2,4,5,10,20]
            unroll_kern = [1,2,4,5,10,20]
            unroll_batch = [1,4,5]
            unroll_kern = [1,4,5]
+            unroll_patch = [True, False]
            
            bsize = 20 # batch size
            imshp_start = (1,48,48)#un square shape to test more corner case.
@@ -381,15 +298,16 @@ class TestConvOp(unittest.TestCase):

            assert len(kshps)==len(nkerns)==len(kerns)
        
-            timing = N.zeros((len(unroll_batch),len(unroll_kern),3))
+            timing = N.zeros((len(unroll_batch),len(unroll_kern),3,len(convmodes)*len(ssizes)))
            t_b_k=[]
            #calculate the timing with unrolling

+            print 'time unroll batch kern'
            t_=[[ 7.60572791,  3.95069814,  3.74271464], [ 4.05631089,  2.90384555,  2.93613672], [ 3.90551591,  2.92595196,  3.00102282]]
-            best=[]
-            worst=[]
            best=[0.52690219879150391, 2.4266397953033447]
            worst=[0.92042708396911621, 6.8822150230407715]
+            best=[]
+            worst=[]
            t_=[]
            for unroll_b, n_b in zip(unroll_batch,range(len(unroll_batch))):
                for unroll_k, n_k in zip(unroll_kern,range(len(unroll_kern))):
@@ -398,30 +316,31 @@ class TestConvOp(unittest.TestCase):
                        tctot, tpytot, ntot=[],[],[]
                        for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
                            for ss, n_ss in zip(ssizes,range(len(ssizes))):
-                                tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=unroll_b, unroll_kern=unroll_k, validate=validate)
+                                tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=unroll_b, unroll_kern=unroll_k, validate=validate, verbose=verbose,do_print=False)
                                tctot+=[tctot_]
                                tpytot+=[tpytot_]
                                ntot+=[ntot_]
                        if unroll_b==4 and unroll_k==4:
-                            print "unroll 4/4",tctot
+                            #print "unroll 4/4",tctot
                            best=tctot
                        if unroll_b==1 and unroll_k==1:
-                            print "unroll 1/1",tctot
+                            #print "unroll 1/1",tctot
                            worst=tctot
-                        timing[n_b,n_k]=[sum(tctot), sum(tpytot), sum(ntot)]
+                        timing[n_b,n_k]=[tctot, tpytot, ntot]#[sum(tctot), sum(tpytot), sum(ntot)]
            if not t_:
-                t=timing[:,:,0]#We select only the c timing.
+                t=timing[:,:,0,:]#We select only the c timing.
            else:
                t=t_
            t=N.asarray(t)
            #calculate the old timing
+            print 'time old version'
            tctot_=[0.52555489540100098, 6.6634182929992676]
-#            tctot_=[]
            tctot,tpytot,ntot=[],[],[]
+            tctot_=[]
            if not tctot_:
                for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
                    for ss, n_ss in zip(ssizes,range(len(ssizes))):
-                        tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate)
+                        tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate, verbose=verbose,do_print=False)
                        tctot+=[tctot_]
                        tpytot+=[tpytot_]
                        ntot+=[ntot_]
@@ -432,29 +351,73 @@ class TestConvOp(unittest.TestCase):
            print "timing for unrolled version"
            print t_b_k
            print t
+            t_detail=t
+            t = t.sum(axis=2)
            print "max %.3fs"%t.max(), "max param(batch unloop size/kernel unloop size)", t_b_k[t.argmax()]
            print "min %.3fs"%t.min(), "min param(batch unloop size/kernel unloop size)", t_b_k[t.argmin()]
            print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t.min(),sum(tctot)/t.min())
            print worst/best,tctot/best

+            #calculate the timing of unroll_patch
+            print 'time unroll_patch'
            tctot_patch = []
+            tctot_patch_size = []
            for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
                for ss, n_ss in zip(ssizes,range(len(ssizes))):
-                     tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate,unroll_patch=2)
-                     tctot_patch += [tctot_]
+                    tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate,unroll_patch=True,verbose=verbose,do_print=False)
+                    tctot_patch += [tctot_]
+                    tctot_, tpytot_, ntot_ = exec_multilayer_conv_nnet(conv_mode, ss, bsize, imshp_start, kshps, nkerns, unroll_batch=0, unroll_kern=0, validate=validate,unroll_patch=True,verbose=verbose,do_print=False,unroll_patch_size=True)
+                    tctot_patch_size += [tctot_]

            t_patch=sum(tctot_patch)
-            print "unroll_patch time", tctot_patch
+            print "unroll_patch without shape time", tctot_patch
            print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t_patch,sum(tctot)/t_patch)
            print best/tctot_patch, worst/tctot_patch
+            t_patch_size=sum(tctot_patch_size)
+            print "unroll_patch with shape time", tctot_patch_size
+            print "speedup vs (1/1)%.3fx, vs old %.3fx"% (t.max()/t_patch_size,sum(tctot)/t_patch_size)
+            print best/tctot_patch_size, worst/tctot_patch_size
            
-            print best
-            print worst
-            print tctot
-            print tctot_patch
            return

-        
+    def test_multilayer_conv(self):
+        print '\n\n*************************************************'
+        print '           TEST MULTILAYER CONVOLUTION' 
+        print '*************************************************'
+
+        # fixed parameters
+        # test multiple configuration at the same time
+        bsizes = [6,6] # batch size
+        imshp_starts = [(1,13,14),(1,4,5)]
+        kshpss = ([[5,6],[7,4]],[[2,2],[2,2]])
+        nkernss = [[20,40],[2,2]] # per output pixel
+        ssizess = [[(1,1),(1,2)],[(1,1),(2,2)]]
+        convmodes = ['valid','full']
+        do_convolve2=True
+        unroll = [(0,0,True),(0,0,False),(1,1,False),(2,2,False),(3,2,False)]#(batch,kern,patch)
+
+        # TODO: this version show a bug that was fixed
+        # the test is included in the upper test.
+#        imshp_start = (1,4,4)
+#        kshps = ([2,2],[2,2])#,[7,4])
+#        nkerns = [2,2] # per output pixel
+#        ssizes = [(1,1),(2,2)]#2,2)]
+
+#        bsizes = [1,1] # batch size
+#        imshp_starts = [(1,10,10),(1,5,6)]
+#        kshpss = ([[2,3],[3,2]],[[2,2],[2,2]])
+#        nkernss = [[1,1],[1,1]] # per output pixel
+
+        N.set_printoptions(threshold=N.nan)
+
+        # symbolic stuff
+        kerns = [T.matrix(),T.dmatrix()]
+        img = T.dmatrix()
+        rng = N.random.RandomState(3423489)
+        tctot, tpytot, ntot = [], [], []
+        for i in range(len(kshpss)):
+            assert len(kshpss[i])==len(nkernss[i])==len(kerns)
+
        for i in range(len(kshpss)):
            for conv_mode, n_mode in zip(convmodes,range(len(convmodes))):
                for ss, n_ss in zip(ssizess[i],range(len(ssizess[i]))):

--- a/theano/sandbox/test_downsample.py
+++ b/theano/sandbox/test_downsample.py
 import unittest, sys, time
-import numpy as N
-import theano.tensor as T
+import numpy
+import theano.tensor as tensor
 from theano.tests import unittest_tools as utt
-from theano.sandbox.downsample import DownsampleFactorMax
+from theano.sandbox.downsample import DownsampleFactorMax, max_pool2D
 from theano import function, Mode

-def max_pool(images=None, imshp=None, maxpoolshp=None, ignore_border=True):
-    """Implements a max pooling layer

-    Uses the same API as sp.max_pool but uses the Downsample op instead.
+class TestDownsampleFactorMax(unittest.TestCase):
+    def setUp(self):
+        utt.seed_rng()

-    Takes as input a 2D tensor of shape batch_size x img_size and performs max pooling.
-    Max pooling downsamples by taking the max value in a given area, here defined by
-    maxpoolshp. Outputs a 2D tensor of shape batch_size x output_size.
+    @staticmethod
+    def numpy_max_pool2D(input, ds, ignore_border=False):
+        '''Helper function, implementing max_pool2D in pure numpy'''
+        if len(input.shape) < 2:
+            raise NotImplementedError('input should have at least 2 dim, shape is %s'\
+                    % str(input.shape))

-    Parameters are keyword arguments in order to use func_to_mod.
+        xi=0
+        yi=0
+        if not ignore_border:
+            if input.shape[-2] % ds[0]:
+                xi += 1
+            if input.shape[-1] % ds[1]:
+                yi += 1

-    @param images: 2D tensor containing images on which to apply convolution.
-                   Assumed to be of shape batch_size x img_size
-    @param imgshp: tuple containing image dimensions
-    @param maxpoolshp: tuple containing shape of area to max pool over
-    
-    @output out1: symbolic result (2D tensor)
-    @output out2: logical shape of the output
+        out_shp = list(input.shape[:-2])
+        out_shp.append(input.shape[-2]/ds[0]+xi)
+        out_shp.append(input.shape[-1]/ds[1]+yi)

-    """
-    if len(imshp) == 2:
-        imshp = (1,) + imshp
-    elif len(imshp)!=3:
-        raise NotImplementedError("!")
-    
-    # all these reshapes should happen in place
-    imrshp = T.stack(images.shape[0],
-                          *[T.as_tensor(x) for x in imshp])
-    imtensor = T.reshape(images, imrshp)
+        output_val = numpy.zeros(out_shp)

-    maxpop = DownsampleFactorMax(maxpoolshp, ignore_border)
-    rval = maxpop(imtensor)
+        for k in numpy.ndindex(input.shape[:-2]):
+            for i in range(output_val.shape[-2]):
+                ii =  i*ds[0]
+                for j in range(output_val.shape[-1]):
+                    jj = j*ds[1]
+                    patch = input[k][ii:ii+ds[0],jj:jj+ds[1]]
+                    output_val[k][i,j] = numpy.max(patch)
+        return output_val

-    return T.flatten(rval,2), maxpop.out_shape(imshp, maxpoolshp, ignore_border)
+    def test_DownsampleFactorMax(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())

-class TestDownsampleFactorMax(unittest.TestCase):
-    def test_maxpool(self):
-        # generate flatted images
+        # generate random images
        maxpoolshps = ((1,1),(2,2),(3,3),(2,3))
-        imval = N.random.rand(4,10,64,64)
-        images = T.dmatrix()
-        dmatrix4=T.TensorType('float64', (False, False, False, False))
-        images4=dmatrix4()
-        tctot, tpytot, ntot = [],[],[]
+        imval = rng.rand(4,10,64,64)
+        images = tensor.dtensor4()
+
        for maxpoolshp in maxpoolshps:
-            for border in [True,False]:
-                print 'maxpoolshp', maxpoolshp,'border', border
-           
-                # numeric verification
-                xi=0
-                yi=0
-                if not border:
-                    if imval.shape[-2] % maxpoolshp[0]:
-                        xi += 1
-                    if imval.shape[-1] % maxpoolshp[1]:
-                        yi += 1
-                my_output_val = N.zeros((imval.shape[0], imval.shape[1],
-                                         imval.shape[2]/maxpoolshp[0]+xi,
-                                         imval.shape[3]/maxpoolshp[1]+yi))
-            
-                time1=time.time()
-                for n in range(imval.shape[0]):
-                    for k in range(imval.shape[1]):
-                        for i in range(my_output_val.shape[2]):
-                            ii =  i*maxpoolshp[0]
-                            for j in range(my_output_val.shape[3]):
-                                jj = j*maxpoolshp[1]
-                                patch = imval[n,k,ii:ii+maxpoolshp[0],jj:jj+maxpoolshp[1]]
-                                my_output_val[n,k,i,j] = N.max(patch)
-                my_output_val = my_output_val.reshape(imval.shape[0],-1)
-                ntot+=[time.time()-time1]
-
-                # symbolic stuff
-            #### wrapper to DownsampleFactorMax op ####
-                output, outshp = max_pool(images, imval.shape[1:], maxpoolshp, border)
-                assert N.prod(my_output_val.shape[1:]) == N.prod(outshp)
-                assert N.prod(my_output_val.shape[1:]) == N.prod(outshp)
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+
+                ## Pure Numpy computation
+                numpy_output_val = self.numpy_max_pool2D(imval, maxpoolshp, ignore_border)
+
+                output = max_pool2D(images, maxpoolshp, ignore_border)
                f = function([images,],[output,])
-                imval2=imval.reshape(imval.shape[0],-1)
-                output_val = f(imval2)
-                assert N.all(output_val == my_output_val)
-                
+                output_val = f(imval)
+                assert numpy.all(output_val == numpy_output_val)
+
                #DownsampleFactorMax op
-                maxpool_op = DownsampleFactorMax(maxpoolshp, ignore_border=border)(images4)
-                f = function([images4],maxpool_op,mode=Mode(linker="py"))
-                f2 = function([images4],maxpool_op,mode=Mode(linker="c"))
-                f3 = function([images4],maxpool_op)#for when we want to use the debug mode
-                time1=time.time()
+                maxpool_op = DownsampleFactorMax(maxpoolshp, ignore_border=ignore_border)(images)
+                f = function([images], maxpool_op)
                output_val = f(imval)
-                tctot+=[time.time()-time1]
-                assert (N.abs(my_output_val.flatten()-output_val.flatten())<1e-5).all()
-                time1=time.time()
-                output_val = f2(imval)
-                tpytot+=[time.time()-time1]
-                assert (N.abs(my_output_val.flatten()-output_val.flatten())<1e-5).all()
-                output_val = f3(imval)
-
-        print 'Numpy processing time: %.3fs'%sum(ntot),ntot
-        print 'c Theano(DownsampleFactorMax) processing time: %.3fs'%sum(tctot),tctot
-        print 'py Theano(DownsampleFactorMax) processing time: %.3fs'%sum(tpytot),tpytot
-        d=N.asarray(ntot)/tctot
-        print 'speed up c theano(DownsampleFactorMax) vs manual: %.3f'%d.mean(),d
-        d=N.asarray(ntot)/tpytot
-        print 'speed up py theano(DownsampleFactorMax) vs manual: %.3f'%d.mean(),d
+                assert (numpy.abs(output_val - numpy_output_val) < 1e-5).all()

    def test_DownsampleFactorMax_grad(self):
-        # generate flatted images
+        rng = numpy.random.RandomState(utt.fetch_seed())
        maxpoolshps = ((1,1),(3,2),(2,3))
-        imval = N.random.rand(2,3,3,4) * 10.0 #more variance means numeric gradient will be more accurate
-        do_theano=True
+        imval = rng.rand(2,3,3,4) * 10.0 #more variance means numeric gradient will be more accurate
+
+        for maxpoolshp in maxpoolshps:
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+                def mp(input):
+                    return DownsampleFactorMax(maxpoolshp, ignore_border=ignore_border)(input)
+                utt.verify_grad(mp, [imval], rng=rng)
+
+    def test_max_pool2D_2D(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())
+
+        maxpoolshps = ((1,1),(3,2))
+        imval = rng.rand(4,7)
+        images = tensor.dmatrix()
+
+        for maxpoolshp in maxpoolshps:
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+                numpy_output_val = self.numpy_max_pool2D(imval, maxpoolshp, ignore_border)
+
+                output = max_pool2D(images, maxpoolshp, ignore_border)
+                output_val = function([images], output)(imval)
+                assert numpy.all(output_val == numpy_output_val)
+
+                def mp(input):
+                    return max_pool2D(input, maxpoolshp, ignore_border)
+                utt.verify_grad(mp, [imval], rng=rng)
+
+    def test_max_pool2D_3D(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())
+
+        maxpoolshps = [(1,2)]
+        imval = rng.rand(2,3,4)
+        images = tensor.dtensor3()
+
        for maxpoolshp in maxpoolshps:
-            for border in [True,False]:
-                print 'maxpoolshp', maxpoolshp, 'border', border
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+                numpy_output_val = self.numpy_max_pool2D(imval, maxpoolshp, ignore_border)
+
+                output = max_pool2D(images, maxpoolshp, ignore_border)
+                output_val = function([images], output)(imval)
+                assert numpy.all(output_val == numpy_output_val)
+
+                c = tensor.sum(output)
+                c_val = function([images], c)(imval)
+
+                g = tensor.grad(c, images)
+                g_val = function([images], [g.shape, tensor.min(tensor.min(tensor.min(g))), tensor.max(tensor.max(tensor.max(g)))])(imval)
+
                def mp(input):
-                    return DownsampleFactorMax(maxpoolshp, ignore_border=border)(input)
-                utt.verify_grad(mp, [imval])
+                    return max_pool2D(input, maxpoolshp, ignore_border)
+                utt.verify_grad(mp, [imval], rng=rng)
+
+
+    def test_max_pool2D_6D(self):
+        rng = numpy.random.RandomState(utt.fetch_seed())
+
+        maxpoolshps = [(3,2)]
+        imval = rng.rand(2,1,1,1,3,4)
+        images = tensor.TensorType('float64', [False]*6)()
+
+        for maxpoolshp in maxpoolshps:
+            for ignore_border in [True,False]:
+                print 'maxpoolshp =', maxpoolshp
+                print 'ignore_border =', ignore_border
+                numpy_output_val = self.numpy_max_pool2D(imval, maxpoolshp, ignore_border)
+
+                output = max_pool2D(images, maxpoolshp, ignore_border)
+                output_val = function([images], output)(imval)
+                assert numpy.all(output_val == numpy_output_val)
+
+                def mp(input):
+                    return max_pool2D(input, maxpoolshp, ignore_border)
+                utt.verify_grad(mp, [imval], rng=rng)
+
+

 if __name__ == '__main__':
-    t = TestDownsampleFactorMax("test_maxpool").run()
-    #t.test_maxpool()
-    from theano.tests import main
-#    main("test_sp")
+    unittest.main()
--- a/theano/scalar/basic.py
+++ b/theano/scalar/basic.py
@@ -1125,7 +1125,7 @@ inv = Inv(upgrade_to_float, name = 'inv')
 class Log(UnaryScalarOp):
    """ log base e """
    def impl(self, x):
-        return math.log(x)
+        return numpy.log(x)
    def grad(self, (x, ), (gz, )):
      if x.type in grad_types:
        return gz / x,

--- a/theano/tensor/basic.py
+++ b/theano/tensor/basic.py
@@ -330,6 +330,7 @@ class TensorType(Type):
        self.broadcastable = tuple(broadcastable)
        self.dtype_specs() # error checking is done there
        self.name = name
+        self.numpy_dtype = numpy.dtype(self.dtype)
        if shape is None:
          #backport self.shape = tuple((1 if b else None) for b in self.broadcastable)
            l=[]
@@ -360,16 +361,16 @@ class TensorType(Type):
        This function is not meant to be called in user code.  It is for
        `Linker` instances to use when running a compiled graph.
        """
-        _data = data
-        if strict:
+        if (type(data) is numpy.ndarray) and (data.dtype is self.numpy_dtype):
+            pass # fall through to ndim check
+        elif strict:
+            # this is its own subcase that doesn't fall through to anything
            if not isinstance(data, numpy.ndarray):
                raise TypeError("%s expected a ndarray object.", data, type(data))
            if not str(data.dtype) == self.dtype:
                raise TypeError("%s expected a ndarray object with dtype = %s (got %s)." % (self, self.dtype, data.dtype))
            if not data.ndim == self.ndim:
                raise TypeError("%s expected a ndarray object with %s dimensions (got %s)." % (self, self.ndim, data.ndim))
-            if self.filter_checks_isfinite and (not numpy.all(numpy.isfinite(data))):
-                raise TypeError("non-finite elements not allowed")

            if TensorType.use_shape:
                for si, di in zip(self.shape, data.shape):
@@ -378,11 +379,17 @@ class TensorType(Type):
                            self, self.shape, data.shape))
            return data
        else:
-            data = theano._asarray(data, dtype = self.dtype)
-        if not self.ndim == data.ndim:
+            data = theano._asarray(data, dtype = self.dtype) # TODO: consider padding the shape with ones
+            # to make it consistent with self.broadcastable (e.g. vector -> row)
+        if self.ndim != data.ndim:
            raise TypeError("Wrong number of dimensions: expected %s, got %s with shape %s." % (self.ndim, data.ndim, data.shape), data)
-        if any(b and d != 1 for d, b in zip(data.shape, self.broadcastable)):
-            raise TypeError("Non-unit value on shape on a broadcastable dimension.", data.shape, self.broadcastable)
+        i = 0
+        for b in self.broadcastable:
+            if b and data.shape[i] != 1:
+                raise TypeError("Non-unit value on shape on a broadcastable dimension.", data.shape, self.broadcastable)
+            i+=1
+        if self.filter_checks_isfinite and (not numpy.all(numpy.isfinite(data))):
+            raise ValueError("non-finite elements not allowed")
        return data

    def dtype_specs(self):
@@ -1826,14 +1833,16 @@ class Default(gof.Op):
    view_map = {0: [0]}
    def make_node(self, x, default):
        x, default = as_tensor_variable(x), as_tensor_variable(default)
-        assert x.type == default.type
+        if x.type != default.type:
+            raise TypeError('Both default() arguments must have same type', x, default)
        return gof.Apply(self, [x, default], [default.type()])
    def perform(self, node, (x, default), (out, )):
-      if x is None:
-        out[0] = default.copy()
-      else:
-        out[0] = x
-      #backport out[0] = default.copy() if x is None else x
+        if x is None:
+            # Why copy? Theano can't yet understand out[0] being a view of either x or default,
+            # so out[0] can be a view of x, but only a copy of default.
+            out[0] = default.copy()
+        else:
+            out[0] = x
 default = Default()
 setdefault = default # legacy

@@ -3588,8 +3597,10 @@ def verify_grad(op, pt, n_tests=2, rng=None, eps=None, tol=None, mode=None, cast

        o_fn = function(tensor_pt, o_output)
        o_fn_out = o_fn(*[p.copy() for p in pt])
-        
-        random_projection = rng.rand(*o_fn_out.shape)
+
+        # random_projection should not have elements that are too small (rng.rand is in
+        # [0, 1), so shift it to [0.5, 1.5)); otherwise too much precision is lost in the numerical gradient
+        random_projection = rng.rand(*o_fn_out.shape) + 0.5
        if cast_to_output_type:
            random_projection = numpy.array(random_projection,
                                            dtype=o_output.dtype)

--- a/theano/tensor/elemwise.py
+++ b/theano/tensor/elemwise.py
@@ -822,7 +822,14 @@ class CAReduce(Op):
        to_reduce = reversed(sorted(axis))
        if to_reduce:
            for dimension in to_reduce:
-                variable = self.ufunc.reduce(variable, dimension)
+                # If it's a zero-size array, use scalar_op.identity if available
+                if variable.shape[dimension] == 0:
+                    if hasattr(self.scalar_op, 'identity'):
+                        variable = self.scalar_op.identity
+                    else:
+                        raise ValueError("Input (%s) has zero-size on axis %s, but self.scalar_op (%s) has no attribute 'identity'" % (variable, dimension, self.scalar_op))
+                else:
+                    variable = self.ufunc.reduce(variable, dimension)
            output[0] = theano._asarray(variable, dtype = node.outputs[0].type.dtype)
        else:
            output[0] = numpy.copy(variable)

--- a/theano/tensor/tests/test_elemwise.py
+++ b/theano/tensor/tests/test_elemwise.py
@@ -133,6 +133,8 @@ class test_CAReduce(unittest.TestCase):
                           ((5, 6), (1, )),
                           ((5, 6), ()),
                           ((2, 3, 4, 5), (0, 1, 3)),
+                           ((5, 0), (0, )),
+                           ((5, 0), (1, )),
                           ((), ())]:
            x = TensorType('float64', [(entry == 1) for entry in xsh])('x')
            e = CAReduce(add, axis = tosum)(x)
@@ -149,7 +151,7 @@ class test_CAReduce(unittest.TestCase):

    def test_c(self):
        self.with_linker(gof.CLinker())
-        
+

 if __name__ == '__main__':
    unittest.main()