Commit b4c881d1 authored by Dumitru Erhan

merge

......@@ -43,8 +43,10 @@ Environment Variables
.. envvar:: THEANO_FLAGS
This is a list of comma-delimited key[=value] pairs that control Theano's behavior. A key that appears without an '=value' must be for a boolean value, and it acts as setting it to True.
This is a list of comma-delimited key[=value] pairs that control
Theano's behavior. A key that appears without an '=value' must be
for a boolean value, and it acts as setting it to True.
For example, in bash, you can override your :envvar:`THEANORC` defaults
for <myscript>.py by typing this:
......@@ -52,11 +54,15 @@ Environment Variables
THEANO_FLAGS='floatX=float32,device=gpu0,nvcc.fastmath' python <myscript>.py
If a value is defined several times in ``THEANO_FLAGS``,
the right-most definition is used. So, for instance, if
``THEANO_FLAGS='device=cpu,device=gpu0'``, then gpu0 will be used.
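In other words, later definitions simply overwrite earlier ones. A minimal stand-alone sketch of this parsing rule (plain Python, not Theano's actual implementation):

```python
def parse_theano_flags(flags):
    """Parse a THEANO_FLAGS-style string into a dict.

    Later (right-most) definitions overwrite earlier ones, and a key
    that appears without '=value' is treated as a boolean True.
    """
    result = {}
    for item in flags.split(','):
        if not item:
            continue
        key, sep, value = item.partition('=')
        # plain dict assignment naturally implements "right-most wins"
        result[key] = value if sep else True
    return result

print(parse_theano_flags('device=cpu,device=gpu0,nvcc.fastmath'))
# {'device': 'gpu0', 'nvcc.fastmath': True}
```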
.. envvar:: THEANORC
The location[s] of the .theanorc file[s] in ConfigParser format.
It defaults to ``$HOME/.theanorc``.
Here is the .theanorc equivalent to the THEANO_FLAGS in the example above:
.. code-block:: text
......@@ -70,10 +76,10 @@ Environment Variables
Multiple configuration files can be specified by separating them with ':'
characters (as in $PATH). Multiple configuration files will be merged,
with earlier (left-most) files taking priority over later files in the
with later (right-most) files taking priority over earlier files in the
case that multiple files specify values for a common configuration option.
For example, to override system-wide settings with personal ones,
set ``THEANORC=~/.theanorc:/etc/theanorc``
For example, to override system-wide settings with personal ones,
set ``THEANORC=/etc/theanorc:~/.theanorc``.
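This mirrors how Python's ``ConfigParser`` itself merges multiple files: values read later replace values read earlier. A small stand-alone sketch with made-up file contents (hypothetical paths, not the real theanorc files):

```python
import configparser
import os
import tempfile

def read_merged(paths):
    # ConfigParser.read() processes files left to right; a later file's
    # value for the same option replaces the earlier one.
    cfg = configparser.ConfigParser()
    cfg.read(paths)
    return cfg

with tempfile.TemporaryDirectory() as d:
    system_rc = os.path.join(d, 'etc_theanorc')
    user_rc = os.path.join(d, 'user_theanorc')
    with open(system_rc, 'w') as f:
        f.write('[global]\ndevice = cpu\n')
    with open(user_rc, 'w') as f:
        f.write('[global]\ndevice = gpu0\n')
    # The right-most file wins, so personal settings override system ones.
    cfg = read_merged([system_rc, user_rc])
    print(cfg.get('global', 'device'))  # gpu0
```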
The rest of this page describes some of the more common and important flags
that you might want to use. For the complete list (including documentation),
......
......@@ -58,7 +58,7 @@ file and run it.
import numpy
import time
vlen = 100000
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
......@@ -74,28 +74,31 @@ The program just computes the exp() of a bunch of random numbers.
Note that we use the `shared` function to
make sure that the input `x` is stored on the graphics device.
If I run this program (in thing.py) with device=cpu, my computer takes a little over 3 seconds, whereas on the GPU it takes just over 0.2 seconds. Note that the results are close but not identical! The GPU will not always produce the exact same floating-point numbers as the CPU.
If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds,
whereas on the GPU it takes just over 0.4 seconds. Note that the results are close but not
identical! The GPU will not always produce the exact same floating-point numbers as the CPU.
As a point of reference, a loop that calls ``numpy.exp(x.value)`` also takes about 7 seconds.
.. code-block:: text
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu python thing.py
Looping 100 times took 3.12647008896 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 1.74085572 2.55530456 1.88906098]
Looping 100 times took 7.17374897003 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753 1.62323285]
bergstra@tikuanyin:~/tmp$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0 python thing.py
Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.217401981354 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 1.74085569 2.55530477 1.88906097]
Looping 100 times took 0.418929815292 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
Returning a handle to device-allocated data
-------------------------------------------
The speedup is not greater in the example above because the function is
returning its result as a numpy ndarray (which has already copied from the
device to the host). This is what makes it so easy to swap in device=gpu0, but
if you want to be less portable, you can see a bigger speedup by changing
returning its result as a numpy ndarray which has already been copied from the
device to the host for your convenience. This is what makes it so easy to swap in device=gpu0, but
if you don't mind being less portable, you might prefer to see a bigger speedup by changing
the graph to express a computation with a GPU-stored result. The gpu_from_host
op means "copy the input from the host to the gpu" and it is optimized away
Op means "copy the input from the host to the gpu" and it is optimized away
after the T.exp(x) is replaced by a GPU version of exp().
.. code-block:: python
......@@ -105,7 +108,7 @@ after the T.exp(x) is replaced by a GPU version of exp().
import numpy
import time
vlen = 100000
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
......@@ -123,17 +126,71 @@ The output from this program is
.. code-block:: text
Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.173671007156 seconds
Looping 100 times took 0.185714006424 seconds
Result is <CudaNdarray object at 0x3e9e970>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 1.74085569 2.55530477 1.88906097]
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
Here we've shaved off about 20% of the run-time by simply not copying the
Here we've shaved off about 50% of the run-time by simply not copying the
resulting array back to the host.
The object returned by each function call is now not a numpy array but a
"CudaNdarray" which can be converted to a numpy ndarray by the normal
numpy casting mechanism.
Running the GPU at Full Speed
------------------------------
To really get maximum performance in this simple example, we need to use an :class:`Out`
instance to tell Theano not to copy the output it returns to us. Theano allocates memory for
internal use like a working buffer, but by default it will never return a result that is
allocated in the working buffer. This is normally what you want, but our example is so simple
that it has the unwanted side effect of really slowing things down.
..
TODO:
The story here about copying and working buffers is misleading and potentially not correct
... why exactly does borrow=True cut 75% of the runtime ???
.. code-block:: python
from theano import function, config, shared, sandbox, Out
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([],
Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
borrow=True))
t0 = time.time()
for i in xrange(iters):
r = f()
print 'Looping 100 times took', time.time() - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
Running this version of the code takes just under 0.05 seconds, over 140x faster than
the CPU implementation!
.. code-block:: text
Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.0497219562531 seconds
Result is <CudaNdarray object at 0x31eeaf0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
This version of the code using ``borrow=True`` is slightly less safe, because if we had saved
the `r` returned from one function call, we would have to remember that its value might
be over-written by a subsequent function call. Although borrow=True makes a dramatic difference in this example,
be careful! The advantage of
borrow=True is much weaker in larger graphs, and there is a lot of potential for making a
mistake by failing to account for the resulting memory aliasing.
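The aliasing hazard can be illustrated without any GPU: any function that hands back a handle to its reused internal buffer behaves the same way. A toy stand-in for borrow=True (not Theano code):

```python
class BorrowingFunction:
    """Toy model of a compiled function that returns its working buffer.

    Like borrow=True, the caller gets a direct handle to internal
    storage, so a later call silently overwrites earlier results.
    """
    def __init__(self):
        self._buffer = [0.0]

    def __call__(self, x):
        self._buffer[0] = x * 2.0
        return self._buffer  # borrowed: no defensive copy is made

f = BorrowingFunction()
r1 = f(1.0)
print(r1[0])  # 2.0
r2 = f(3.0)
# r1 aliases the same buffer as r2 -- its value has silently changed!
print(r1[0])  # 6.0
print(r1 is r2)  # True
```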
What can be accelerated on the GPU?
------------------------------------
......
......@@ -428,9 +428,20 @@ class Function(object):
# Reinitialize each container's 'provided' counter
for c in self.input_storage:
c.provided = 0
# Set positional arguments
for i, arg in enumerate(args):
self[i] = arg
i = 0
for arg in args:
#TODO: provide a Param option for skipping the filter if we
# really want speed.
s = self.input_storage[i]
if arg is None:
s.storage[0] = arg
else:
s.storage[0] = s.type.filter(arg, strict=s.strict)
s.provided += 1
i+=1
# Set keyword arguments
for k, arg in kwargs.iteritems():
self[k] = arg
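The container logic above can be summarized in a small sketch; the class and type names here are hypothetical simplifications of what Function and its input containers actually do:

```python
class Container:
    """Simplified input container: one storage cell plus bookkeeping."""
    def __init__(self, typ, strict=False):
        self.type = typ
        self.strict = strict
        self.storage = [None]
        self.provided = 0

class IntType:
    @staticmethod
    def filter(value, strict=False):
        # strict mode rejects anything that isn't already the right type
        if strict and not isinstance(value, int):
            raise TypeError('expected int, got %r' % (value,))
        return int(value)

def set_positional_args(input_storage, args):
    # Mirrors the loop above: None bypasses filtering, everything else
    # is validated/converted by the container's type before storage.
    for s, arg in zip(input_storage, args):
        if arg is None:
            s.storage[0] = None
        else:
            s.storage[0] = s.type.filter(arg, strict=s.strict)
        s.provided += 1

storage = [Container(IntType()), Container(IntType())]
set_positional_args(storage, ['7', None])
print([s.storage[0] for s in storage])  # [7, None]
```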
......@@ -448,7 +459,9 @@ class Function(object):
self.inv_finder[c]))
# Do the actual work
t0_fn = time.time()
self.fn()
dt_fn = time.time() - t0_fn
# Retrieve the values that were computed
outputs = [x.data for x in self.output_storage]
......@@ -486,6 +499,9 @@ class Function(object):
self.maker.mode.fct_call_time[self.name] += dt_call
self.maker.mode.fct_call[self.name] += 1
self.maker.mode.call_time += dt_call
self.maker.mode.fn_time += dt_fn
if self.return_none:
return None
elif self.unpack_single and len(outputs) == 1:
......
......@@ -172,6 +172,8 @@ class Mode(object):
if isinstance(optimizer, gof.Query):
self.provided_optimizer = optimizer
self._optimizer = optimizer
self.call_time = 0
self.fn_time = 0
def __str__(self):
return "Mode(linker = %s, optimizer = %s)" % (self.provided_linker, self.provided_optimizer)
......
import time, atexit, copy
from theano.gof.link import WrapLinkerMany
from theano.gof.link import WrapLinker
from theano.gof.cutils import run_cthunk
from theano.compile.mode import Mode, register_mode, predefined_modes, predefined_linkers, predefined_optimizers, default_linker, default_optimizer
from theano.gof.cc import OpWiseCLinker
from theano.gof.python25 import any
from theano import gof
from theano.configparser import config, AddConfigVar, IntParam
from theano.compile.function_module import FunctionMaker
import_time = time.time()
......@@ -18,44 +19,57 @@ AddConfigVar('ProfileMode.n_ops_to_print',
"Number of ops to print by default",
IntParam(20, lambda i: i > 0))
class Profile_Maker(FunctionMaker):
def create(self, input_storage=None, trustme=False):
ret = super(Profile_Maker,self).create(input_storage, trustme)
for i, node in enumerate(ret.maker.env.toposort()):
self.mode.apply_time[(i,node.op)]=0.0
self.mode.apply_call[(i,node.op)]=0
# self.mode.op_cimpl[node.op] =
return ret
class ProfileMode(Mode):
def __init__(self, linker=default_linker, optimizer=default_optimizer):
local_time = [0.0]
apply_time = {}
apply_call = {}
op_time = {}
op_cimpl = {}
op_call = {}
compile_time = 0 #time passed in theano.function()
fct_call_time = {}#time passed inside theano fct call including op time.
fct_call = {}
self.__setstate__((linker, optimizer, local_time,
apply_time, apply_call,
op_time, op_cimpl, op_call,
op_cimpl,
compile_time, fct_call_time, fct_call))
def function_maker(self, i,o,m, *args, **kwargs):
"""Return an instance of `Profiler_Maker` which init the count"""
assert m is self
return Profile_Maker(i, o, self, *args, **kwargs)
def __getstate__(self):
#print "__getstate__",self.provided_linker,self.provided_optimizer
return (self.provided_linker, self.provided_optimizer, self.local_time,
self.apply_time, self.apply_call,
self.op_time, self.op_cimpl, self.op_call, self.compile_time, self.fct_call_time, self.fct_call)
self.op_cimpl, self.compile_time, self.fct_call_time, self.fct_call)
def __setstate__(self, (linker, optimizer, local_time,
apply_time, apply_call,
op_time, op_cimpl, op_call,
op_cimpl,
compile_time, fct_call_time, fct_call)):
self.local_time = local_time
self.apply_time = apply_time
self.apply_call = apply_call
self.op_time = op_time
self.op_cimpl = op_cimpl
self.op_call = op_call
self.compile_time = compile_time
self.fct_call_time = fct_call_time
self.fct_call = fct_call
self.call_time = 0
self.fn_time = 0
def blah(i, node, th):
if hasattr(th, 'cthunk'):
......@@ -63,7 +77,7 @@ class ProfileMode(Mode):
failure = run_cthunk(th.cthunk)
dt = time.time() - t0
if failure:
raise RuntimeError(('A C Op raised an exception. PerformLinker cannot'
raise RuntimeError(('A C Op raised an exception. PROFILE_MODE cannot'
' tell you what it was though. Use a standard mode such as'
' FAST_RUN_NOGC to correct the problem.'))
else:
......@@ -72,11 +86,9 @@ class ProfileMode(Mode):
dt = time.time() - t0
local_time[0] += dt
apply_time[(i,node.op)] = apply_time.get((i,node.op), 0.0) + dt
apply_call[(i,node.op)] = apply_call.get((i,node.op), 0) + 1
op_time[node.op] = op_time.get(node.op, 0.0) + dt
apply_time[(i,node.op)] += dt
apply_call[(i,node.op)] += 1
op_cimpl[node.op] = hasattr(th, 'cthunk')
op_call[node.op] = op_call.get(node.op,0) + 1
self.provided_linker = linker
......@@ -84,7 +96,7 @@ class ProfileMode(Mode):
if isinstance(linker, str) or linker is None:
linker = predefined_linkers[linker]
linker = WrapLinkerMany([linker], [blah])
linker = WrapLinker([linker], blah)
self.linker = linker
if isinstance(optimizer, str) or optimizer is None:
......@@ -113,18 +125,11 @@ class ProfileMode(Mode):
fct_call = self.fct_call
apply_time = self.apply_time
apply_call = self.apply_call
op_time = self.op_time
op_call = self.op_call
op_cimpl = self.op_cimpl
op_flops = {}
for a,t in op_time.items():
if hasattr(a,'flops'):
op_flops[a]=a.flops*op_call[a]/t/1e6
self.print_summary_("print_summary",local_time, compile_time, fct_call_time, fct_call,
apply_time, apply_call, op_time, op_call, op_cimpl,
op_flops, n_apply_to_print, n_ops_to_print)
apply_time, apply_call, op_cimpl,
n_apply_to_print, n_ops_to_print)
def print_diff_summary(self, other, n_apply_to_print=15, n_ops_to_print=20):
......@@ -153,42 +158,23 @@ class ProfileMode(Mode):
r[a]+=t
return r
def diff_dict_flops(a_time,b_time_,a_call,b_call):
flops = {}
b_time = copy.copy(b_time_)
for a,ta in a_time.items():
tb = b_time.pop(a,0)
if hasattr(a,'flops'):
flops[a]=a.flops*a_call[a]/ta - a.flops*b_call[a]/tb/1e6
#they are missing in a
for b,tb in b_time.items():
if hasattr(b,'flops'):
flops[b]=b.flops*b_call[b]/tb/1e6
return flops
local_time = self.local_time[0]-other.local_time[0]
compile_time = self.compile_time-other.compile_time
fct_call_time = diff_dict(self.fct_call_time,other.fct_call_time)
fct_call = diff_dict(self.fct_call,other.fct_call)
apply_time = diff_dict(self.apply_time, other.apply_time)
apply_call = diff_dict(self.apply_call, other.apply_call)
op_time = diff_dict(self.op_time, other.op_time)
op_call = diff_dict(self.op_call, other.op_call)
op_cimpl = self.op_cimpl and other.op_cimpl
op_flops = diff_dict_flops(self.op_time, other.op_time, self.op_call, other.op_call)
self.print_summary_("print_diff_summary",local_time, compile_time, fct_call_time, fct_call,
apply_time, apply_call, op_time, op_call, op_cimpl,
op_flops, n_apply_to_print=n_apply_to_print,
apply_time, apply_call, op_cimpl,
n_apply_to_print=n_apply_to_print,
n_ops_to_print=n_ops_to_print, print_apply=False)
@staticmethod
def print_summary_(fct_name, local_time, compile_time, fct_call_time, fct_call,
apply_time, apply_call, op_time, op_call, op_cimpl,
op_flops=None, n_apply_to_print=15, n_ops_to_print=20, print_apply=True):
apply_time, apply_call, op_cimpl,
n_apply_to_print=15, n_ops_to_print=20, print_apply=True):
"""
do the actual printing of print_summary and print_diff_summary.
......@@ -218,6 +204,19 @@ class ProfileMode(Mode):
sum(f for f, t, a, nb_call in atimes[n_apply_to_print:])*100,
sum(t for f, t, a, nb_call in atimes[n_apply_to_print:]))
op_time = {}
op_call = {}
for (i,a),t in apply_time.items():
op_time.setdefault(a,0)
op_call.setdefault(a,0)
op_time[a]+=t
op_call[a]+=apply_call[(i,a)]
op_flops = {}
for a,t in op_time.items():
if hasattr(a,'flops'):
op_flops[a]=a.flops*op_call[a]/t/1e6
flops_msg=''
if op_flops:
flops_msg=' <MFlops/s>'
......
......@@ -544,35 +544,20 @@ class Test_check_isfinite(unittest.TestCase):
theano.tensor.TensorType.filter_checks_isfinite = self.old_val
def test_check_isfinite(self):
x = theano.tensor.dvector()
x = theano.tensor.vector()
f = theano.function([x], (x+2) * 5, mode='DEBUG_MODE')
g = theano.function([x], theano.tensor.log(x), mode='DEBUG_MODE')
# this should work
f(numpy.log([3, 4, 5]))
# this should raise InvalidValueError
try:
# insert a NaN
f(numpy.log([3, -4, 5]))
assert False
except debugmode.InvalidValueError:
pass
# this should raise InvalidValueError
try:
# insert an Nan and Inf
f(numpy.asarray([0, 1.0, 0])/0)
assert False
except debugmode.InvalidValueError:
pass
# passing an invalid value as an input should trigger ValueError
self.failUnlessRaises(ValueError, f, numpy.log([3, -4, 5]))
self.failUnlessRaises(ValueError, f, numpy.asarray([0, 1.0, 0])/0)
self.failUnlessRaises(ValueError, f, numpy.asarray([1.0, 1.0, 1.0])/0)
# this should raise InvalidValueError
try:
# insert several Inf
f(numpy.asarray([1.0, 1.0, 1.0])/0)
assert False
except debugmode.InvalidValueError:
pass
# generating an invalid value internally should trigger InvalidValueError
self.failUnlessRaises(debugmode.InvalidValueError, g, [3,-4,5])
# this should disable the exception
theano.tensor.TensorType.filter_checks_isfinite = False
......
......@@ -14,11 +14,12 @@ THEANO_FLAGS=os.getenv("THEANO_FLAGS","")
# [section.]option[=value] entries. If the section part is omitted, there should be only one
# section that contains the given option.
# THEANORC=~/.theanorc:~lisa/.theanorc
# THEANORC can contain a colon-delimited list of config files, like
# THEANORC=~lisa/.theanorc:~/.theanorc
# In that case, definitions in files on the right (here, ~/.theanorc) have
# precedence over those in files on the left.
def config_files_from_theanorc():
rval = [os.path.expanduser(s) for s in os.getenv('THEANORC', '~/.theanorc').split(':')]
rval.reverse()
print "THEANORC", rval
return rval
theano_cfg = ConfigParser.SafeConfigParser()
theano_cfg.read(config_files_from_theanorc())
......@@ -42,14 +43,15 @@ def fetch_val_for_key(key):
"""Return the overriding config value for a key.
A successful search returns a string value.
An unsuccessful search raises a KeyError.
The priority order is:
The (decreasing) priority order is:
- THEANO_FLAGS
- ~/.theanorc
"""
# first try to find it in the FLAGS
rval = None
for name_val in THEANO_FLAGS.split(','):
if not name_val:
continue
......@@ -60,7 +62,12 @@ def fetch_val_for_key(key):
name, val = name_val_tuple
if name == key:
return val
# rval might be overridden by a later definition in THEANO_FLAGS
rval = val
# If an rval is found, it should be a string
if rval is not None:
return rval
# next try to find it in the config file
......@@ -77,7 +84,7 @@ def fetch_val_for_key(key):
return theano_cfg.get(section, option)
except (ConfigParser.NoOptionError, ConfigParser.NoSectionError):
raise KeyError(key)
class TheanoConfigParser(object):
#properties are installed by AddConfigVar
......@@ -143,7 +150,7 @@ class ConfigParam(object):
self.val = val
deleter=None
class EnumStr(ConfigParam):
def __init__(self, default, *options):
self.default = default
......
......@@ -222,7 +222,7 @@ class PureType(object):
try:
self.filter(a, True)
return True
except TypeError:
except (TypeError, ValueError):
return False
def make_variable(self, name = None):
......
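Since ``filter`` may now raise ValueError (for example, for non-finite values) as well as TypeError, the validity check must catch both. A minimal sketch of the pattern, with a hypothetical ``filter_value`` standing in for the real Type.filter:

```python
import math

def filter_value(x, check_finite=True):
    # Two distinct failure modes: a wrong type raises TypeError,
    # a non-finite float raises ValueError.
    if not isinstance(x, float):
        raise TypeError('expected float, got %r' % (x,))
    if check_finite and not math.isfinite(x):
        raise ValueError('non-finite elements not allowed')
    return x

def is_valid_value(x):
    # Mirrors the change above: catch both exception types, otherwise
    # a NaN would escape the validity check instead of returning False.
    try:
        filter_value(x)
        return True
    except (TypeError, ValueError):
        return False

print(is_valid_value(1.5))           # True
print(is_valid_value('a'))           # False
print(is_valid_value(float('nan')))  # False
```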
......@@ -18,18 +18,22 @@ def _asarray(a, dtype=None, order=None):
Currently, this issue has only been causing trouble when the target
data type is 'int32', on some computers. As a result, this is the only
situation where we do more than a simple call to ``numpy.asarray``. If it
turns out that a similar problem can occur for more data types, this
situation where we may do more than a simple call to ``numpy.asarray``. If
it turns out that a similar problem can occur for more data types, this
function should be updated accordingly.
This function's name starts with a '_' to indicate that it is meant to be
used internally. It is imported so as to be available directly through
theano._asarray
"""
dtype = numpy.dtype(dtype) # Convert into dtype object.
rval = numpy.asarray(a, dtype=dtype, order=order)
if dtype is numpy.int32 or dtype == 'int32':
# Make sure the type is properly set to the correct type.
return rval.view(dtype=numpy.int32)
numpy_int32 = numpy.dtype(numpy.int32)
if (dtype is numpy_int32 and rval.dtype is not numpy_int32):
# Enforce the numpy.int32 dtype.
return rval.view(dtype=numpy_int32)
else:
# Using ``numpy.asarray`` should work just fine.
# Debug assert if we want to detect other failure cases (untested):
# assert rval.dtype is dtype
return rval
Diff is collapsed.
......@@ -5,14 +5,15 @@ from theano import config
import logging, copy
_logger_name = 'theano.sandbox.cuda'
_logger = logging.getLogger(_logger_name)
_logger.setLevel(logging.INFO)
_logger.addHandler(logging.StreamHandler())
_logger.setLevel(logging.WARNING)
def error(*msg):
_logger.warning('ERROR (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
def warning(*msg):
_logger.warning(_logger_name+'WARNING: '+' '.join(str(m) for m in msg))
_logger.warning('WARNING (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
def info(*msg):
_logger.info(_logger_name+'INFO: '+' '.join(str(m) for m in msg))
_logger.warning('INFO (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
def debug(*msg):
_logger.debug(_logger_name+'DEBUG: '+' '.join(str(m) for m in msg))
_logger.warning('DEBUG (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
# Compile cuda_ndarray.cu
......@@ -63,23 +64,32 @@ if not compile_cuda_ndarray:
except ImportError:
compile_cuda_ndarray = True
if compile_cuda_ndarray:
import nvcc_compiler
if not nvcc_compiler.is_nvcc_available():
set_cuda_disabled()
try:
if compile_cuda_ndarray:
import nvcc_compiler
if not nvcc_compiler.is_nvcc_available():
set_cuda_disabled()
if enable_cuda:
code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()
if enable_cuda:
code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()
if not os.path.exists(cuda_ndarray_loc):
os.makedirs(cuda_ndarray_loc)
if not os.path.exists(cuda_ndarray_loc):
os.makedirs(cuda_ndarray_loc)
nvcc_compiler.nvcc_module_compile_str('cuda_ndarray', code, location = cuda_ndarray_loc,
include_dirs=[cuda_path], libs=['cublas'])
nvcc_compiler.nvcc_module_compile_str('cuda_ndarray', code, location = cuda_ndarray_loc,
include_dirs=[cuda_path], libs=['cublas'])
from cuda_ndarray.cuda_ndarray import *
from cuda_ndarray.cuda_ndarray import *
except Exception, e:
error( "Failed to compile cuda_ndarray.cu: %s" % str(e))
set_cuda_disabled()
if enable_cuda:
#check if there is an old cuda_ndarray that was loaded instead of the one we compiled!
import cuda_ndarray.cuda_ndarray
if os.path.join(config.compiledir,'cuda_ndarray','cuda_ndarray.so')!=cuda_ndarray.cuda_ndarray.__file__:
_logger.warning("WARNING: cuda_ndarray was loaded from",cuda_ndarray.cuda_ndarray.__file__,"This is not expected as theano should compile it automatically for you. Do you have a directory called cuda_ndarray in your LD_LIBRARY_PATH environment variable? If so, please remove it as it is outdated!")
from theano.sandbox.cuda.type import CudaNdarrayType
from theano.sandbox.cuda.var import (CudaNdarrayVariable,
CudaNdarrayConstant,
......@@ -103,7 +113,7 @@ def use(device=config.device):
raise ValueError("Invalid device identifier", device)
if use.device_number is None:
# No successful call to use() has been made yet
if device=="-1" or device=="CPU":
if device<0:
return
if device in [None,""]:
device=0
......@@ -134,6 +144,5 @@ def handle_shared_float32(tf):
else:
raise NotImplementedError('removing our handler')
if enable_cuda and config.device.startswith('gpu'):
use()
......@@ -6,6 +6,13 @@ from theano import config
_logger=logging.getLogger("theano.sandbox.cuda.nvcc_compiler")
_logger.setLevel(logging.WARN)
from theano.configparser import config, AddConfigVar, StrParam
AddConfigVar('nvcc.compiler_bindir',
"if defined, nvcc compiler driver will seek g++ and gcc in this directory",
StrParam(""))
def error(*args):
#sys.stderr.write('ERROR:'+ ' '.join(str(a) for a in args)+'\n')
_logger.error("ERROR: "+' '.join(str(a) for a in args))
......@@ -68,6 +75,8 @@ def nvcc_module_compile_str(module_name, src_code, location=None, include_dirs=[
debug('Generating shared lib', lib_filename)
# TODO: Why do these args cause failure on gtx285 that has 1.3 compute capability? '--gpu-architecture=compute_13', '--gpu-code=compute_13',
cmd = ['nvcc', '-shared', '-g'] + [pa for pa in preargs if pa.startswith('-O')]
if config.nvcc.compiler_bindir:
cmd.extend(['--compiler-bindir', config.nvcc.compiler_bindir])
cmd.extend(['-Xcompiler', ','.join(pa for pa in preargs if not pa.startswith('-O'))])
cmd.extend('-I%s'%idir for idir in include_dirs)
cmd.extend(['-o',lib_filename])
......
......@@ -140,20 +140,20 @@ def test_elemwise1():
b = tensor.fmatrix()
#let debugmode catch any mistakes
print >> sys.stderr, "STARTING FUNCTION 1"
print >> sys.stdout, "STARTING FUNCTION 1"
f = pfunc([b], [], updates=[(a, b**a)], mode=mode_with_gpu)
for i, node in enumerate(f.maker.env.toposort()):
print i, node
f(numpy.random.rand(*shape)+0.3)
print >> sys.stderr, "STARTING FUNCTION 2"
print >> sys.stdout, "STARTING FUNCTION 2"
#let debugmode catch any mistakes
f = pfunc([b], [], updates=[(a, tensor.exp(b**a))], mode=mode_with_gpu)
for i, node in enumerate(f.maker.env.toposort()):
print i, node
f(numpy.random.rand(*shape)+0.3)
print >> sys.stderr, "STARTING FUNCTION 3"
print >> sys.stdout, "STARTING FUNCTION 3"
#let debugmode catch any mistakes
f = pfunc([b], [], updates=[(a, a+b * tensor.exp(b**a))], mode=mode_with_gpu)
f(numpy.random.rand(*shape)+0.3)
......@@ -169,11 +169,11 @@ def test_elemwise2():
f = pfunc([b], [], updates=[(a, (a+b).dimshuffle(pattern))], mode=mode_with_gpu)
has_elemwise = False
for i, node in enumerate(f.maker.env.toposort()):
print >> sys.stderr, i, node
print >> sys.stdout, i, node
has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
assert not has_elemwise
#let debugmode catch errors
print >> sys.stderr, 'pattern', pattern
print >> sys.stdout, 'pattern', pattern
f(rng.rand(*shape)*.3)
shape = (3,4,5,6)
......@@ -204,7 +204,7 @@ def test_elemwise3():
b**a).dimshuffle([2,0,3,1]))], mode=mode_with_gpu)
has_elemwise = False
for i, node in enumerate(f.maker.env.toposort()):
print >> sys.stderr, i, node
print >> sys.stdout, i, node
has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
assert not has_elemwise
#let debugmode catch errors
......@@ -220,7 +220,7 @@ def test_elemwise4():
f = pfunc([b,c], [], updates=[(a, (a+b.dimshuffle('x', 0)*c.dimshuffle(0, 'x')))], mode=mode_with_gpu)
has_elemwise = False
for i, node in enumerate(f.maker.env.toposort()):
print >> sys.stderr, i, node
print >> sys.stdout, i, node
has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
assert not has_elemwise
#let debugmode catch errors
......
......@@ -360,7 +360,7 @@ def test_subsample():
def test_logical_shapes():
# implement when
print >> sys.stderr, "INFO: test_logical_shapes not implemented (i.e. imshp_logical, kshp_logical, kshp_logical_top_aligned)"
print >> sys.stderr, "WARNING TODO: test_logical_shapes not implemented (i.e. imshp_logical, kshp_logical, kshp_logical_top_aligned)"
def _test_dummy():
......
......@@ -8,7 +8,7 @@ if cuda_ndarray.enable_cuda == False:
import numpy
def test_host_to_device():
print >>sys.stderr, 'starting test_host_to_dev'
print >>sys.stdout, 'starting test_host_to_dev'
for shape in ((), (3,), (2,3), (3,4,5,6)):
a = theano._asarray(numpy.random.rand(*shape), dtype='float32')
b = cuda_ndarray.CudaNdarray(a)
......@@ -53,7 +53,7 @@ def test_add():
def test_exp():
print >>sys.stderr, 'starting test_exp'
print >>sys.stdout, 'starting test_exp'
for shape in ((), (3,), (2,3), (1,10000000),(10,1000000), (100,100000),(1000,10000),(10000,1000)):
a0 = theano._asarray(numpy.random.rand(*shape), dtype='float32')
a1 = a0.copy()
......@@ -74,25 +74,25 @@ def test_exp():
def test_copy():
print >>sys.stderr, 'starting test_copy'
print >>sys.stdout, 'starting test_copy'
shape = (5,)
a = theano._asarray(numpy.random.rand(*shape), dtype='float32')
print >>sys.stderr, '.. creating device object'
print >>sys.stdout, '.. creating device object'
b = cuda_ndarray.CudaNdarray(a)
print >>sys.stderr, '.. copy'
print >>sys.stdout, '.. copy'
c = copy.copy(b)
print >>sys.stderr, '.. deepcopy'
print >>sys.stdout, '.. deepcopy'
d = copy.deepcopy(b)
print >>sys.stderr, '.. comparisons'
print >>sys.stdout, '.. comparisons'
assert numpy.allclose(a, numpy.asarray(b))
assert numpy.allclose(a, numpy.asarray(c))
assert numpy.allclose(a, numpy.asarray(d))
def test_dot():
print >>sys.stderr, 'starting test_dot'
print >>sys.stdout, 'starting test_dot'
a0 = theano._asarray(numpy.random.rand(4, 7), dtype='float32')
a1 = theano._asarray(numpy.random.rand(7, 6), dtype='float32')
......@@ -101,7 +101,7 @@ def test_dot():
assert numpy.allclose(numpy.dot(a0, a1), cuda_ndarray.dot(b0, b1))
print >> sys.stderr, 'WARNING test_dot: not testing all 8 transpose cases of dot'
print >> sys.stderr, 'WARNING TODO test_dot: not testing all 8 transpose cases of dot'
def test_sum():
shape = (2,3)
......@@ -147,7 +147,7 @@ def test_reshape():
]
def subtest(shape_1, shape_2):
#print >> sys.stderr, "INFO: shapes", shape_1, shape_2
#print >> sys.stdout, "INFO: shapes", shape_1, shape_2
a = theano._asarray(numpy.random.rand(*shape_1), dtype='float32')
b = cuda_ndarray.CudaNdarray(a)
......
......@@ -147,7 +147,7 @@ class DownsampleFactorMaxGrad(Op):
def c_code_cache_version(self):
return ()
def max_pool2D(input, ds, ignore_border=False):
"""
Takes as input an N-D tensor, where N >= 2. It downscales the input image by
......@@ -166,7 +166,7 @@ def max_pool2D(input, ds, ignore_border=False):
# extract image dimensions
img_shape = input.shape[-2:]
# count the number of "leading" dimensions, store as dmatrix
batch_size = tensor.prod(input.shape[:-2])
batch_size = tensor.shape_padright(batch_size,1)
......
......@@ -1125,7 +1125,7 @@ inv = Inv(upgrade_to_float, name = 'inv')
class Log(UnaryScalarOp):
""" log base e """
def impl(self, x):
return math.log(x)
return numpy.log(x)
def grad(self, (x, ), (gz, )):
if x.type in grad_types:
return gz / x,
......
......@@ -330,6 +330,7 @@ class TensorType(Type):
self.broadcastable = tuple(broadcastable)
self.dtype_specs() # error checking is done there
self.name = name
self.numpy_dtype = numpy.dtype(self.dtype)
if shape is None:
#backport self.shape = tuple((1 if b else None) for b in self.broadcastable)
l=[]
......@@ -360,16 +361,16 @@ class TensorType(Type):
This function is not meant to be called in user code. It is for
`Linker` instances to use when running a compiled graph.
"""
_data = data
if strict:
if (type(data) is numpy.ndarray) and (data.dtype is self.numpy_dtype):
pass # fall through to ndim check
elif strict:
# this is its own subcase that doesn't fall through to anything
if not isinstance(data, numpy.ndarray):
raise TypeError("%s expected a ndarray object.", data, type(data))
if not str(data.dtype) == self.dtype:
raise TypeError("%s expected a ndarray object with dtype = %s (got %s)." % (self, self.dtype, data.dtype))
if not data.ndim == self.ndim:
raise TypeError("%s expected a ndarray object with %s dimensions (got %s)." % (self, self.ndim, data.ndim))
if self.filter_checks_isfinite and (not numpy.all(numpy.isfinite(data))):
raise TypeError("non-finite elements not allowed")
if TensorType.use_shape:
for si, di in zip(self.shape, data.shape):
......@@ -378,11 +379,17 @@ class TensorType(Type):
self, self.shape, data.shape))
return data
else:
data = theano._asarray(data, dtype = self.dtype)
if not self.ndim == data.ndim:
data = theano._asarray(data, dtype = self.dtype) #TODO: consider padding the shape with ones
# to make it consistent with self.broadcastable, like a vector->row promotion
if self.ndim != data.ndim:
raise TypeError("Wrong number of dimensions: expected %s, got %s with shape %s." % (self.ndim, data.ndim, data.shape), data)
if any(b and d != 1 for d, b in zip(data.shape, self.broadcastable)):
raise TypeError("Non-unit value on shape on a broadcastable dimension.", data.shape, self.broadcastable)
i = 0
for b in self.broadcastable:
if b and data.shape[i] != 1:
raise TypeError("Non-unit value on shape on a broadcastable dimension.", data.shape, self.broadcastable)
i+=1
if self.filter_checks_isfinite and (not numpy.all(numpy.isfinite(data))):
raise ValueError("non-finite elements not allowed")
return data
def dtype_specs(self):
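The broadcastable-dimension rule being enforced in ``filter`` can be exercised on its own; this sketch uses plain tuples instead of a TensorType:

```python
def check_broadcastable(shape, broadcastable):
    """Every dimension flagged broadcastable must have length exactly 1."""
    for dim, is_broadcastable in zip(shape, broadcastable):
        if is_broadcastable and dim != 1:
            raise TypeError(
                'Non-unit value on shape on a broadcastable dimension.',
                shape, broadcastable)

check_broadcastable((1, 5), (True, False))      # OK: broadcastable dim is 1
try:
    check_broadcastable((3, 5), (True, False))  # dim 0 should have been 1
except TypeError as e:
    print('rejected:', e.args[1], e.args[2])
```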
......@@ -1826,14 +1833,16 @@ class Default(gof.Op):
view_map = {0: [0]}
def make_node(self, x, default):
x, default = as_tensor_variable(x), as_tensor_variable(default)
assert x.type == default.type
if x.type != default.type:
raise TypeError('Both default() arguments must have same type', x, default)
return gof.Apply(self, [x, default], [default.type()])
def perform(self, node, (x, default), (out, )):
if x is None:
out[0] = default.copy()
else:
out[0] = x
#backport out[0] = default.copy() if x is None else x
if x is None:
# Why copy? Theano can't yet understand out[0] being a view of either x or default,
# so out[0] may be a view of x, but only a copy of default.
out[0] = default.copy()
else:
out[0] = x
default = Default()
setdefault = default # legacy
......@@ -3588,8 +3597,10 @@ def verify_grad(op, pt, n_tests=2, rng=None, eps=None, tol=None, mode=None, cast
o_fn = function(tensor_pt, o_output)
o_fn_out = o_fn(*[p.copy() for p in pt])
random_projection = rng.rand(*o_fn_out.shape)
# random_projection should not have elements too small,
# otherwise too much precision is lost in numerical gradient
random_projection = rng.rand(*o_fn_out.shape) + 0.5
if cast_to_output_type:
random_projection = numpy.array(random_projection,
dtype=o_output.dtype)
......
......@@ -822,7 +822,14 @@ class CAReduce(Op):
to_reduce = reversed(sorted(axis))
if to_reduce:
for dimension in to_reduce:
variable = self.ufunc.reduce(variable, dimension)
# If it's a zero-size array, use scalar_op.identity if available
if variable.shape[dimension] == 0:
if hasattr(self.scalar_op, 'identity'):
variable = self.scalar_op.identity
else:
raise ValueError("Input (%s) has zero-size on axis %s, but self.scalar_op (%s) has no attribute 'identity'" % (variable, dimension, self.scalar_op))
else:
variable = self.ufunc.reduce(variable, dimension)
output[0] = theano._asarray(variable, dtype = node.outputs[0].type.dtype)
else:
output[0] = numpy.copy(variable)
......
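Reducing over a zero-length axis has no data to fold, so the reduction must fall back to the scalar op's identity element (0 for add, 1 for mul). A simplified sketch of this fallback using ``functools.reduce`` (not the real CAReduce):

```python
from functools import reduce

def careduce(values, op, identity=None):
    # Reducing an empty sequence has no natural result; fall back to the
    # scalar op's identity element, as the change above does.
    if len(values) == 0:
        if identity is None:
            raise ValueError('zero-size reduction and op has no identity')
        return identity
    return reduce(op, values)

add = lambda a, b: a + b
print(careduce([1, 2, 3], add, identity=0))  # 6
print(careduce([], add, identity=0))         # 0
```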
......@@ -133,6 +133,8 @@ class test_CAReduce(unittest.TestCase):
((5, 6), (1, )),
((5, 6), ()),
((2, 3, 4, 5), (0, 1, 3)),
((5, 0), (0, )),
((5, 0), (1, )),
((), ())]:
x = TensorType('float64', [(entry == 1) for entry in xsh])('x')
e = CAReduce(add, axis = tosum)(x)
......@@ -149,7 +151,7 @@ class test_CAReduce(unittest.TestCase):
def test_c(self):
self.with_linker(gof.CLinker())
if __name__ == '__main__':
unittest.main()