Commit 86ec00c0 authored by Pascal Lamblin

merge

@@ -58,7 +58,7 @@ file and run it.
     import numpy
     import time
-    vlen = 100000
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
     iters = 1000
     rng = numpy.random.RandomState(22)
@@ -74,28 +74,31 @@ The program just computes the exp() of a bunch of random numbers.
 Note that we use the `shared` function to
 make sure that the input `x` is stored on the graphics device.
-If I run this program (in thing.py) with device=cpu, my computer takes a little over 3 seconds,
-whereas on the GPU it takes just over 0.2 seconds. Note that the results are close but not
-identical! The GPU will not always produce the exact same floating-point numbers as the CPU.
+If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds,
+whereas on the GPU it takes just over 0.4 seconds. Note that the results are close but not
+identical! The GPU will not always produce the exact same floating-point numbers as the CPU.
+As a point of reference, a loop that calls ``numpy.exp(x.value)`` also takes about 7 seconds.

 .. code-block:: text

     $ THEANO_FLAGS=mode=FAST_RUN,device=cpu python thing.py
-    Looping 100 times took 3.12647008896 seconds
-    Result is [ 1.23178032  1.61879341  1.52278065 ...,  1.74085572  2.55530456  1.88906098]
+    Looping 100 times took 7.17374897003 seconds
+    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753  1.62323285]
     bergstra@tikuanyin:~/tmp$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0 python thing.py
     Using gpu device 0: GeForce GTX 285
-    Looping 100 times took 0.217401981354 seconds
-    Result is [ 1.23178029  1.61879349  1.52278066 ...,  1.74085569  2.55530477  1.88906097]
+    Looping 100 times took 0.418929815292 seconds
+    Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761  1.62323296]
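The ``numpy.exp`` reference point mentioned in the new text is easy to reproduce without Theano at all. The sketch below is an assumed stand-alone version of that CPU baseline, written for modern Python (the diff's snippets are Python 2), with fewer iterations than the document's 1000 so it finishes quickly:

```python
import time

import numpy

vlen = 10 * 30 * 768  # same problem size as the Theano example
iters = 100           # fewer iterations than the doc's 1000, for a quick check

rng = numpy.random.RandomState(22)
x = numpy.asarray(rng.rand(vlen), dtype='float32')

t0 = time.time()
for i in range(iters):
    r = numpy.exp(x)  # pure-numpy equivalent of the compiled Theano function
dt = time.time() - t0

print('Looping %d times took %f seconds' % (iters, dt))
```

This is the "point of reference" loop only; the measured times in the document come from the author's GTX 285 machine and will differ on other hardware.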
 Returning a handle to device-allocated data
 -------------------------------------------

 The speedup is not greater in the example above because the function is
-returning its result as a numpy ndarray (which has already copied from the
-device to the host). This is what makes it so easy to swap in device=gpu0, but
-if you want to be less portable, you can see a bigger speedup by changing
+returning its result as a numpy ndarray which has already been copied from the
+device to the host for your convenience. This is what makes it so easy to swap in device=gpu0, but
+if you don't mind being less portable, you might prefer to see a bigger speedup by changing
 the graph to express a computation with a GPU-stored result. The gpu_from_host
-op means "copy the input from the host to the gpu" and it is optimized away
+Op means "copy the input from the host to the gpu" and it is optimized away
 after the T.exp(x) is replaced by a GPU version of exp().

 .. code-block:: python
@@ -105,7 +108,7 @@ after the T.exp(x) is replaced by a GPU version of exp().
     import numpy
     import time
-    vlen = 100000
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
     iters = 1000
     rng = numpy.random.RandomState(22)
@@ -123,17 +126,71 @@ The output from this program is

 .. code-block:: text

     Using gpu device 0: GeForce GTX 285
-    Looping 100 times took 0.173671007156 seconds
+    Looping 100 times took 0.185714006424 seconds
     Result is <CudaNdarray object at 0x3e9e970>
-    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  1.74085569  2.55530477  1.88906097]
+    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761  1.62323296]

-Here we've shaved off about 20% of the run-time by simply not copying the
+Here we've shaved off about 50% of the run-time by simply not copying the
 resulting array back to the host.
 The object returned by each function call is now not a numpy array but a
 "CudaNdarray" which can be converted to a numpy ndarray by the normal
 numpy casting mechanism.
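The "normal numpy casting mechanism" referred to here is numpy's array protocol: any object that exposes ``__array__`` can be handed to ``numpy.asarray``, which triggers the device-to-host copy. The class below is a hypothetical stand-in for CudaNdarray (not the real class, which is implemented in C), just to illustrate the conversion path:

```python
import numpy

class FakeDeviceArray(object):
    """Hypothetical stand-in for CudaNdarray: pretends to hold data on the
    device and exposes the numpy array protocol for host conversion."""
    def __init__(self, host_data):
        self._data = numpy.asarray(host_data, dtype='float32')

    def __array__(self, dtype=None, copy=None):
        # numpy.asarray() calls this to "copy the data back to the host"
        if dtype is None:
            return self._data
        return self._data.astype(dtype)

d = FakeDeviceArray([1.0, 2.0, 3.0])
h = numpy.asarray(d)  # conversion via the array protocol
```

The real CudaNdarray performs an actual cudaMemcpy at this point, which is exactly the cost the ``gpu_from_host`` trick avoids inside the loop.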
+Running the GPU at Full Speed
+------------------------------
+
+To really get maximum performance in this simple example, we need to use an :class:`Out`
+instance to tell Theano not to copy the output it returns to us. Theano allocates memory for
+internal use, like a working buffer, but by default it will never return a result that is
+allocated in the working buffer. This is normally what you want, but our example is so simple
+that it has the unwanted side-effect of really slowing things down.
+
+..
+  TODO:
+  The story here about copying and working buffers is misleading and potentially not correct
+  ... why exactly does borrow=True cut 75% of the runtime ???
+
+.. code-block:: python
+
+    from theano import function, config, shared, sandbox, Out
+    import theano.tensor as T
+    import numpy
+    import time
+
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
+    iters = 1000
+
+    rng = numpy.random.RandomState(22)
+    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
+    f = function([],
+            Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
+                borrow=True))
+    t0 = time.time()
+    for i in xrange(iters):
+        r = f()
+    print 'Looping 100 times took', time.time() - t0, 'seconds'
+    print 'Result is', r
+    print 'Numpy result is', numpy.asarray(r)
+
+Running this version of the code takes just under 0.05 seconds, over 140x faster than
+the CPU implementation!
+
+.. code-block:: text
+
+    Using gpu device 0: GeForce GTX 285
+    Looping 100 times took 0.0497219562531 seconds
+    Result is <CudaNdarray object at 0x31eeaf0>
+    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761  1.62323296]
+
+This version of the code, using ``borrow=True``, is slightly less safe, because if we had saved
+the `r` returned from one function call, we would have to take care and remember that its value
+might be over-written by a subsequent function call. Although borrow=True makes a dramatic
+difference in this example, be careful! The advantage of
+borrow=True is much weaker in larger graphs, and there is a lot of potential for making a
+mistake by failing to account for the resulting memory aliasing.
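The aliasing hazard described above can be reproduced in plain numpy: if a function returns its internal working buffer instead of a copy (the moral equivalent of ``borrow=True``), a value saved from one call is silently clobbered by the next. This is a toy sketch, not Theano code:

```python
import numpy

_buf = numpy.zeros(3)  # shared working buffer, like Theano's internal storage

def f_borrowed(x):
    """Writes into the shared buffer and returns it WITHOUT copying."""
    numpy.exp(numpy.asarray(x, dtype=float), out=_buf)
    return _buf

r1 = f_borrowed([0.0, 0.0, 0.0])   # exp -> [1, 1, 1]
saved = r1                          # an alias of the buffer, not a copy!
r2 = f_borrowed([1.0, 1.0, 1.0])   # overwrites the same buffer
# 'saved' no longer holds [1, 1, 1]; saved.copy() at call time would avoid this.
```

Calling ``saved.copy()`` right after the first call is the caller-side fix; the library-side fix is what Theano does by default, namely never returning its working buffer.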
 What can be accelerated on the GPU?
 ------------------------------------
......
@@ -428,9 +428,20 @@ class Function(object):
         # Reinitialize each container's 'provided' counter
         for c in self.input_storage:
             c.provided = 0

         # Set positional arguments
-        for i, arg in enumerate(args):
-            self[i] = arg
+        i = 0
+        for arg in args:
+            #TODO: provide a Param option for skipping the filter if we
+            #      really want speed.
+            s = self.input_storage[i]
+            if arg is None:
+                s.storage[0] = arg
+            else:
+                s.storage[0] = s.type.filter(arg, strict=s.strict)
+            s.provided += 1
+            i += 1

         # Set keyword arguments
         for k, arg in kwargs.iteritems():
             self[k] = arg
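The rewritten argument loop above inlines what ``self[i] = arg`` used to do through ``__setitem__``: fetch the input's storage container, run the type's ``filter`` on the value, and bump the ``provided`` counter. A self-contained sketch of that container/filter pattern (class names here are simplified stand-ins, not Theano's actual classes):

```python
class Container(object):
    """Simplified input-storage cell: a one-element storage list plus bookkeeping."""
    def __init__(self, type_, strict=False):
        self.type = type_
        self.strict = strict
        self.storage = [None]
        self.provided = 0

class FloatType(object):
    """Toy type whose filter validates/coerces values, like TensorType.filter."""
    def filter(self, value, strict=False):
        if strict and not isinstance(value, float):
            raise TypeError("strict mode: expected a float", value)
        return float(value)

def set_positional_args(input_storage, args):
    i = 0
    for arg in args:
        s = input_storage[i]
        if arg is None:
            s.storage[0] = arg        # None bypasses filtering entirely
        else:
            s.storage[0] = s.type.filter(arg, strict=s.strict)
        s.provided += 1
        i += 1

storage = [Container(FloatType()), Container(FloatType())]
set_positional_args(storage, [3, None])
```

The point of the inlining in the commit is to skip a Python-level ``__setitem__`` dispatch per argument on the hot call path.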
@@ -448,7 +459,9 @@ class Function(object):
                     self.inv_finder[c]))

         # Do the actual work
+        t0_fn = time.time()
         self.fn()
+        dt_fn = time.time() - t0_fn

         # Retrieve the values that were computed
         outputs = [x.data for x in self.output_storage]
@@ -486,6 +499,9 @@ class Function(object):
             self.maker.mode.fct_call_time[self.name] += dt_call
             self.maker.mode.fct_call[self.name] += 1
+            self.maker.mode.call_time += dt_call
+            self.maker.mode.fn_time += dt_fn

         if self.return_none:
             return None
         elif self.unpack_single and len(outputs) == 1:
......
@@ -172,6 +172,8 @@ class Mode(object):
         if isinstance(optimizer, gof.Query):
             self.provided_optimizer = optimizer
         self._optimizer = optimizer
+        self.call_time = 0
+        self.fn_time = 0

     def __str__(self):
         return "Mode(linker = %s, optimizer = %s)" % (self.provided_linker, self.provided_optimizer)
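The new `call_time` / `fn_time` counters split total call overhead from the time spent inside the compiled thunk, mirroring the ``t0_fn``/``dt_fn`` bracketing added to ``Function.__call__`` above. A sketch of that accounting pattern in plain Python (the `pre`/`post` hooks stand in for input filtering and output retrieval):

```python
import time

class TimedMode(object):
    def __init__(self):
        self.call_time = 0.0  # total time spent in __call__
        self.fn_time = 0.0    # time spent in the inner compiled fn only

def timed_call(mode, fn, pre=lambda: None, post=lambda: None):
    t0 = time.time()
    pre()                      # stands in for input filtering etc.
    t0_fn = time.time()
    result = fn()              # the actual compiled work
    mode.fn_time += time.time() - t0_fn
    post()                     # stands in for output retrieval
    mode.call_time += time.time() - t0
    return result

mode = TimedMode()
out = timed_call(mode, lambda: sum(range(1000)))
```

The gap between `call_time` and `fn_time` is exactly the Python-side overhead that changes like the inlined argument loop are trying to shrink.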
......
@@ -7,6 +7,7 @@ from theano.gof.cc import OpWiseCLinker
 from theano.gof.python25 import any
 from theano import gof
 from theano.configparser import config, AddConfigVar, IntParam
+from theano.compile.function_module import FunctionMaker

 import_time = time.time()
@@ -18,44 +19,57 @@ AddConfigVar('ProfileMode.n_ops_to_print',
         "Number of ops to print by default",
         IntParam(20, lambda i: i > 0))
+class Profile_Maker(FunctionMaker):
+    def create(self, input_storage=None, trustme=False):
+        ret = super(Profile_Maker, self).create(input_storage, trustme)
+        for i, node in enumerate(ret.maker.env.toposort()):
+            self.mode.apply_time[(i, node.op)] = 0.0
+            self.mode.apply_call[(i, node.op)] = 0
+            # self.mode.op_cimpl[node.op] =
+        return ret
+
 class ProfileMode(Mode):
     def __init__(self, linker=default_linker, optimizer=default_optimizer):
         local_time = [0.0]
         apply_time = {}
         apply_call = {}
-        op_time = {}
         op_cimpl = {}
-        op_call = {}
         compile_time = 0 #time passed in theano.function()
         fct_call_time = {}#time passed inside theano fct call including op time.
         fct_call = {}
         self.__setstate__((linker, optimizer, local_time,
                            apply_time, apply_call,
-                           op_time, op_cimpl, op_call,
+                           op_cimpl,
                            compile_time, fct_call_time, fct_call))
+
+    def function_maker(self, i, o, m, *args, **kwargs):
+        """Return an instance of `Profile_Maker` which initializes the counters."""
+        assert m is self
+        return Profile_Maker(i, o, self, *args, **kwargs)
     def __getstate__(self):
         #print "__getstate__",self.provided_linker,self.provided_optimizer
         return (self.provided_linker, self.provided_optimizer, self.local_time,
                 self.apply_time, self.apply_call,
-                self.op_time, self.op_cimpl, self.op_call, self.compile_time, self.fct_call_time, self.fct_call)
+                self.op_cimpl, self.compile_time, self.fct_call_time, self.fct_call)

     def __setstate__(self, (linker, optimizer, local_time,
                             apply_time, apply_call,
-                            op_time, op_cimpl, op_call,
+                            op_cimpl,
                             compile_time, fct_call_time, fct_call)):
         self.local_time = local_time
         self.apply_time = apply_time
         self.apply_call = apply_call
-        self.op_time = op_time
         self.op_cimpl = op_cimpl
-        self.op_call = op_call
         self.compile_time = compile_time
         self.fct_call_time = fct_call_time
         self.fct_call = fct_call
+        self.call_time = 0
+        self.fn_time = 0
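Dropping `op_time`/`op_call` from the state tuple changes the pickle format, so ``__getstate__`` and ``__setstate__`` must stay in lockstep, and counters that are not pickled get re-initialized on load. A minimal sketch of this tuple-state pattern with a toy class (not ProfileMode itself):

```python
import pickle

class Stats(object):
    def __init__(self):
        self.apply_time = {}
        self.apply_call = {}
        self.compile_time = 0

    def __getstate__(self):
        # state is a plain tuple; adding or removing a field changes the format
        return (self.apply_time, self.apply_call, self.compile_time)

    def __setstate__(self, state):
        (self.apply_time, self.apply_call, self.compile_time) = state
        # counters that are NOT part of the pickled state get reset here
        self.call_time = 0
        self.fn_time = 0

s = Stats()
s.apply_time[(0, 'exp')] = 1.5
s2 = pickle.loads(pickle.dumps(s))
```

One consequence worth noting: a ProfileMode pickled before this commit cannot be unpickled by the new ``__setstate__``, since the tuple arities differ.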
         def blah(i, node, th):
             if hasattr(th, 'cthunk'):
@@ -72,11 +86,9 @@ class ProfileMode(Mode):
             dt = time.time() - t0
             local_time[0] += dt
-            apply_time[(i,node.op)] = apply_time.get((i,node.op), 0.0) + dt
-            apply_call[(i,node.op)] = apply_call.get((i,node.op), 0) + 1
-            op_time[node.op] = op_time.get(node.op, 0.0) + dt
+            apply_time[(i,node.op)] += dt
+            apply_call[(i,node.op)] += 1
             op_cimpl[node.op] = hasattr(th, 'cthunk')
-            op_call[node.op] = op_call.get(node.op,0) + 1

         self.provided_linker = linker
@@ -113,18 +125,11 @@ class ProfileMode(Mode):
         fct_call = self.fct_call
         apply_time = self.apply_time
         apply_call = self.apply_call
-        op_time = self.op_time
-        op_call = self.op_call
         op_cimpl = self.op_cimpl

-        op_flops = {}
-        for a,t in op_time.items():
-            if hasattr(a,'flops'):
-                op_flops[a]=a.flops*op_call[a]/t/1e6

         self.print_summary_("print_summary",local_time, compile_time, fct_call_time, fct_call,
-                            apply_time, apply_call, op_time, op_call, op_cimpl,
-                            op_flops, n_apply_to_print, n_ops_to_print)
+                            apply_time, apply_call, op_cimpl,
+                            n_apply_to_print, n_ops_to_print)
     def print_diff_summary(self, other, n_apply_to_print=15, n_ops_to_print=20):
@@ -153,42 +158,23 @@ class ProfileMode(Mode):
                 r[a]+=t
             return r

-        def diff_dict_flops(a_time,b_time_,a_call,b_call):
-            flops = {}
-            b_time = copy.copy(b_time_)
-            for a,ta in a_time.items():
-                tb = b_time.pop(a,0)
-                if hasattr(a,'flops'):
-                    flops[a]=a.flops*a_call[a]/ta - a.flops*b_call[a]/tb/1e6
-            #they are missing in a
-            for b,tb in b_time.items():
-                if hasattr(b,'flops'):
-                    flops[b]=b.flops*b_call[b]/tb/1e6
-            return flops
         local_time = self.local_time[0]-other.local_time[0]
         compile_time = self.compile_time-other.compile_time
         fct_call_time = diff_dict(self.fct_call_time,other.fct_call_time)
         fct_call = diff_dict(self.fct_call,other.fct_call)
         apply_time = diff_dict(self.apply_time, other.apply_time)
         apply_call = diff_dict(self.apply_call, other.apply_call)
-        op_time = diff_dict(self.op_time, other.op_time)
-        op_call = diff_dict(self.op_call, other.op_call)
         op_cimpl = self.op_cimpl and other.op_cimpl
-        op_flops = diff_dict_flops(self.op_time, other.op_time, self.op_call, other.op_call)

         self.print_summary_("print_diff_summary",local_time, compile_time, fct_call_time, fct_call,
-                            apply_time, apply_call, op_time, op_call, op_cimpl,
-                            op_flops, n_apply_to_print=n_apply_to_print,
+                            apply_time, apply_call, op_cimpl,
+                            n_apply_to_print=n_apply_to_print,
                             n_ops_to_print=n_ops_to_print, print_apply=False)
     @staticmethod
     def print_summary_(fct_name, local_time, compile_time, fct_call_time, fct_call,
-                       apply_time, apply_call, op_time, op_call, op_cimpl,
-                       op_flops=None, n_apply_to_print=15, n_ops_to_print=20, print_apply=True):
+                       apply_time, apply_call, op_cimpl,
+                       n_apply_to_print=15, n_ops_to_print=20, print_apply=True):
         """
         do the actual printing of print_summary and print_diff_summary.
@@ -218,6 +204,19 @@ class ProfileMode(Mode):
                    sum(f for f, t, a, nb_call in atimes[n_apply_to_print:])*100,
                    sum(t for f, t, a, nb_call in atimes[n_apply_to_print:]))

+        op_time = {}
+        op_call = {}
+        for (i,a),t in apply_time.items():
+            op_time.setdefault(a, 0)
+            op_call.setdefault(a, 0)
+            op_time[a] += t
+            op_call[a] += apply_call[(i,a)]
+
+        op_flops = {}
+        for a,t in op_time.items():
+            if hasattr(a, 'flops'):
+                op_flops[a] = a.flops*op_call[a]/t/1e6

         flops_msg=''
         if op_flops:
             flops_msg=' <MFlops/s>'
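The block above rebuilds per-Op totals on demand from the per-Apply dictionaries, instead of maintaining separate `op_time`/`op_call` counters during execution as the old code did. The aggregation can be checked in isolation; a toy op object with an optional `flops` attribute is assumed here:

```python
class ToyOp(object):
    def __init__(self, name, flops=None):
        self.name = name
        if flops is not None:
            self.flops = flops  # only some ops advertise a flop count

exp_op = ToyOp('exp')
dot_op = ToyOp('dot', flops=2e6)

# per-(apply index, op) profiling data, as kept by ProfileMode
apply_time = {(0, exp_op): 0.5, (1, dot_op): 1.0, (2, dot_op): 1.0}
apply_call = {(0, exp_op): 10, (1, dot_op): 5, (2, dot_op): 5}

# aggregate per-apply entries into per-op totals, as in print_summary_
op_time = {}
op_call = {}
for (i, a), t in apply_time.items():
    op_time.setdefault(a, 0)
    op_call.setdefault(a, 0)
    op_time[a] += t
    op_call[a] += apply_call[(i, a)]

op_flops = {}
for a, t in op_time.items():
    if hasattr(a, 'flops'):
        op_flops[a] = a.flops * op_call[a] / t / 1e6  # MFlops/s
```

Deriving the totals at print time keeps the hot profiling path (the thunk wrapper) down to two dictionary increments.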
......
@@ -544,35 +544,20 @@ class Test_check_isfinite(unittest.TestCase):
         theano.tensor.TensorType.filter_checks_isfinite = self.old_val

     def test_check_isfinite(self):
-        x = theano.tensor.dvector()
+        x = theano.tensor.vector()
         f = theano.function([x], (x+2) * 5, mode='DEBUG_MODE')
+        g = theano.function([x], theano.tensor.log(x), mode='DEBUG_MODE')

         # this should work
         f(numpy.log([3, 4, 5]))

-        # this should raise InvalidValueError
-        try:
-            # insert a NaN
-            f(numpy.log([3, -4, 5]))
-            assert False
-        except debugmode.InvalidValueError:
-            pass
-        # this should raise InvalidValueError
-        try:
-            # insert a NaN and an Inf
-            f(numpy.asarray([0, 1.0, 0])/0)
-            assert False
-        except debugmode.InvalidValueError:
-            pass
-        # this should raise InvalidValueError
-        try:
-            # insert several Infs
-            f(numpy.asarray([1.0, 1.0, 1.0])/0)
-            assert False
-        except debugmode.InvalidValueError:
-            pass
+        # passing an invalid value as an input should trigger ValueError
+        self.failUnlessRaises(ValueError, f, numpy.log([3, -4, 5]))
+        self.failUnlessRaises(ValueError, f, numpy.asarray([0, 1.0, 0])/0)
+        self.failUnlessRaises(ValueError, f, numpy.asarray([1.0, 1.0, 1.0])/0)
+        # generating an invalid value internally should trigger InvalidValueError
+        self.failUnlessRaises(debugmode.InvalidValueError, g, [3, -4, 5])

         # this should disable the exception
         theano.tensor.TensorType.filter_checks_isfinite = False
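The rewritten test distinguishes two failure sites: a non-finite value passed *in* is rejected by the input filter (ValueError), while a non-finite value produced *inside* the graph (here, log of a negative number) trips DebugMode's value check (InvalidValueError). The numpy side of that distinction can be sketched without Theano:

```python
import numpy

def check_finite_input(x):
    """Input-filter style check: reject non-finite inputs with ValueError."""
    x = numpy.asarray(x, dtype='float64')
    if not numpy.isfinite(x).all():
        raise ValueError("non-finite input", x)
    return x

ok = check_finite_input(numpy.log([3, 4, 5]))  # finite values pass the filter

with numpy.errstate(invalid='ignore'):
    bad = numpy.log([3.0, -4.0, 5.0])  # internally produces a NaN

try:
    check_finite_input(bad)
    rejected = False
except ValueError:
    rejected = True
```

In the real test, ``f`` only ever sees bad values at its inputs, so it can never raise InvalidValueError; that is why the commit adds the second function ``g`` whose log node generates the NaN internally.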
......
@@ -222,7 +222,7 @@ class PureType(object):
         try:
             self.filter(a, True)
             return True
-        except TypeError:
+        except (TypeError, ValueError):
             return False

     def make_variable(self, name = None):
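`is_valid_value` simply asks the type's strict filter whether it accepts a value; since filters may now signal rejection with either TypeError (wrong kind of object) or ValueError (e.g. the new isfinite check), both must be caught. A sketch of the pattern with a toy type, not the real PureType:

```python
import math

class FiniteFloatType(object):
    """Toy type: the strict filter rejects non-floats (TypeError)
    and non-finite values (ValueError), like TensorType with
    filter_checks_isfinite enabled."""
    filter_checks_isfinite = True

    def filter(self, value, strict=False):
        if strict and not isinstance(value, float):
            raise TypeError("not a float", value)
        value = float(value)
        if self.filter_checks_isfinite and not math.isfinite(value):
            raise ValueError("non-finite value", value)
        return value

    def is_valid_value(self, a):
        try:
            self.filter(a, True)
            return True
        except (TypeError, ValueError):
            return False

t = FiniteFloatType()
```

Before this one-line change, a ValueError escaping from ``filter`` would have propagated out of ``is_valid_value`` instead of meaning "invalid".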
......
Diff collapsed.
@@ -5,14 +5,15 @@ from theano import config
 import logging, copy

 _logger_name = 'theano.sandbox.cuda'
 _logger = logging.getLogger(_logger_name)
-_logger.setLevel(logging.INFO)
-_logger.addHandler(logging.StreamHandler())
+_logger.setLevel(logging.WARNING)
+def error(*msg):
+    _logger.error('ERROR (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def warning(*msg):
-    _logger.warning(_logger_name+'WARNING: '+' '.join(str(m) for m in msg))
+    _logger.warning('WARNING (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def info(*msg):
-    _logger.info(_logger_name+'INFO: '+' '.join(str(m) for m in msg))
+    _logger.info('INFO (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def debug(*msg):
-    _logger.debug(_logger_name+'DEBUG: '+' '.join(str(m) for m in msg))
+    _logger.debug('DEBUG (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
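The reworked helpers route everything through the module's named logger and drop the unconditional StreamHandler, so the application (not the library) decides what gets printed. A self-contained sketch of the pattern; the list-collecting handler exists only so the level filtering can be observed:

```python
import logging

_logger_name = 'demo.sandbox'
_logger = logging.getLogger(_logger_name)
_logger.setLevel(logging.WARNING)
_logger.propagate = False  # keep the demo's records out of the root logger

def warning(*msg):
    _logger.warning('WARNING (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))

def info(*msg):
    # filtered out by the WARNING level unless the user lowers it
    _logger.info('INFO (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))

records = []
class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record.getMessage())

_logger.addHandler(ListHandler())
warning('cuda', 'unavailable')
info('this one is suppressed')
```

Note that the committed version of these helpers has two bugs the reconstruction above avoids: the format string ``'ERROR (%s): ' % (name, joined)`` has one placeholder for two arguments, and all four helpers call ``_logger.warning`` regardless of their nominal level.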
 # Compile cuda_ndarray.cu
@@ -63,23 +64,32 @@ if not compile_cuda_ndarray:
     except ImportError:
         compile_cuda_ndarray = True

-if compile_cuda_ndarray:
-    import nvcc_compiler
-    if not nvcc_compiler.is_nvcc_available():
-        set_cuda_disabled()
-
-    if enable_cuda:
-        code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()
-        if not os.path.exists(cuda_ndarray_loc):
-            os.makedirs(cuda_ndarray_loc)
-        nvcc_compiler.nvcc_module_compile_str('cuda_ndarray', code, location = cuda_ndarray_loc,
-                include_dirs=[cuda_path], libs=['cublas'])
-
-from cuda_ndarray.cuda_ndarray import *
+try:
+    if compile_cuda_ndarray:
+        import nvcc_compiler
+        if not nvcc_compiler.is_nvcc_available():
+            set_cuda_disabled()
+
+        if enable_cuda:
+            code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()
+            if not os.path.exists(cuda_ndarray_loc):
+                os.makedirs(cuda_ndarray_loc)
+            nvcc_compiler.nvcc_module_compile_str('cuda_ndarray', code, location = cuda_ndarray_loc,
+                    include_dirs=[cuda_path], libs=['cublas'])
+
+    from cuda_ndarray.cuda_ndarray import *
+except Exception, e:
+    error("Failed to compile cuda_ndarray.cu: %s" % str(e))
+    set_cuda_disabled()

 if enable_cuda:
+    # check if there is an old cuda_ndarray that was loaded instead of the one we compiled!
+    import cuda_ndarray.cuda_ndarray
+    if os.path.join(config.compiledir, 'cuda_ndarray', 'cuda_ndarray.so') != cuda_ndarray.cuda_ndarray.__file__:
+        _logger.warning("WARNING: cuda_ndarray was loaded from %s. This is not expected, as theano should compile it automatically for you. Do you have a directory called cuda_ndarray in your LD_LIBRARY_PATH environment variable? If so, please remove it, as it is outdated!" % cuda_ndarray.cuda_ndarray.__file__)
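Wrapping the whole compile-and-import sequence in a single try/except and disabling CUDA on any failure is the standard optional-extension pattern: the package stays importable even when the accelerated backend cannot be built. A sketch with a made-up module name (`_fake_ext_that_is_not_installed` does not exist, so the fallback path runs):

```python
import importlib

enable_ext = True

def set_ext_disabled():
    global enable_ext
    enable_ext = False

errors = []
try:
    # stands in for compiling and then importing cuda_ndarray
    mod = importlib.import_module('_fake_ext_that_is_not_installed')
except Exception as e:
    errors.append('Failed to import extension: %s' % str(e))
    set_ext_disabled()
```

Catching broad ``Exception`` (not just ImportError) is deliberate here: a compiler failure inside ``nvcc_module_compile_str`` can raise almost anything, and all of it should degrade to "CUDA disabled" rather than crash the import of theano itself.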
 from theano.sandbox.cuda.type import CudaNdarrayType
 from theano.sandbox.cuda.var import (CudaNdarrayVariable,
                                      CudaNdarrayConstant,
@@ -103,7 +113,7 @@ def use(device=config.device):
         raise ValueError("Invalid device identifier", device)
     if use.device_number is None:
         # No successful call to use() has been made yet
-        if device=="-1" or device=="CPU":
+        if device<0:
             return
         if device in [None,""]:
             device=0
@@ -134,6 +144,5 @@ def handle_shared_float32(tf):
     else:
         raise NotImplementedError('removing our handler')

 if enable_cuda and config.device.startswith('gpu'):
     use()
@@ -6,6 +6,13 @@ from theano import config
 _logger=logging.getLogger("theano.sandbox.cuda.nvcc_compiler")
 _logger.setLevel(logging.WARN)

+from theano.configparser import config, AddConfigVar, StrParam
+
+AddConfigVar('nvcc.compiler_bindir',
+        "if defined, the nvcc compiler driver will seek g++ and gcc in this directory",
+        StrParam(""))
+
 def error(*args):
     #sys.stderr.write('ERROR:'+ ' '.join(str(a) for a in args)+'\n')
     _logger.error("ERROR: "+' '.join(str(a) for a in args))
@@ -68,6 +75,8 @@ def nvcc_module_compile_str(module_name, src_code, location=None, include_dirs=[
     debug('Generating shared lib', lib_filename)
     # TODO: Why do these args cause failure on gtx285 that has 1.3 compute capability? '--gpu-architecture=compute_13', '--gpu-code=compute_13',
     cmd = ['nvcc', '-shared', '-g'] + [pa for pa in preargs if pa.startswith('-O')]
+    if config.nvcc.compiler_bindir:
+        cmd.extend(['--compiler-bindir', config.nvcc.compiler_bindir])
     cmd.extend(['-Xcompiler', ','.join(pa for pa in preargs if not pa.startswith('-O'))])
     cmd.extend('-I%s'%idir for idir in include_dirs)
     cmd.extend(['-o',lib_filename])
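The new config option only alters the nvcc command line when it is non-empty, leaving default behaviour untouched. The list-building logic is easy to verify on its own; the helper name and paths below are made up for the sketch:

```python
def build_nvcc_cmd(preargs, compiler_bindir=''):
    """Mirror of the command assembly above: -O flags go to nvcc directly,
    everything else is forwarded to the host compiler via -Xcompiler."""
    cmd = ['nvcc', '-shared', '-g'] + [pa for pa in preargs if pa.startswith('-O')]
    if compiler_bindir:
        cmd.extend(['--compiler-bindir', compiler_bindir])
    cmd.extend(['-Xcompiler', ','.join(pa for pa in preargs if not pa.startswith('-O'))])
    return cmd

default_cmd = build_nvcc_cmd(['-O3', '-fPIC'])
custom_cmd = build_nvcc_cmd(['-O3', '-fPIC'], compiler_bindir='/opt/gcc-4.4/bin')
```

The option exists because nvcc of that era only supported specific host-gcc versions; pointing `--compiler-bindir` at an older gcc lets the system default compiler stay newer than what nvcc tolerates.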
......
@@ -140,20 +140,20 @@ def test_elemwise1():
     b = tensor.fmatrix()

     #let debugmode catch any mistakes
-    print >> sys.stderr, "STARTING FUNCTION 1"
+    print >> sys.stdout, "STARTING FUNCTION 1"
     f = pfunc([b], [], updates=[(a, b**a)], mode=mode_with_gpu)
     for i, node in enumerate(f.maker.env.toposort()):
         print i, node
     f(numpy.random.rand(*shape)+0.3)

-    print >> sys.stderr, "STARTING FUNCTION 2"
+    print >> sys.stdout, "STARTING FUNCTION 2"
     #let debugmode catch any mistakes
     f = pfunc([b], [], updates=[(a, tensor.exp(b**a))], mode=mode_with_gpu)
     for i, node in enumerate(f.maker.env.toposort()):
         print i, node
     f(numpy.random.rand(*shape)+0.3)

-    print >> sys.stderr, "STARTING FUNCTION 3"
+    print >> sys.stdout, "STARTING FUNCTION 3"
     #let debugmode catch any mistakes
     f = pfunc([b], [], updates=[(a, a+b * tensor.exp(b**a))], mode=mode_with_gpu)
     f(numpy.random.rand(*shape)+0.3)
@@ -169,11 +169,11 @@ def test_elemwise2():
         f = pfunc([b], [], updates=[(a, (a+b).dimshuffle(pattern))], mode=mode_with_gpu)
         has_elemwise = False
         for i, node in enumerate(f.maker.env.toposort()):
-            print >> sys.stderr, i, node
+            print >> sys.stdout, i, node
             has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
         assert not has_elemwise
         #let debugmode catch errors
-        print >> sys.stderr, 'pattern', pattern
+        print >> sys.stdout, 'pattern', pattern
         f(rng.rand(*shape)*.3)

     shape = (3,4,5,6)
@@ -204,7 +204,7 @@ def test_elemwise3():
             b**a).dimshuffle([2,0,3,1]))], mode=mode_with_gpu)
     has_elemwise = False
     for i, node in enumerate(f.maker.env.toposort()):
-        print >> sys.stderr, i, node
+        print >> sys.stdout, i, node
         has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
     assert not has_elemwise
     #let debugmode catch errors
@@ -220,7 +220,7 @@ def test_elemwise4():
     f = pfunc([b,c], [], updates=[(a, (a+b.dimshuffle('x', 0)*c.dimshuffle(0, 'x')))], mode=mode_with_gpu)
     has_elemwise = False
     for i, node in enumerate(f.maker.env.toposort()):
-        print >> sys.stderr, i, node
+        print >> sys.stdout, i, node
         has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
     assert not has_elemwise
     #let debugmode catch errors
......
@@ -360,7 +360,7 @@ def test_subsample():

 def test_logical_shapes():
     # implement when
-    print >> sys.stderr, "INFO: test_logical_shapes not implemented (i.e. imshp_logical, kshp_logical, kshp_logical_top_aligned)"
+    print >> sys.stderr, "WARNING TODO: test_logical_shapes not implemented (i.e. imshp_logical, kshp_logical, kshp_logical_top_aligned)"

 def _test_dummy():
......
@@ -8,7 +8,7 @@ if cuda_ndarray.enable_cuda == False:
 import numpy

 def test_host_to_device():
-    print >>sys.stderr, 'starting test_host_to_dev'
+    print >>sys.stdout, 'starting test_host_to_dev'
     for shape in ((), (3,), (2,3), (3,4,5,6)):
         a = theano._asarray(numpy.random.rand(*shape), dtype='float32')
         b = cuda_ndarray.CudaNdarray(a)
@@ -53,7 +53,7 @@ def test_add():

 def test_exp():
-    print >>sys.stderr, 'starting test_exp'
+    print >>sys.stdout, 'starting test_exp'
     for shape in ((), (3,), (2,3), (1,10000000),(10,1000000), (100,100000),(1000,10000),(10000,1000)):
         a0 = theano._asarray(numpy.random.rand(*shape), dtype='float32')
         a1 = a0.copy()
@@ -74,25 +74,25 @@ def test_exp():

 def test_copy():
-    print >>sys.stderr, 'starting test_copy'
+    print >>sys.stdout, 'starting test_copy'
     shape = (5,)
     a = theano._asarray(numpy.random.rand(*shape), dtype='float32')

-    print >>sys.stderr, '.. creating device object'
+    print >>sys.stdout, '.. creating device object'
     b = cuda_ndarray.CudaNdarray(a)

-    print >>sys.stderr, '.. copy'
+    print >>sys.stdout, '.. copy'
     c = copy.copy(b)
-    print >>sys.stderr, '.. deepcopy'
+    print >>sys.stdout, '.. deepcopy'
     d = copy.deepcopy(b)

-    print >>sys.stderr, '.. comparisons'
+    print >>sys.stdout, '.. comparisons'
     assert numpy.allclose(a, numpy.asarray(b))
     assert numpy.allclose(a, numpy.asarray(c))
     assert numpy.allclose(a, numpy.asarray(d))

 def test_dot():
-    print >>sys.stderr, 'starting test_dot'
+    print >>sys.stdout, 'starting test_dot'
     a0 = theano._asarray(numpy.random.rand(4, 7), dtype='float32')
     a1 = theano._asarray(numpy.random.rand(7, 6), dtype='float32')
@@ -101,7 +101,7 @@ def test_dot():
     assert numpy.allclose(numpy.dot(a0, a1), cuda_ndarray.dot(b0, b1))

-    print >> sys.stderr, 'WARNING test_dot: not testing all 8 transpose cases of dot'
+    print >> sys.stderr, 'WARNING TODO test_dot: not testing all 8 transpose cases of dot'

 def test_sum():
     shape = (2,3)
@@ -147,7 +147,7 @@ def test_reshape():
             ]

     def subtest(shape_1, shape_2):
-        #print >> sys.stderr, "INFO: shapes", shape_1, shape_2
+        #print >> sys.stdout, "INFO: shapes", shape_1, shape_2
         a = theano._asarray(numpy.random.rand(*shape_1), dtype='float32')
         b = cuda_ndarray.CudaNdarray(a)
......
...
@@ -1125,7 +1125,7 @@ inv = Inv(upgrade_to_float, name = 'inv')
 class Log(UnaryScalarOp):
     """ log base e """
     def impl(self, x):
-        return math.log(x)
+        return numpy.log(x)
     def grad(self, (x, ), (gz, )):
         if x.type in grad_types:
             return gz / x,
...
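The `math.log` to `numpy.log` change matters when `impl` receives an ndarray rather than a Python scalar: `math.log` only accepts scalars, while `numpy.log` applies elementwise to arrays (and handles 0-d arrays). A minimal sketch of the difference:

```python
import math
import numpy

x = numpy.array([1.0, numpy.e, numpy.e ** 2])

# numpy.log works elementwise on arrays.
print(numpy.log(x))  # approximately [0., 1., 2.]

# math.log only accepts scalars; a multi-element array raises TypeError.
try:
    math.log(x)
except TypeError as e:
    print('math.log rejects arrays:', e)
```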
...
@@ -330,6 +330,7 @@ class TensorType(Type):
         self.broadcastable = tuple(broadcastable)
         self.dtype_specs() # error checking is done there
         self.name = name
+        self.numpy_dtype = numpy.dtype(self.dtype)
         if shape is None:
             #backport self.shape = tuple((1 if b else None) for b in self.broadcastable)
             l=[]
@@ -360,16 +361,16 @@ class TensorType(Type):
         This function is not meant to be called in user code. It is for
         `Linker` instances to use when running a compiled graph.
         """
-        _data = data
-        if strict:
+        if (type(data) is numpy.ndarray) and (data.dtype is self.numpy_dtype):
+            pass # fall through to ndim check
+        elif strict:
+            # this is its own subcase that doesn't fall through to anything
             if not isinstance(data, numpy.ndarray):
                 raise TypeError("%s expected a ndarray object.", data, type(data))
             if not str(data.dtype) == self.dtype:
                 raise TypeError("%s expected a ndarray object with dtype = %s (got %s)." % (self, self.dtype, data.dtype))
             if not data.ndim == self.ndim:
                 raise TypeError("%s expected a ndarray object with %s dimensions (got %s)." % (self, self.ndim, data.ndim))
-            if self.filter_checks_isfinite and (not numpy.all(numpy.isfinite(data))):
-                raise TypeError("non-finite elements not allowed")
             if TensorType.use_shape:
                 for si, di in zip(self.shape, data.shape):
@@ -378,11 +379,17 @@ class TensorType(Type):
                         self, self.shape, data.shape))
             return data
         else:
-            data = theano._asarray(data, dtype = self.dtype)
-            if not self.ndim == data.ndim:
+            data = theano._asarray(data, dtype = self.dtype) #TODO - consider to pad shape with ones
+            # to make it consistent with self.broadcastable... like vector->row type thing
+            if self.ndim != data.ndim:
                 raise TypeError("Wrong number of dimensions: expected %s, got %s with shape %s." % (self.ndim, data.ndim, data.shape), data)
-            if any(b and d != 1 for d, b in zip(data.shape, self.broadcastable)):
-                raise TypeError("Non-unit value on shape on a broadcastable dimension.", data.shape, self.broadcastable)
+            i = 0
+            for b in self.broadcastable:
+                if b and data.shape[i] != 1:
+                    raise TypeError("Non-unit value on shape on a broadcastable dimension.", data.shape, self.broadcastable)
+                i += 1
+            if self.filter_checks_isfinite and (not numpy.all(numpy.isfinite(data))):
+                raise ValueError("non-finite elements not allowed")
             return data

     def dtype_specs(self):
@@ -1826,14 +1833,16 @@ class Default(gof.Op):
     view_map = {0: [0]}
     def make_node(self, x, default):
         x, default = as_tensor_variable(x), as_tensor_variable(default)
-        assert x.type == default.type
+        if x.type != default.type:
+            raise TypeError('Both default() arguments must have same type', x, default)
         return gof.Apply(self, [x, default], [default.type()])
     def perform(self, node, (x, default), (out, )):
         if x is None:
+            # why copy? Theano can't yet understand out[0] being a view of either x or y,
+            # so we can be a view of x, but only a copy of y.
             out[0] = default.copy()
         else:
             out[0] = x
-        #backport out[0] = default.copy() if x is None else x

 default = Default()
 setdefault = default # legacy
...
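The new comment in `Default.perform` is about aliasing: handing back the stored `default` array itself would let the caller mutate it in place, corrupting the default for every later call, so it must be copied. A small plain-numpy illustration of the hazard, outside Theano:

```python
import numpy

default = numpy.array([1.0, 2.0, 3.0])

# Returning the default without copying aliases the caller's result
# to the stored default value.
alias = default
alias[0] = 99.0
print(default[0])  # 99.0 -- the stored default was silently mutated

# Copying, as Default.perform does, keeps the stored default intact.
default = numpy.array([1.0, 2.0, 3.0])
safe = default.copy()
safe[0] = 99.0
print(default[0])  # 1.0
```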