Commit 86ec00c0 authored by Pascal Lamblin

merge

@@ -58,7 +58,7 @@ file and run it.
     import numpy
     import time
-    vlen = 100000
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
     iters = 1000
     rng = numpy.random.RandomState(22)
@@ -74,28 +74,31 @@ The program just computes the exp() of a bunch of random numbers.
 Note that we use the `shared` function to
 make sure that the input `x` is stored on the graphics device.
-If I run this program (in thing.py) with device=cpu, my computer takes a little over 3 seconds,
-whereas on the GPU it takes just over 0.2 seconds. Note that the results are close but not
-identical! The GPU will not always produce the exact same floating-point numbers as the CPU.
+If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds,
+whereas on the GPU it takes just over 0.4 seconds. Note that the results are close but not
+identical! The GPU will not always produce the exact same floating-point numbers as the CPU.
+As a point of reference, a loop that calls ``numpy.exp(x.value)`` also takes about 7 seconds.

 .. code-block:: text

     $ THEANO_FLAGS=mode=FAST_RUN,device=cpu python thing.py
-    Looping 100 times took 3.12647008896 seconds
-    Result is [ 1.23178032  1.61879341  1.52278065 ...,  1.74085572  2.55530456  1.88906098]
+    Looping 100 times took 7.17374897003 seconds
+    Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753  1.62323285]
     bergstra@tikuanyin:~/tmp$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0 python thing.py
     Using gpu device 0: GeForce GTX 285
-    Looping 100 times took 0.217401981354 seconds
-    Result is [ 1.23178029  1.61879349  1.52278066 ...,  1.74085569  2.55530477  1.88906097]
+    Looping 100 times took 0.418929815292 seconds
+    Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761  1.62323296]
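The ``numpy.exp`` reference point mentioned in the new text is easy to reproduce without Theano at all. The sketch below is an assumed stand-alone version of that CPU baseline, written for modern Python (the diff's snippets are Python 2), with fewer iterations than the document's 1000 so it finishes quickly:

```python
import time

import numpy

vlen = 10 * 30 * 768  # same problem size as the Theano example
iters = 100           # fewer iterations than the doc's 1000, for a quick check

rng = numpy.random.RandomState(22)
x = numpy.asarray(rng.rand(vlen), dtype='float32')

t0 = time.time()
for i in range(iters):
    r = numpy.exp(x)  # pure-numpy equivalent of the compiled Theano function
dt = time.time() - t0

print('Looping %d times took %f seconds' % (iters, dt))
```

This is the "point of reference" loop only; the measured times in the document come from the author's GTX 285 machine and will differ on other hardware.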
 Returning a handle to device-allocated data
 -------------------------------------------

 The speedup is not greater in the example above because the function is
-returning its result as a numpy ndarray (which has already copied from the
-device to the host). This is what makes it so easy to swap in device=gpu0, but
-if you want to be less portable, you can see a bigger speedup by changing
+returning its result as a numpy ndarray which has already been copied from the
+device to the host for your convenience. This is what makes it so easy to swap in device=gpu0, but
+if you don't mind being less portable, you might prefer to see a bigger speedup by changing
 the graph to express a computation with a GPU-stored result. The gpu_from_host
-op means "copy the input from the host to the gpu" and it is optimized away
+Op means "copy the input from the host to the gpu" and it is optimized away
 after the T.exp(x) is replaced by a GPU version of exp().

 .. code-block:: python
@@ -105,7 +108,7 @@ after the T.exp(x) is replaced by a GPU version of exp().
     import numpy
     import time
-    vlen = 100000
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
     iters = 1000
     rng = numpy.random.RandomState(22)
@@ -123,17 +126,71 @@ The output from this program is

 .. code-block:: text

     Using gpu device 0: GeForce GTX 285
-    Looping 100 times took 0.173671007156 seconds
+    Looping 100 times took 0.185714006424 seconds
     Result is <CudaNdarray object at 0x3e9e970>
-    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  1.74085569  2.55530477  1.88906097]
+    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761  1.62323296]

-Here we've shaved off about 20% of the run-time by simply not copying the
+Here we've shaved off about 50% of the run-time by simply not copying the
 resulting array back to the host.
 The object returned by each function call is now not a numpy array but a
 "CudaNdarray" which can be converted to a numpy ndarray by the normal
 numpy casting mechanism.
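The "normal numpy casting mechanism" referred to here is numpy's array protocol: any object that exposes ``__array__`` can be handed to ``numpy.asarray``, which triggers the device-to-host copy. The class below is a hypothetical stand-in for CudaNdarray (not the real class, which is implemented in C), just to illustrate the conversion path:

```python
import numpy

class FakeDeviceArray(object):
    """Hypothetical stand-in for CudaNdarray: pretends to hold data on the
    device and exposes the numpy array protocol for host conversion."""
    def __init__(self, host_data):
        self._data = numpy.asarray(host_data, dtype='float32')

    def __array__(self, dtype=None, copy=None):
        # numpy.asarray() calls this to "copy the data back to the host"
        if dtype is None:
            return self._data
        return self._data.astype(dtype)

d = FakeDeviceArray([1.0, 2.0, 3.0])
h = numpy.asarray(d)  # conversion via the array protocol
```

The real CudaNdarray performs an actual cudaMemcpy at this point, which is exactly the cost the ``gpu_from_host`` trick avoids inside the loop.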
+Running the GPU at Full Speed
+------------------------------
+
+To really get maximum performance in this simple example, we need to use an :class:`Out`
+instance to tell Theano not to copy the output it returns to us. Theano allocates memory for
+internal use, like a working buffer, but by default it will never return a result that is
+allocated in the working buffer. This is normally what you want, but our example is so simple
+that it has the unwanted side-effect of really slowing things down.
+
+..
+  TODO:
+  The story here about copying and working buffers is misleading and potentially not correct
+  ... why exactly does borrow=True cut 75% of the runtime ???
+
+.. code-block:: python
+
+    from theano import function, config, shared, sandbox, Out
+    import theano.tensor as T
+    import numpy
+    import time
+
+    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
+    iters = 1000
+
+    rng = numpy.random.RandomState(22)
+    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
+    f = function([],
+            Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
+                borrow=True))
+    t0 = time.time()
+    for i in xrange(iters):
+        r = f()
+    print 'Looping 100 times took', time.time() - t0, 'seconds'
+    print 'Result is', r
+    print 'Numpy result is', numpy.asarray(r)
+
+Running this version of the code takes just under 0.05 seconds, over 140x faster than
+the CPU implementation!
+
+.. code-block:: text
+
+    Using gpu device 0: GeForce GTX 285
+    Looping 100 times took 0.0497219562531 seconds
+    Result is <CudaNdarray object at 0x31eeaf0>
+    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761  1.62323296]
+
+This version of the code, using ``borrow=True``, is slightly less safe, because if we had saved
+the `r` returned from one function call, we would have to take care and remember that its value
+might be over-written by a subsequent function call. Although borrow=True makes a dramatic
+difference in this example, be careful! The advantage of
+borrow=True is much weaker in larger graphs, and there is a lot of potential for making a
+mistake by failing to account for the resulting memory aliasing.
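The aliasing hazard described above can be reproduced in plain numpy: if a function returns its internal working buffer instead of a copy (the moral equivalent of ``borrow=True``), a value saved from one call is silently clobbered by the next. This is a toy sketch, not Theano code:

```python
import numpy

_buf = numpy.zeros(3)  # shared working buffer, like Theano's internal storage

def f_borrowed(x):
    """Writes into the shared buffer and returns it WITHOUT copying."""
    numpy.exp(numpy.asarray(x, dtype=float), out=_buf)
    return _buf

r1 = f_borrowed([0.0, 0.0, 0.0])   # exp -> [1, 1, 1]
saved = r1                          # an alias of the buffer, not a copy!
r2 = f_borrowed([1.0, 1.0, 1.0])   # overwrites the same buffer
# 'saved' no longer holds [1, 1, 1]; saved.copy() at call time would avoid this.
```

Calling ``saved.copy()`` right after the first call is the caller-side fix; the library-side fix is what Theano does by default, namely never returning its working buffer.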
 What can be accelerated on the GPU?
 ------------------------------------
......
@@ -428,9 +428,20 @@ class Function(object):
         # Reinitialize each container's 'provided' counter
         for c in self.input_storage:
             c.provided = 0

         # Set positional arguments
-        for i, arg in enumerate(args):
-            self[i] = arg
+        i = 0
+        for arg in args:
+            #TODO: provide a Param option for skipping the filter if we
+            #      really want speed.
+            s = self.input_storage[i]
+            if arg is None:
+                s.storage[0] = arg
+            else:
+                s.storage[0] = s.type.filter(arg, strict=s.strict)
+            s.provided += 1
+            i += 1

         # Set keyword arguments
         for k, arg in kwargs.iteritems():
             self[k] = arg
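The rewritten argument loop above inlines what ``self[i] = arg`` used to do through ``__setitem__``: fetch the input's storage container, run the type's ``filter`` on the value, and bump the ``provided`` counter. A self-contained sketch of that container/filter pattern (class names here are simplified stand-ins, not Theano's actual classes):

```python
class Container(object):
    """Simplified input-storage cell: a one-element storage list plus bookkeeping."""
    def __init__(self, type_, strict=False):
        self.type = type_
        self.strict = strict
        self.storage = [None]
        self.provided = 0

class FloatType(object):
    """Toy type whose filter validates/coerces values, like TensorType.filter."""
    def filter(self, value, strict=False):
        if strict and not isinstance(value, float):
            raise TypeError("strict mode: expected a float", value)
        return float(value)

def set_positional_args(input_storage, args):
    i = 0
    for arg in args:
        s = input_storage[i]
        if arg is None:
            s.storage[0] = arg        # None bypasses filtering entirely
        else:
            s.storage[0] = s.type.filter(arg, strict=s.strict)
        s.provided += 1
        i += 1

storage = [Container(FloatType()), Container(FloatType())]
set_positional_args(storage, [3, None])
```

The point of the inlining in the commit is to skip a Python-level ``__setitem__`` dispatch per argument on the hot call path.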
@@ -448,7 +459,9 @@ class Function(object):
                     self.inv_finder[c]))

         # Do the actual work
+        t0_fn = time.time()
         self.fn()
+        dt_fn = time.time() - t0_fn

         # Retrieve the values that were computed
         outputs = [x.data for x in self.output_storage]
@@ -486,6 +499,9 @@ class Function(object):
             self.maker.mode.fct_call_time[self.name] += dt_call
             self.maker.mode.fct_call[self.name] += 1
+            self.maker.mode.call_time += dt_call
+            self.maker.mode.fn_time += dt_fn

         if self.return_none:
             return None
         elif self.unpack_single and len(outputs) == 1:
......
@@ -172,6 +172,8 @@ class Mode(object):
         if isinstance(optimizer, gof.Query):
             self.provided_optimizer = optimizer
         self._optimizer = optimizer
+        self.call_time = 0
+        self.fn_time = 0

     def __str__(self):
         return "Mode(linker = %s, optimizer = %s)" % (self.provided_linker, self.provided_optimizer)
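The new `call_time` / `fn_time` counters split total call overhead from the time spent inside the compiled thunk, mirroring the ``t0_fn``/``dt_fn`` bracketing added to ``Function.__call__`` above. A sketch of that accounting pattern in plain Python (the `pre`/`post` hooks stand in for input filtering and output retrieval):

```python
import time

class TimedMode(object):
    def __init__(self):
        self.call_time = 0.0  # total time spent in __call__
        self.fn_time = 0.0    # time spent in the inner compiled fn only

def timed_call(mode, fn, pre=lambda: None, post=lambda: None):
    t0 = time.time()
    pre()                      # stands in for input filtering etc.
    t0_fn = time.time()
    result = fn()              # the actual compiled work
    mode.fn_time += time.time() - t0_fn
    post()                     # stands in for output retrieval
    mode.call_time += time.time() - t0
    return result

mode = TimedMode()
out = timed_call(mode, lambda: sum(range(1000)))
```

The gap between `call_time` and `fn_time` is exactly the Python-side overhead that changes like the inlined argument loop are trying to shrink.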
......
@@ -7,6 +7,7 @@ from theano.gof.cc import OpWiseCLinker
 from theano.gof.python25 import any
 from theano import gof
 from theano.configparser import config, AddConfigVar, IntParam
+from theano.compile.function_module import FunctionMaker

 import_time = time.time()
@@ -18,44 +19,57 @@ AddConfigVar('ProfileMode.n_ops_to_print',
         "Number of ops to print by default",
         IntParam(20, lambda i: i > 0))
+class Profile_Maker(FunctionMaker):
+    def create(self, input_storage=None, trustme=False):
+        ret = super(Profile_Maker, self).create(input_storage, trustme)
+        for i, node in enumerate(ret.maker.env.toposort()):
+            self.mode.apply_time[(i, node.op)] = 0.0
+            self.mode.apply_call[(i, node.op)] = 0
+            # self.mode.op_cimpl[node.op] =
+        return ret
+
 class ProfileMode(Mode):
     def __init__(self, linker=default_linker, optimizer=default_optimizer):
         local_time = [0.0]
         apply_time = {}
         apply_call = {}
-        op_time = {}
         op_cimpl = {}
-        op_call = {}
         compile_time = 0 #time passed in theano.function()
         fct_call_time = {}#time passed inside theano fct call including op time.
         fct_call = {}
         self.__setstate__((linker, optimizer, local_time,
                            apply_time, apply_call,
-                           op_time, op_cimpl, op_call,
+                           op_cimpl,
                            compile_time, fct_call_time, fct_call))
+
+    def function_maker(self, i, o, m, *args, **kwargs):
+        """Return an instance of `Profile_Maker` which initializes the counters."""
+        assert m is self
+        return Profile_Maker(i, o, self, *args, **kwargs)
     def __getstate__(self):
         #print "__getstate__",self.provided_linker,self.provided_optimizer
         return (self.provided_linker, self.provided_optimizer, self.local_time,
                 self.apply_time, self.apply_call,
-                self.op_time, self.op_cimpl, self.op_call, self.compile_time, self.fct_call_time, self.fct_call)
+                self.op_cimpl, self.compile_time, self.fct_call_time, self.fct_call)

     def __setstate__(self, (linker, optimizer, local_time,
                             apply_time, apply_call,
-                            op_time, op_cimpl, op_call,
+                            op_cimpl,
                             compile_time, fct_call_time, fct_call)):
         self.local_time = local_time
         self.apply_time = apply_time
         self.apply_call = apply_call
-        self.op_time = op_time
         self.op_cimpl = op_cimpl
-        self.op_call = op_call
         self.compile_time = compile_time
         self.fct_call_time = fct_call_time
         self.fct_call = fct_call
+        self.call_time = 0
+        self.fn_time = 0
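Dropping `op_time`/`op_call` from the state tuple changes the pickle format, so ``__getstate__`` and ``__setstate__`` must stay in lockstep, and counters that are not pickled get re-initialized on load. A minimal sketch of this tuple-state pattern with a toy class (not ProfileMode itself):

```python
import pickle

class Stats(object):
    def __init__(self):
        self.apply_time = {}
        self.apply_call = {}
        self.compile_time = 0

    def __getstate__(self):
        # state is a plain tuple; adding or removing a field changes the format
        return (self.apply_time, self.apply_call, self.compile_time)

    def __setstate__(self, state):
        (self.apply_time, self.apply_call, self.compile_time) = state
        # counters that are NOT part of the pickled state get reset here
        self.call_time = 0
        self.fn_time = 0

s = Stats()
s.apply_time[(0, 'exp')] = 1.5
s2 = pickle.loads(pickle.dumps(s))
```

One consequence worth noting: a ProfileMode pickled before this commit cannot be unpickled by the new ``__setstate__``, since the tuple arities differ.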
         def blah(i, node, th):
             if hasattr(th, 'cthunk'):
@@ -72,11 +86,9 @@ class ProfileMode(Mode):
             dt = time.time() - t0
             local_time[0] += dt
-            apply_time[(i,node.op)] = apply_time.get((i,node.op), 0.0) + dt
-            apply_call[(i,node.op)] = apply_call.get((i,node.op), 0) + 1
-            op_time[node.op] = op_time.get(node.op, 0.0) + dt
+            apply_time[(i,node.op)] += dt
+            apply_call[(i,node.op)] += 1
             op_cimpl[node.op] = hasattr(th, 'cthunk')
-            op_call[node.op] = op_call.get(node.op,0) + 1

         self.provided_linker = linker
@@ -113,18 +125,11 @@ class ProfileMode(Mode):
         fct_call = self.fct_call
         apply_time = self.apply_time
         apply_call = self.apply_call
-        op_time = self.op_time
-        op_call = self.op_call
         op_cimpl = self.op_cimpl

-        op_flops = {}
-        for a,t in op_time.items():
-            if hasattr(a,'flops'):
-                op_flops[a]=a.flops*op_call[a]/t/1e6

         self.print_summary_("print_summary",local_time, compile_time, fct_call_time, fct_call,
-                            apply_time, apply_call, op_time, op_call, op_cimpl,
-                            op_flops, n_apply_to_print, n_ops_to_print)
+                            apply_time, apply_call, op_cimpl,
+                            n_apply_to_print, n_ops_to_print)
     def print_diff_summary(self, other, n_apply_to_print=15, n_ops_to_print=20):
@@ -153,42 +158,23 @@ class ProfileMode(Mode):
                 r[a]+=t
             return r

-        def diff_dict_flops(a_time,b_time_,a_call,b_call):
-            flops = {}
-            b_time = copy.copy(b_time_)
-            for a,ta in a_time.items():
-                tb = b_time.pop(a,0)
-                if hasattr(a,'flops'):
-                    flops[a]=a.flops*a_call[a]/ta - a.flops*b_call[a]/tb/1e6
-            #they are missing in a
-            for b,tb in b_time.items():
-                if hasattr(b,'flops'):
-                    flops[b]=b.flops*b_call[b]/tb/1e6
-            return flops
         local_time = self.local_time[0]-other.local_time[0]
         compile_time = self.compile_time-other.compile_time
         fct_call_time = diff_dict(self.fct_call_time,other.fct_call_time)
         fct_call = diff_dict(self.fct_call,other.fct_call)
         apply_time = diff_dict(self.apply_time, other.apply_time)
         apply_call = diff_dict(self.apply_call, other.apply_call)
-        op_time = diff_dict(self.op_time, other.op_time)
-        op_call = diff_dict(self.op_call, other.op_call)
         op_cimpl = self.op_cimpl and other.op_cimpl
-        op_flops = diff_dict_flops(self.op_time, other.op_time, self.op_call, other.op_call)

         self.print_summary_("print_diff_summary",local_time, compile_time, fct_call_time, fct_call,
-                            apply_time, apply_call, op_time, op_call, op_cimpl,
-                            op_flops, n_apply_to_print=n_apply_to_print,
+                            apply_time, apply_call, op_cimpl,
+                            n_apply_to_print=n_apply_to_print,
                             n_ops_to_print=n_ops_to_print, print_apply=False)
     @staticmethod
     def print_summary_(fct_name, local_time, compile_time, fct_call_time, fct_call,
-                       apply_time, apply_call, op_time, op_call, op_cimpl,
-                       op_flops=None, n_apply_to_print=15, n_ops_to_print=20, print_apply=True):
+                       apply_time, apply_call, op_cimpl,
+                       n_apply_to_print=15, n_ops_to_print=20, print_apply=True):
         """
         do the actual printing of print_summary and print_diff_summary.
@@ -218,6 +204,19 @@ class ProfileMode(Mode):
                    sum(f for f, t, a, nb_call in atimes[n_apply_to_print:])*100,
                    sum(t for f, t, a, nb_call in atimes[n_apply_to_print:]))

+        op_time = {}
+        op_call = {}
+        for (i,a),t in apply_time.items():
+            op_time.setdefault(a, 0)
+            op_call.setdefault(a, 0)
+            op_time[a] += t
+            op_call[a] += apply_call[(i,a)]
+
+        op_flops = {}
+        for a,t in op_time.items():
+            if hasattr(a, 'flops'):
+                op_flops[a] = a.flops*op_call[a]/t/1e6

         flops_msg=''
         if op_flops:
             flops_msg=' <MFlops/s>'
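The block above rebuilds per-Op totals on demand from the per-Apply dictionaries, instead of maintaining separate `op_time`/`op_call` counters during execution as the old code did. The aggregation can be checked in isolation; a toy op object with an optional `flops` attribute is assumed here:

```python
class ToyOp(object):
    def __init__(self, name, flops=None):
        self.name = name
        if flops is not None:
            self.flops = flops  # only some ops advertise a flop count

exp_op = ToyOp('exp')
dot_op = ToyOp('dot', flops=2e6)

# per-(apply index, op) profiling data, as kept by ProfileMode
apply_time = {(0, exp_op): 0.5, (1, dot_op): 1.0, (2, dot_op): 1.0}
apply_call = {(0, exp_op): 10, (1, dot_op): 5, (2, dot_op): 5}

# aggregate per-apply entries into per-op totals, as in print_summary_
op_time = {}
op_call = {}
for (i, a), t in apply_time.items():
    op_time.setdefault(a, 0)
    op_call.setdefault(a, 0)
    op_time[a] += t
    op_call[a] += apply_call[(i, a)]

op_flops = {}
for a, t in op_time.items():
    if hasattr(a, 'flops'):
        op_flops[a] = a.flops * op_call[a] / t / 1e6  # MFlops/s
```

Deriving the totals at print time keeps the hot profiling path (the thunk wrapper) down to two dictionary increments.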
......
@@ -544,35 +544,20 @@ class Test_check_isfinite(unittest.TestCase):
         theano.tensor.TensorType.filter_checks_isfinite = self.old_val

     def test_check_isfinite(self):
-        x = theano.tensor.dvector()
+        x = theano.tensor.vector()
         f = theano.function([x], (x+2) * 5, mode='DEBUG_MODE')
+        g = theano.function([x], theano.tensor.log(x), mode='DEBUG_MODE')

         # this should work
         f(numpy.log([3, 4, 5]))

-        # this should raise InvalidValueError
-        try:
-            # insert a NaN
-            f(numpy.log([3, -4, 5]))
-            assert False
-        except debugmode.InvalidValueError:
-            pass
-        # this should raise InvalidValueError
-        try:
-            # insert a NaN and an Inf
-            f(numpy.asarray([0, 1.0, 0])/0)
-            assert False
-        except debugmode.InvalidValueError:
-            pass
-        # this should raise InvalidValueError
-        try:
-            # insert several Infs
-            f(numpy.asarray([1.0, 1.0, 1.0])/0)
-            assert False
-        except debugmode.InvalidValueError:
-            pass
+        # passing an invalid value as an input should trigger ValueError
+        self.failUnlessRaises(ValueError, f, numpy.log([3, -4, 5]))
+        self.failUnlessRaises(ValueError, f, numpy.asarray([0, 1.0, 0])/0)
+        self.failUnlessRaises(ValueError, f, numpy.asarray([1.0, 1.0, 1.0])/0)
+        # generating an invalid value internally should trigger InvalidValueError
+        self.failUnlessRaises(debugmode.InvalidValueError, g, [3, -4, 5])

         # this should disable the exception
         theano.tensor.TensorType.filter_checks_isfinite = False
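The rewritten test distinguishes two failure sites: a non-finite value passed *in* is rejected by the input filter (ValueError), while a non-finite value produced *inside* the graph (here, log of a negative number) trips DebugMode's value check (InvalidValueError). The numpy side of that distinction can be sketched without Theano:

```python
import numpy

def check_finite_input(x):
    """Input-filter style check: reject non-finite inputs with ValueError."""
    x = numpy.asarray(x, dtype='float64')
    if not numpy.isfinite(x).all():
        raise ValueError("non-finite input", x)
    return x

ok = check_finite_input(numpy.log([3, 4, 5]))  # finite values pass the filter

with numpy.errstate(invalid='ignore'):
    bad = numpy.log([3.0, -4.0, 5.0])  # internally produces a NaN

try:
    check_finite_input(bad)
    rejected = False
except ValueError:
    rejected = True
```

In the real test, ``f`` only ever sees bad values at its inputs, so it can never raise InvalidValueError; that is why the commit adds the second function ``g`` whose log node generates the NaN internally.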
......
@@ -222,7 +222,7 @@ class PureType(object):
         try:
             self.filter(a, True)
             return True
-        except TypeError:
+        except (TypeError, ValueError):
             return False

     def make_variable(self, name = None):
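`is_valid_value` simply asks the type's strict filter whether it accepts a value; since filters may now signal rejection with either TypeError (wrong kind of object) or ValueError (e.g. the new isfinite check), both must be caught. A sketch of the pattern with a toy type, not the real PureType:

```python
import math

class FiniteFloatType(object):
    """Toy type: the strict filter rejects non-floats (TypeError)
    and non-finite values (ValueError), like TensorType with
    filter_checks_isfinite enabled."""
    filter_checks_isfinite = True

    def filter(self, value, strict=False):
        if strict and not isinstance(value, float):
            raise TypeError("not a float", value)
        value = float(value)
        if self.filter_checks_isfinite and not math.isfinite(value):
            raise ValueError("non-finite value", value)
        return value

    def is_valid_value(self, a):
        try:
            self.filter(a, True)
            return True
        except (TypeError, ValueError):
            return False

t = FiniteFloatType()
```

Before this one-line change, a ValueError escaping from ``filter`` would have propagated out of ``is_valid_value`` instead of meaning "invalid".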
......
Diff collapsed.
@@ -5,14 +5,15 @@ from theano import config
 import logging, copy

 _logger_name = 'theano.sandbox.cuda'
 _logger = logging.getLogger(_logger_name)
-_logger.setLevel(logging.INFO)
-_logger.addHandler(logging.StreamHandler())
+_logger.setLevel(logging.WARNING)
+def error(*msg):
+    _logger.error('ERROR (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def warning(*msg):
-    _logger.warning(_logger_name+'WARNING: '+' '.join(str(m) for m in msg))
+    _logger.warning('WARNING (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def info(*msg):
-    _logger.info(_logger_name+'INFO: '+' '.join(str(m) for m in msg))
+    _logger.info('INFO (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
 def debug(*msg):
-    _logger.debug(_logger_name+'DEBUG: '+' '.join(str(m) for m in msg))
+    _logger.debug('DEBUG (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))
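The reworked helpers route everything through the module's named logger and drop the unconditional StreamHandler, so the application (not the library) decides what gets printed. A self-contained sketch of the pattern; the list-collecting handler exists only so the level filtering can be observed:

```python
import logging

_logger_name = 'demo.sandbox'
_logger = logging.getLogger(_logger_name)
_logger.setLevel(logging.WARNING)
_logger.propagate = False  # keep the demo's records out of the root logger

def warning(*msg):
    _logger.warning('WARNING (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))

def info(*msg):
    # filtered out by the WARNING level unless the user lowers it
    _logger.info('INFO (%s): %s' % (_logger_name, ' '.join(str(m) for m in msg)))

records = []
class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record.getMessage())

_logger.addHandler(ListHandler())
warning('cuda', 'unavailable')
info('this one is suppressed')
```

Note that the committed version of these helpers has two bugs the reconstruction above avoids: the format string ``'ERROR (%s): ' % (name, joined)`` has one placeholder for two arguments, and all four helpers call ``_logger.warning`` regardless of their nominal level.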
 # Compile cuda_ndarray.cu
@@ -63,23 +64,32 @@ if not compile_cuda_ndarray:
     except ImportError:
         compile_cuda_ndarray = True

-if compile_cuda_ndarray:
-    import nvcc_compiler
-    if not nvcc_compiler.is_nvcc_available():
-        set_cuda_disabled()
-
-    if enable_cuda:
-        code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()
-        if not os.path.exists(cuda_ndarray_loc):
-            os.makedirs(cuda_ndarray_loc)
-        nvcc_compiler.nvcc_module_compile_str('cuda_ndarray', code, location = cuda_ndarray_loc,
-                include_dirs=[cuda_path], libs=['cublas'])
-
-from cuda_ndarray.cuda_ndarray import *
+try:
+    if compile_cuda_ndarray:
+        import nvcc_compiler
+        if not nvcc_compiler.is_nvcc_available():
+            set_cuda_disabled()
+
+        if enable_cuda:
+            code = open(os.path.join(cuda_path, "cuda_ndarray.cu")).read()
+            if not os.path.exists(cuda_ndarray_loc):
+                os.makedirs(cuda_ndarray_loc)
+            nvcc_compiler.nvcc_module_compile_str('cuda_ndarray', code, location = cuda_ndarray_loc,
+                    include_dirs=[cuda_path], libs=['cublas'])
+
+    from cuda_ndarray.cuda_ndarray import *
+except Exception, e:
+    error("Failed to compile cuda_ndarray.cu: %s" % str(e))
+    set_cuda_disabled()

 if enable_cuda:
+    # check if there is an old cuda_ndarray that was loaded instead of the one we compiled!
+    import cuda_ndarray.cuda_ndarray
+    if os.path.join(config.compiledir, 'cuda_ndarray', 'cuda_ndarray.so') != cuda_ndarray.cuda_ndarray.__file__:
+        _logger.warning("WARNING: cuda_ndarray was loaded from %s. This is not expected, as theano should compile it automatically for you. Do you have a directory called cuda_ndarray in your LD_LIBRARY_PATH environment variable? If so, please remove it, as it is outdated!" % cuda_ndarray.cuda_ndarray.__file__)
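Wrapping the whole compile-and-import sequence in a single try/except and disabling CUDA on any failure is the standard optional-extension pattern: the package stays importable even when the accelerated backend cannot be built. A sketch with a made-up module name (`_fake_ext_that_is_not_installed` does not exist, so the fallback path runs):

```python
import importlib

enable_ext = True

def set_ext_disabled():
    global enable_ext
    enable_ext = False

errors = []
try:
    # stands in for compiling and then importing cuda_ndarray
    mod = importlib.import_module('_fake_ext_that_is_not_installed')
except Exception as e:
    errors.append('Failed to import extension: %s' % str(e))
    set_ext_disabled()
```

Catching broad ``Exception`` (not just ImportError) is deliberate here: a compiler failure inside ``nvcc_module_compile_str`` can raise almost anything, and all of it should degrade to "CUDA disabled" rather than crash the import of theano itself.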
 from theano.sandbox.cuda.type import CudaNdarrayType
 from theano.sandbox.cuda.var import (CudaNdarrayVariable,
                                      CudaNdarrayConstant,
@@ -103,7 +113,7 @@ def use(device=config.device):
         raise ValueError("Invalid device identifier", device)
     if use.device_number is None:
         # No successful call to use() has been made yet
-        if device=="-1" or device=="CPU":
+        if device<0:
             return
         if device in [None,""]:
             device=0
@@ -134,6 +144,5 @@ def handle_shared_float32(tf):
     else:
         raise NotImplementedError('removing our handler')

 if enable_cuda and config.device.startswith('gpu'):
     use()
@@ -6,6 +6,13 @@ from theano import config
 _logger=logging.getLogger("theano.sandbox.cuda.nvcc_compiler")
 _logger.setLevel(logging.WARN)

+from theano.configparser import config, AddConfigVar, StrParam
+
+AddConfigVar('nvcc.compiler_bindir',
+        "if defined, the nvcc compiler driver will seek g++ and gcc in this directory",
+        StrParam(""))
+
 def error(*args):
     #sys.stderr.write('ERROR:'+ ' '.join(str(a) for a in args)+'\n')
     _logger.error("ERROR: "+' '.join(str(a) for a in args))
@@ -68,6 +75,8 @@ def nvcc_module_compile_str(module_name, src_code, location=None, include_dirs=[
     debug('Generating shared lib', lib_filename)
     # TODO: Why do these args cause failure on gtx285 that has 1.3 compute capability? '--gpu-architecture=compute_13', '--gpu-code=compute_13',
     cmd = ['nvcc', '-shared', '-g'] + [pa for pa in preargs if pa.startswith('-O')]
+    if config.nvcc.compiler_bindir:
+        cmd.extend(['--compiler-bindir', config.nvcc.compiler_bindir])
     cmd.extend(['-Xcompiler', ','.join(pa for pa in preargs if not pa.startswith('-O'))])
     cmd.extend('-I%s'%idir for idir in include_dirs)
     cmd.extend(['-o',lib_filename])
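The new config option only alters the nvcc command line when it is non-empty, leaving default behaviour untouched. The list-building logic is easy to verify on its own; the helper name and paths below are made up for the sketch:

```python
def build_nvcc_cmd(preargs, compiler_bindir=''):
    """Mirror of the command assembly above: -O flags go to nvcc directly,
    everything else is forwarded to the host compiler via -Xcompiler."""
    cmd = ['nvcc', '-shared', '-g'] + [pa for pa in preargs if pa.startswith('-O')]
    if compiler_bindir:
        cmd.extend(['--compiler-bindir', compiler_bindir])
    cmd.extend(['-Xcompiler', ','.join(pa for pa in preargs if not pa.startswith('-O'))])
    return cmd

default_cmd = build_nvcc_cmd(['-O3', '-fPIC'])
custom_cmd = build_nvcc_cmd(['-O3', '-fPIC'], compiler_bindir='/opt/gcc-4.4/bin')
```

The option exists because nvcc of that era only supported specific host-gcc versions; pointing `--compiler-bindir` at an older gcc lets the system default compiler stay newer than what nvcc tolerates.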
......
@@ -140,20 +140,20 @@ def test_elemwise1():
     b = tensor.fmatrix()

     #let debugmode catch any mistakes
-    print >> sys.stderr, "STARTING FUNCTION 1"
+    print >> sys.stdout, "STARTING FUNCTION 1"
     f = pfunc([b], [], updates=[(a, b**a)], mode=mode_with_gpu)
     for i, node in enumerate(f.maker.env.toposort()):
         print i, node
     f(numpy.random.rand(*shape)+0.3)

-    print >> sys.stderr, "STARTING FUNCTION 2"
+    print >> sys.stdout, "STARTING FUNCTION 2"
     #let debugmode catch any mistakes
     f = pfunc([b], [], updates=[(a, tensor.exp(b**a))], mode=mode_with_gpu)
     for i, node in enumerate(f.maker.env.toposort()):
         print i, node
     f(numpy.random.rand(*shape)+0.3)

-    print >> sys.stderr, "STARTING FUNCTION 3"
+    print >> sys.stdout, "STARTING FUNCTION 3"
     #let debugmode catch any mistakes
     f = pfunc([b], [], updates=[(a, a+b * tensor.exp(b**a))], mode=mode_with_gpu)
     f(numpy.random.rand(*shape)+0.3)
@@ -169,11 +169,11 @@ def test_elemwise2():
         f = pfunc([b], [], updates=[(a, (a+b).dimshuffle(pattern))], mode=mode_with_gpu)
         has_elemwise = False
         for i, node in enumerate(f.maker.env.toposort()):
-            print >> sys.stderr, i, node
+            print >> sys.stdout, i, node
             has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
         assert not has_elemwise
         #let debugmode catch errors
-        print >> sys.stderr, 'pattern', pattern
+        print >> sys.stdout, 'pattern', pattern
         f(rng.rand(*shape)*.3)

     shape = (3,4,5,6)
@@ -204,7 +204,7 @@ def test_elemwise3():
             b**a).dimshuffle([2,0,3,1]))], mode=mode_with_gpu)
     has_elemwise = False
     for i, node in enumerate(f.maker.env.toposort()):
-        print >> sys.stderr, i, node
+        print >> sys.stdout, i, node
         has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
     assert not has_elemwise
     #let debugmode catch errors
@@ -220,7 +220,7 @@ def test_elemwise4():
     f = pfunc([b,c], [], updates=[(a, (a+b.dimshuffle('x', 0)*c.dimshuffle(0, 'x')))], mode=mode_with_gpu)
     has_elemwise = False
     for i, node in enumerate(f.maker.env.toposort()):
-        print >> sys.stderr, i, node
+        print >> sys.stdout, i, node
         has_elemwise = has_elemwise or isinstance(node.op, tensor.Elemwise)
     assert not has_elemwise
     #let debugmode catch errors
......
@@ -360,7 +360,7 @@ def test_subsample():

 def test_logical_shapes():
     # implement when
-    print >> sys.stderr, "INFO: test_logical_shapes not implemented (i.e. imshp_logical, kshp_logical, kshp_logical_top_aligned)"
+    print >> sys.stderr, "WARNING TODO: test_logical_shapes not implemented (i.e. imshp_logical, kshp_logical, kshp_logical_top_aligned)"

 def _test_dummy():
......
@@ -8,7 +8,7 @@ if cuda_ndarray.enable_cuda == False:
 import numpy

 def test_host_to_device():
-    print >>sys.stderr, 'starting test_host_to_dev'
+    print >>sys.stdout, 'starting test_host_to_dev'
     for shape in ((), (3,), (2,3), (3,4,5,6)):
         a = theano._asarray(numpy.random.rand(*shape), dtype='float32')
         b = cuda_ndarray.CudaNdarray(a)
@@ -53,7 +53,7 @@ def test_add():

 def test_exp():
-    print >>sys.stderr, 'starting test_exp'
+    print >>sys.stdout, 'starting test_exp'
     for shape in ((), (3,), (2,3), (1,10000000),(10,1000000), (100,100000),(1000,10000),(10000,1000)):
         a0 = theano._asarray(numpy.random.rand(*shape), dtype='float32')
         a1 = a0.copy()
@@ -74,25 +74,25 @@ def test_exp():

 def test_copy():
-    print >>sys.stderr, 'starting test_copy'
+    print >>sys.stdout, 'starting test_copy'
     shape = (5,)
     a = theano._asarray(numpy.random.rand(*shape), dtype='float32')

-    print >>sys.stderr, '.. creating device object'
+    print >>sys.stdout, '.. creating device object'
     b = cuda_ndarray.CudaNdarray(a)

-    print >>sys.stderr, '.. copy'
+    print >>sys.stdout, '.. copy'
     c = copy.copy(b)
-    print >>sys.stderr, '.. deepcopy'
+    print >>sys.stdout, '.. deepcopy'
     d = copy.deepcopy(b)

-    print >>sys.stderr, '.. comparisons'
+    print >>sys.stdout, '.. comparisons'
     assert numpy.allclose(a, numpy.asarray(b))
     assert numpy.allclose(a, numpy.asarray(c))
     assert numpy.allclose(a, numpy.asarray(d))

 def test_dot():
-    print >>sys.stderr, 'starting test_dot'
+    print >>sys.stdout, 'starting test_dot'
     a0 = theano._asarray(numpy.random.rand(4, 7), dtype='float32')
     a1 = theano._asarray(numpy.random.rand(7, 6), dtype='float32')
@@ -101,7 +101,7 @@ def test_dot():
     assert numpy.allclose(numpy.dot(a0, a1), cuda_ndarray.dot(b0, b1))

-    print >> sys.stderr, 'WARNING test_dot: not testing all 8 transpose cases of dot'
+    print >> sys.stderr, 'WARNING TODO test_dot: not testing all 8 transpose cases of dot'

 def test_sum():
     shape = (2,3)
@@ -147,7 +147,7 @@ def test_reshape():
             ]

     def subtest(shape_1, shape_2):
-        #print >> sys.stderr, "INFO: shapes", shape_1, shape_2
+        #print >> sys.stdout, "INFO: shapes", shape_1, shape_2
         a = theano._asarray(numpy.random.rand(*shape_1), dtype='float32')
         b = cuda_ndarray.CudaNdarray(a)
......
...
@@ -1125,7 +1125,7 @@ inv = Inv(upgrade_to_float, name = 'inv')
 class Log(UnaryScalarOp):
     """ log base e """
     def impl(self, x):
-        return math.log(x)
+        return numpy.log(x)
     def grad(self, (x, ), (gz, )):
         if x.type in grad_types:
             return gz / x,
...
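The `math.log` to `numpy.log` change matters when `impl` receives an ndarray rather than a Python scalar: `math.log` only accepts scalars, while `numpy.log` applies elementwise to arrays (and handles 0-d arrays). A minimal sketch of the difference:

```python
import math
import numpy

x = numpy.array([1.0, numpy.e, numpy.e ** 2])

# numpy.log works elementwise on arrays.
print(numpy.log(x))  # approximately [0., 1., 2.]

# math.log only accepts scalars; a multi-element array raises TypeError.
try:
    math.log(x)
except TypeError as e:
    print('math.log rejects arrays:', e)
```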
...
@@ -330,6 +330,7 @@ class TensorType(Type):
         self.broadcastable = tuple(broadcastable)
         self.dtype_specs() # error checking is done there
         self.name = name
+        self.numpy_dtype = numpy.dtype(self.dtype)
         if shape is None:
             #backport self.shape = tuple((1 if b else None) for b in self.broadcastable)
             l=[]
@@ -360,16 +361,16 @@ class TensorType(Type):
         This function is not meant to be called in user code. It is for
         `Linker` instances to use when running a compiled graph.
         """
-        _data = data
-        if strict:
+        if (type(data) is numpy.ndarray) and (data.dtype is self.numpy_dtype):
+            pass # fall through to ndim check
+        elif strict:
+            # this is its own subcase that doesn't fall through to anything
             if not isinstance(data, numpy.ndarray):
                 raise TypeError("%s expected a ndarray object.", data, type(data))
             if not str(data.dtype) == self.dtype:
                 raise TypeError("%s expected a ndarray object with dtype = %s (got %s)." % (self, self.dtype, data.dtype))
             if not data.ndim == self.ndim:
                 raise TypeError("%s expected a ndarray object with %s dimensions (got %s)." % (self, self.ndim, data.ndim))
-            if self.filter_checks_isfinite and (not numpy.all(numpy.isfinite(data))):
-                raise TypeError("non-finite elements not allowed")
             if TensorType.use_shape:
                 for si, di in zip(self.shape, data.shape):
@@ -378,11 +379,17 @@ class TensorType(Type):
                         self, self.shape, data.shape))
             return data
         else:
-            data = theano._asarray(data, dtype = self.dtype)
-            if not self.ndim == data.ndim:
+            data = theano._asarray(data, dtype = self.dtype) #TODO - consider to pad shape with ones
+            # to make it consistent with self.broadcastable... like vector->row type thing
+            if self.ndim != data.ndim:
                 raise TypeError("Wrong number of dimensions: expected %s, got %s with shape %s." % (self.ndim, data.ndim, data.shape), data)
-            if any(b and d != 1 for d, b in zip(data.shape, self.broadcastable)):
-                raise TypeError("Non-unit value on shape on a broadcastable dimension.", data.shape, self.broadcastable)
+            i = 0
+            for b in self.broadcastable:
+                if b and data.shape[i] != 1:
+                    raise TypeError("Non-unit value on shape on a broadcastable dimension.", data.shape, self.broadcastable)
+                i += 1
+            if self.filter_checks_isfinite and (not numpy.all(numpy.isfinite(data))):
+                raise ValueError("non-finite elements not allowed")
             return data

     def dtype_specs(self):
@@ -1826,14 +1833,16 @@ class Default(gof.Op):
     view_map = {0: [0]}
     def make_node(self, x, default):
         x, default = as_tensor_variable(x), as_tensor_variable(default)
-        assert x.type == default.type
+        if x.type != default.type:
+            raise TypeError('Both default() arguments must have same type', x, default)
         return gof.Apply(self, [x, default], [default.type()])
     def perform(self, node, (x, default), (out, )):
         if x is None:
+            # why copy? Theano can't yet understand out[0] being a view of either x or y,
+            # so we can be a view of x, but only a copy of y.
             out[0] = default.copy()
         else:
             out[0] = x
-        #backport out[0] = default.copy() if x is None else x

 default = Default()
 setdefault = default # legacy
...
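The new comment in `Default.perform` is about aliasing: handing back the stored `default` array itself would let the caller mutate it in place, corrupting the default for every later call, so it must be copied. A small plain-numpy illustration of the hazard, outside Theano:

```python
import numpy

default = numpy.array([1.0, 2.0, 3.0])

# Returning the default without copying aliases the caller's result
# to the stored default value.
alias = default
alias[0] = 99.0
print(default[0])  # 99.0 -- the stored default was silently mutated

# Copying, as Default.perform does, keeps the stored default intact.
default = numpy.array([1.0, 2.0, 3.0])
safe = default.copy()
safe[0] = 99.0
print(default[0])  # 1.0
```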