Commit 0c12e4e5 authored by Julien Rebetez

Merge branch 'master' into optdb_contains

@@ -5,10 +5,15 @@
==========================
Frequently Asked Questions
==========================
Does Theano support Python 3?
------------------------------
We support both Python 2 >= 2.6 and Python 3 >= 3.3.
TypeError: object of type 'TensorVariable' has no len()
-------------------------------------------------------
If you receive the following error, it is because the Python special method *__len__* cannot
be implemented on Theano variables:
.. code-block:: python
@@ -17,7 +22,7 @@ be implemented on Theano variables:
Python requires that *__len__* return an integer, yet this cannot be done because Theano's variables are symbolic. However, `var.shape[0]` can be used as a workaround.
This error message cannot be made more explicit because the relevant aspects of Python's
internals cannot be modified.
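The constraint can be demonstrated without Theano at all. The sketch below uses a stand-in class (hypothetical, not Theano's) whose *__len__* returns a non-integer, which is exactly what a symbolic variable would have to do:

```python
# Standalone illustration (not Theano itself): CPython insists that
# __len__ return an int, so a symbolic variable class cannot defer
# the answer until run time.
class SymbolicVar(object):
    def __len__(self):
        return "symbolic"  # anything but an int is rejected by len()

try:
    len(SymbolicVar())
except TypeError as e:
    print("TypeError:", e)
```

This is why the workaround goes through `var.shape[0]`, which stays symbolic instead of forcing an integer.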
@@ -64,10 +69,10 @@ compilation but it will also use more memory because
in a trade-off between speed of compilation and memory usage.
The Theano flag `reoptimize_unpickled_function` controls whether an unpickled Theano function
should reoptimize its graph or not. Theano users can use the standard Python pickle
tools to save a compiled Theano function. When pickling, both the graph before and
the graph after optimization are saved, including shared variables. When set to True,
the graph is reoptimized when being unpickled. Otherwise, the graph optimization is skipped
and the optimized graph from the pickled file is used directly. After Theano 0.7,
the default changed to False.
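The mechanism can be sketched with nothing but the stdlib. All names below (`CompiledFunction`, `load`, `optimize`) are hypothetical stand-ins, not Theano's API; the point is only that both graphs travel in the pickle and a flag chooses between them at load time:

```python
# A stdlib-only sketch of the flag described above: both the
# unoptimized and optimized graphs are pickled, and the flag decides
# whether to redo the optimization when loading.
import pickle

class CompiledFunction(object):
    def __init__(self, graph_before, graph_after):
        self.graph_before = graph_before
        self.graph_after = graph_after

def optimize(graph):
    return graph + " (reoptimized)"

def load(blob, reoptimize_unpickled_function=False):
    fn = pickle.loads(blob)
    if reoptimize_unpickled_function:
        return optimize(fn.graph_before)   # redo the optimization
    return fn.graph_after                  # reuse the saved result

blob = pickle.dumps(CompiledFunction("raw graph", "optimized graph"))
print(load(blob))        # uses the stored optimized graph
print(load(blob, True))  # reoptimizes from the raw graph
```

Skipping the reoptimization (the new default) trades a possibly stale optimized graph for much faster unpickling.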
@@ -197,7 +202,7 @@ We try to list in this `wiki page <https://github.com/Theano/Theano/wiki/Related
--------------------------------
Theano offers a good amount of flexibility, but it has some limitations too.
You must answer for yourself the following question: How can my algorithm be cleverly written
so as to make the most of what Theano can do?
Here is a list of some of the known limitations:
...
@@ -20,11 +20,11 @@ instructions below for detailed installation steps):
We develop mainly on 64-bit Linux machines. Other architectures are
not well-tested.
Python_ 2 >= 2.6 or Python_ 3 >= 3.3
The development package (``python-dev`` or ``python-devel``
on most Linux distributions) is recommended (see just below).
Python 2.4 was supported up to and including the release 0.6.
Python 3 is supported past the 3.3 release.
``g++``, ``python-dev``
Not technically required but *highly* recommended, in order to compile
@@ -147,7 +147,13 @@ by typing
pip install Theano
This should work under Python 2 or Python 3. To test, run
.. code-block:: bash
nosetests theano
You may need to add ``sudo`` before the ``pip`` command to install into your
system's ``site-packages`` directory. If you do not have administrator access
to your machine, you can install Theano locally (to ~/.local) using
@@ -226,7 +232,7 @@ Bleeding-edge install instructions
If you are a developer of Theano, then check out the :ref:`dev_start_guide`.
If you want the bleeding edge without developing the code, you can use pip for
this with the command line below. Note that it will also try to install Theano's dependencies
(like NumPy and SciPy), but not upgrade them. If you wish to upgrade them,
remove the ``--no-deps`` switch, but see the previous warning before doing this.
@@ -255,11 +261,6 @@ From here, the easiest way to get started is (this requires setuptools_ or distr
cd Theano
python setup.py develop
This will install a ``.pth`` file in your ``site-packages`` directory that
tells Python where to look for your Theano installation (i.e. in the
directory you just checked out from Github). Using ``develop`` mode is
...
@@ -182,6 +182,7 @@ Here is the state of that vision as of December 3rd, 2013 (after Theano release
* Possible implementation note: allow Theano Variables in the fgraph to
  have more than one owner.
* We support Python 2 and Python 3.
* We have a CUDA backend for tensors of type `float32` only.
* Efforts have begun towards a generic GPU ndarray (GPU tensor) (started in the
  `libgpuarray <https://github.com/Theano/libgpuarray>`_ project)
...
@@ -131,7 +131,7 @@ if __name__ == '__main__':
    inopt = [docpath, workdir]
    if files is not None:
        inopt.extend(files)
    sphinx.build_main(['', '-b', builder] + extraopts + inopt)
if options['--all'] or options['--rst']:
    mkdir("doc")
...
@@ -112,7 +112,8 @@ if config.device.startswith('gpu') or config.init_gpu_device.startswith('gpu'):
if (config.device.startswith('cuda') or
        config.device.startswith('opencl') or
        config.init_gpu_device.startswith('cuda') or
        config.init_gpu_device.startswith('opencl') or
        config.contexts != ''):
    import theano.sandbox.gpuarray
# Use config.numpy to call numpy.seterr
...
@@ -580,7 +580,7 @@ def debugprint(r, prefix='', depth=-1, done=None, print_type=False,
               print_view_map=False, order=None, ids='CHAR',
               stop_on_name=False, prefix_child=None,
               scan_ops=None, profile=None,
               scan_inner_to_outer_inputs=None, smap=None):
    """
    Print the graph leading to `r` to the given depth.
@@ -620,7 +620,8 @@ def debugprint(r, prefix='', depth=-1, done=None, print_type=False,
    scan_inner_to_outer_inputs
        A dictionary mapping a scan op's inner function inputs to the scan op
        inputs (outer inputs) for printing purposes.
    smap
        None or the storage_map when printing a Theano function.
    """
    if depth == 0:
        return
@@ -689,23 +690,21 @@ def debugprint(r, prefix='', depth=-1, done=None, print_type=False,
        already_printed = a in done  # get_id_str put it in the dict
        id_str = get_id_str(a)

        if len(a.outputs) == 1:
            idx = ""
        else:
            idx = ".%i" % a.outputs.index(r)
        data = ""
        if smap:
            data = " " + str(smap.get(a.outputs[0], ''))
        if profile is None or a not in profile.apply_time:
            print('%s%s%s %s%s \'%s\' %s %s %s%s' % (prefix, a.op, idx,
                                                     id_str, type_str,
                                                     r_name,
                                                     destroy_map_str,
                                                     view_map_str,
                                                     o, data), file=file)
        else:
            op_time = profile.apply_time[a]
            op_time_percent = (op_time / profile.fct_call_time) * 100
@@ -714,31 +713,22 @@ def debugprint(r, prefix='', depth=-1, done=None, print_type=False,
            tot_time_percent = (tot_time_dict[a] / profile.fct_call_time) * 100
            if len(a.outputs) == 1:
                idx = ""
            else:
                idx = ".%i" % a.outputs.index(r)
            print("%s%s%s %s%s '%s' %s %s %s%s --> "
                  "%8.2es %4.1f%% %8.2es %4.1f%%"
                  % (prefix, a.op, idx,
                     id_str, type_str,
                     r_name,
                     destroy_map_str,
                     view_map_str,
                     o, data,
                     op_time,
                     op_time_percent,
                     tot_time,
                     tot_time_percent), file=file)
        if not already_printed:
            if (not stop_on_name or
@@ -761,7 +751,8 @@ def debugprint(r, prefix='', depth=-1, done=None, print_type=False,
                    ids=ids, stop_on_name=stop_on_name,
                    prefix_child=new_prefix_child, scan_ops=scan_ops,
                    profile=profile,
                    scan_inner_to_outer_inputs=scan_inner_to_outer_inputs,
                    smap=smap)
    else:
        if scan_inner_to_outer_inputs is not None and\
           r in scan_inner_to_outer_inputs:
@@ -777,8 +768,13 @@ def debugprint(r, prefix='', depth=-1, done=None, print_type=False,
                  outer_id_str), file=file)
        else:
            # this is an input variable
            data = ""
            if smap:
                data = " " + str(smap.get(r, ''))
            id_str = get_id_str(r)
            print('%s%s %s%s%s' % (prefix, r, id_str,
                                   type_str, data),
                  file=file)

    return file
...
@@ -91,6 +91,8 @@ exclude = []
if not theano.config.cxx:
    exclude = ['cxx_only']
OPT_NONE = gof.Query(include=[], exclude=exclude)
# Even if there are multiple calls to the merge optimizer, this
# shouldn't impact performance.
OPT_MERGE = gof.Query(include=['merge'], exclude=exclude)
OPT_FAST_RUN = gof.Query(include=['fast_run'], exclude=exclude)
OPT_FAST_RUN_STABLE = OPT_FAST_RUN.requiring('stable')
@@ -113,7 +115,7 @@ OPT_STABILIZE.name = 'OPT_STABILIZE'
predefined_optimizers = {
    None: OPT_NONE,
    'None': OPT_NONE,
    'merge': OPT_MERGE,
    'fast_run': OPT_FAST_RUN,
    'fast_run_stable': OPT_FAST_RUN_STABLE,
    'fast_compile': OPT_FAST_COMPILE,
...
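The design point of the hunk above, mapping `'merge'` to the module-level `OPT_MERGE` instead of constructing a fresh `gof.MergeOptimizer()`, is that a registry of shared instances is built once and every lookup returns the same object. A minimal sketch with hypothetical stand-in classes (not Theano's):

```python
# A registry of shared instances: built once, every lookup returns
# the very same object, so repeated lookups cost nothing extra.
class Query(object):
    def __init__(self, include):
        self.include = include

OPT_MERGE = Query(include=['merge'])

predefined_optimizers = {
    'merge': OPT_MERGE,
}

# Identity, not just equality: both lookups yield one shared object.
print(predefined_optimizers['merge'] is predefined_optimizers['merge'])
```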
@@ -25,3 +25,8 @@ def test_no_output_from_implace():
    fct_opt = theano.function([x, y], b, mode=mode_opt)
    op = fct_opt.maker.fgraph.outputs[0].owner.op
    assert (not hasattr(op, 'destroy_map') or 0 not in op.destroy_map)


def test_including():
    mode = theano.Mode(optimizer='merge')
    mode.including('fast_compile')
@@ -111,6 +111,29 @@ AddConfigVar(
    BoolParam(False, allow_override=False),
    in_c_key=False)


class ContextsParam(ConfigParam):
    def __init__(self):
        def filter(val):
            if val == '':
                return val
            for v in val.split(';'):
                s = v.split('->')
                if len(s) != 2:
                    raise ValueError("Malformed context map: %s" % (v,))
            return val
        ConfigParam.__init__(self, '', filter, False)

AddConfigVar(
    'contexts',
    """
    Context map for multi-gpu operation. Format is a
    semicolon-separated list of names and device names in the
    'name->dev_name' format. An example that would map name 'test' to
    device 'cuda0' and name 'test2' to device 'opencl0:0' follows:
    "test->cuda0;test2->opencl0:0".
    """, ContextsParam(), in_c_key=False)
AddConfigVar(
    'print_active_device',
    "Print the active device when the GPU device is initialized.",
...
@@ -32,7 +32,7 @@ class TestPyDotFormatter(unittest.TestCase):
        expected = 11
        if th.config.mode == "FAST_COMPILE":
            expected = 12
        self.assertEqual(len(graph.get_nodes()), expected)
        nc = self.node_counts(graph)
        if th.config.mode == "FAST_COMPILE":
...
@@ -547,9 +547,7 @@ class CLinker(link.Linker):
        if no_recycling is None:
            no_recycling = []
        if self.fgraph is not None and self.fgraph is not fgraph:
            return type(self)(self.schedule).accept(fgraph, no_recycling)
        self.fgraph = fgraph
        self.fetch_variables()
        self.no_recycling = no_recycling
@@ -1284,6 +1282,34 @@ class CLinker(link.Linker):
            c_compiler=self.c_compiler(),
            )

    def cmodule_key_variables(self, inputs, outputs, no_recycling,
                              compile_args=None, libraries=None,
                              header_dirs=None, insert_config_md5=True,
                              c_compiler=None):

        # Assemble a dummy fgraph using the provided inputs and outputs. It
        # is only used to compute the cmodule key, so it only needs to expose
        # an `inputs` and an `outputs` attribute as well as a toposort()
        # method which returns a deterministic result.
        class FakeFunctionGraph():
            def __init__(self, inputs, outputs):
                self.inputs = inputs
                self.outputs = outputs

            def toposort(self):
                # Calling io_toposort() here is fine because the results will
                # only be used to compute the cmodule key, which requires that
                # the result of the toposort be deterministic. The ordering
                # doesn't need to include information about inplace operations
                # because that information will be included explicitly in
                # cmodule_key_().
                return graph.io_toposort(self.inputs, self.outputs)

        fgraph = FakeFunctionGraph(inputs, outputs)
        return self.cmodule_key_(fgraph, no_recycling, compile_args,
                                 libraries, header_dirs, insert_config_md5,
                                 c_compiler)

    def cmodule_key_(self, fgraph, no_recycling, compile_args=None,
                     libraries=None, header_dirs=None, insert_config_md5=True,
                     c_compiler=None):
@@ -1425,8 +1451,15 @@ class CLinker(link.Linker):
            fgraph_computed_set.update(node.outputs)

        # Add unused inputs to the key.
        # If inputs don't define a 'clients' attribute (as is the case if
        # fgraph is not a real FunctionGraph but a FakeFunctionGraph, a
        # lightweight class designed to imitate FunctionGraph), pretend they
        # have none. This is fine because the goal is only to have all of the
        # graph's information used to compute the key. If we mistakenly
        # pretend that inputs with clients don't have any, we are only using
        # those inputs more than once to compute the key.
        for ipos, var in [(i, var) for i, var in enumerate(fgraph.inputs)
                          if not len(getattr(var, 'clients', []))]:
            sig.append((var.type, in_sig(var, -1, ipos)))

        # crystallize the signature and version
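The `getattr(var, 'clients', [])` fallback above is plain duck typing: objects that lack the attribute behave as if they had an empty client list. A self-contained sketch with hypothetical stand-in classes (not Theano's):

```python
# Duck typing via getattr-with-default: a fake variable without a
# 'clients' attribute is treated as having no clients at all.
class FakeVar(object):
    pass  # no 'clients' attribute, like FakeFunctionGraph inputs

class RealVar(object):
    def __init__(self):
        self.clients = [('some_node', 0)]  # one client

inputs = [FakeVar(), RealVar()]
# Select only the inputs without clients, as the key computation does:
unused = [i for i, v in enumerate(inputs)
          if not len(getattr(v, 'clients', []))]
print(unused)  # [0]: only the FakeVar qualifies
```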
@@ -1720,7 +1753,8 @@ class OpWiseCLinker(link.LocalLinker):
            return type(self)(
                fallback_on_perform=self.fallback_on_perform,
                allow_gc=self.allow_gc,
                nice_errors=self.nice_errors,
                schedule=self.schedule,
                ).accept(fgraph, no_recycling)
            # raise Exception("Cannot accept from a Linker that is
            #                 already tied to another FunctionGraph.")
@@ -1873,7 +1907,8 @@ class DualLinker(link.Linker):
        if no_recycling is None:
            no_recycling = []
        if self.fgraph is not None and self.fgraph is not fgraph:
            return type(self)(self.checker, self.schedule).accept(
                fgraph, no_recycling)
        self.fgraph = fgraph
        self.no_recycling = no_recycling
        return self
...
@@ -17,6 +17,7 @@ import tempfile
import time
import platform
import distutils.sysconfig
import warnings

import numpy.distutils  # TODO: TensorType should handle this
@@ -324,7 +325,10 @@ def dlimport(fullpath, suffix=None):
    if hasattr(importlib, "invalidate_caches"):
        importlib.invalidate_caches()
    t0 = time.time()
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore",
                                message="numpy.ndarray size changed")
        rval = __import__(module_name, {}, {}, [module_name])
    t1 = time.time()
    import_time += t1 - t0
    if not rval:
...
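The pattern in the hunk above, suppressing one specific warning message only for the duration of a block, can be shown standalone (`noisy` is a hypothetical stand-in for the import that triggers the warning):

```python
# catch_warnings restores the filter state on exit; filterwarnings
# with message=... silences only the one matching message inside.
import warnings

def noisy():
    warnings.warn("numpy.ndarray size changed")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", message="numpy.ndarray size changed")
    noisy()

print(len(caught))  # 0: the message was filtered out
```

Because the filter change is scoped to the `with` block, unrelated warnings elsewhere are unaffected.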
#! /usr/bin/env python
"""
This script compares the runtime of two independent dot products on one
GPU and on two GPUs to measure the speedup.

This should be 2x if the GPUs are equivalent.
"""
import time

import numpy

import theano
from theano.sandbox.gpuarray import init_dev
from theano.sandbox.gpuarray.type import gpuarray_shared_constructor as shared
from theano.sandbox.gpuarray.blas import gpu_dot22


def main(dev1, dev2):
    init_dev(dev1, 'ctx1')
    init_dev(dev2, 'ctx2')

    val1a = shared(numpy.random.randn(1024, 1024).astype('float32'),
                   context_name='ctx1')
    val1b = shared(numpy.random.randn(1024, 1024).astype('float32'),
                   context_name='ctx1')
    val1c = shared(numpy.random.randn(1024, 1024).astype('float32'),
                   context_name='ctx1')
    val1d = shared(numpy.random.randn(1024, 1024).astype('float32'),
                   context_name='ctx1')

    val2a = shared(numpy.random.randn(1024, 1024).astype('float32'),
                   context_name='ctx2')
    val2b = shared(numpy.random.randn(1024, 1024).astype('float32'),
                   context_name='ctx2')

    # f1 does both products on ctx1; f2 splits them across ctx1 and ctx2.
    f1 = theano.function([], [gpu_dot22(val1a, val1b),
                              gpu_dot22(val1c, val1d)])
    f2 = theano.function([], [gpu_dot22(val1a, val1b),
                              gpu_dot22(val2a, val2b)])

    # Warm-up call; sync so the work is finished before timing starts.
    r = f1()
    r[0].sync(), r[1].sync()
    r = None

    t = time.time()
    r = f1()
    r[0].sync(), r[1].sync()
    t2 = time.time()
    r = None
    print("one ctx %f" % (t2 - t,))

    r = f2()
    r[0].sync(), r[1].sync()
    r = None

    t = time.time()
    r = f2()
    r[0].sync(), r[1].sync()
    t2 = time.time()
    r = None
    print("two ctx %f" % (t2 - t,))


if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3:
        raise ValueError("This script requires two device names.")
    main(sys.argv[1], sys.argv[2])
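The benchmark's structure, one untimed warm-up call followed by a timed call, is device-agnostic and can be sketched with the stdlib alone (`bench` is a hypothetical helper, not part of the script above):

```python
# Warm-up-then-time pattern: the first call pays one-off costs
# (compilation, transfers), the timed call measures steady state.
import time

def bench(f, warmup=1):
    for _ in range(warmup):
        f()          # warm-up: not timed
    t = time.time()
    f()              # timed run
    return time.time() - t

elapsed = bench(lambda: sum(range(100000)))
print("%f" % elapsed)
```

On GPUs the explicit `sync()` calls in the script matter too: kernel launches are asynchronous, so without synchronization the timer would stop before the work is done.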
@@ -48,7 +48,7 @@ VALID_ASSOC = set(['left', 'right', 'either'])
def debugprint(obj, depth=-1, print_type=False,
               file=None, ids='CHAR', stop_on_name=False,
               done=None, print_storage=False):
    """Print a computation graph as text to stdout or a file.

    :type obj: Variable, Apply, or Function instance
@@ -70,6 +70,10 @@ def debugprint(obj, depth=-1, print_type=False,
    :type done: None or dict
    :param done: A dict where we store the ids of printed nodes.
        Useful to have multiple calls to debugprint share the same ids.
    :type print_storage: bool
    :param print_storage: If True, this will print the storage map
        for Theano functions. Combined with allow_gc=False, after the
        execution of a Theano function, we see the intermediate results.

    :returns: string if `file` == 'str', else file arg
@@ -101,7 +105,8 @@ def debugprint(obj, depth=-1, print_type=False,
        done = dict()
    results_to_print = []
    profile_list = []
    order = []  # Toposort
    smap = []  # storage_map
    if isinstance(obj, (list, tuple, set)):
        lobj = obj
    else:
@@ -110,24 +115,41 @@ def debugprint(obj, depth=-1, print_type=False,
    if isinstance(obj, gof.Variable):
        results_to_print.append(obj)
        profile_list.append(None)
        smap.append(None)
        order.append(None)
    elif isinstance(obj, gof.Apply):
        results_to_print.extend(obj.outputs)
        profile_list.extend([None for item in obj.outputs])
        smap.extend([None for item in obj.outputs])
        order.extend([None for item in obj.outputs])
    elif isinstance(obj, Function):
        results_to_print.extend(obj.maker.fgraph.outputs)
        profile_list.extend(
            [obj.profile for item in obj.maker.fgraph.outputs])
        if print_storage:
            smap.extend(
                [obj.fn.storage_map for item in obj.maker.fgraph.outputs])
        else:
            smap.extend(
                [None for item in obj.maker.fgraph.outputs])
        topo = obj.maker.fgraph.toposort()
        order.extend(
            [topo for item in obj.maker.fgraph.outputs])
    elif isinstance(obj, gof.FunctionGraph):
        results_to_print.extend(obj.outputs)
        profile_list.extend([getattr(obj, 'profile', None)
                             for item in obj.outputs])
        smap.extend([getattr(obj, 'storage_map', None)
                     for item in obj.outputs])
        topo = obj.toposort()
        order.extend([topo for item in obj.outputs])
    elif isinstance(obj, (integer_types, float, np.ndarray)):
        print(obj)
    elif isinstance(obj, (theano.In, theano.Out)):
        results_to_print.append(obj.variable)
        profile_list.append(None)
        smap.append(None)
        order.append(None)
    else:
        raise TypeError("debugprint cannot print an object of this type",
                        obj)
@@ -152,16 +174,16 @@ N.B.:
to remove when optimizing a graph because their <total time> is very low.
""", file=_file)

    for r, p, s, o in zip(results_to_print, profile_list, smap, order):
        # Add the parent scan op to the list as well
        if (hasattr(r.owner, 'op') and
                isinstance(r.owner.op, theano.scan_module.scan_op.Scan)):
            scan_ops.append(r)

        debugmode.debugprint(r, depth=depth, done=done, print_type=print_type,
                             file=_file, order=o, ids=ids,
                             scan_ops=scan_ops, stop_on_name=stop_on_name,
                             profile=p, smap=s)

    if len(scan_ops) > 0:
        print("", file=_file)
@@ -996,7 +1018,11 @@ def pydotprint(fct, outfile=None,
            else:
                new_name = basename + '_' + str(idx)
            new_name = os.path.join(path, new_name + ext)
            if hasattr(scan_op.op, 'fn'):
                to_print = scan_op.op.fn
            else:
                to_print = scan_op.op.outputs
            pydotprint(to_print, new_name, compact, format, with_ids,
                       high_contrast, cond_highlight, colorCodes,
                       max_label_size, scan_graphs)
...
@@ -92,10 +92,7 @@ class HostFromGpu(GpuOp):
    def R_op(self, inputs, eval_points):
        ev, = eval_points
        return self(ev)

    def infer_shape(self, node, xshp):
        return xshp
@@ -155,10 +152,7 @@ class GpuFromHost(GpuOp):
    def R_op(self, inputs, eval_points):
        ev, = eval_points
        return self(ev)

    def infer_shape(self, node, xshp):
        return xshp
...
...@@ -142,8 +142,8 @@ void * device_malloc(size_t size, int verbose) ...@@ -142,8 +142,8 @@ void * device_malloc(size_t size, int verbose)
status = cnmemMalloc(&rval, size, NULL); status = cnmemMalloc(&rval, size, NULL);
if(status != CNMEM_STATUS_SUCCESS) { if(status != CNMEM_STATUS_SUCCESS) {
PyErr_Format(PyExc_MemoryError, PyErr_Format(PyExc_MemoryError,
"Error allocating %zd bytes of device memory (%s).", "Error allocating %llu bytes of device memory (%s).",
size, cnmemGetErrorString(status)); (unsigned long long)size, cnmemGetErrorString(status));
return NULL; return NULL;
} }
} }
@@ -168,21 +168,21 @@ void * device_malloc(size_t size, int verbose)
     }
 #if COMPUTE_GPU_MEM_USED
     fprintf(stderr,
-            "Error allocating %zd bytes of device memory (%s)."
-            " new total bytes allocated: %d."
-            " Driver report %zd bytes free and %zd bytes total \n",
-            size, cudaGetErrorString(err), _allocated_size,
-            free, total);
+            "Error allocating %llu bytes of device memory (%s)."
+            " new total bytes allocated: %llu."
+            " Driver report %llu bytes free and %llu bytes total \n",
+            (unsigned long long)size, cudaGetErrorString(err), (unsigned long long)_allocated_size,
+            (unsigned long long)free, (unsigned long long)total);
 #else
     fprintf(stderr,
-            "Error allocating %zd bytes of device memory (%s)."
-            " Driver report %zd bytes free and %zd bytes total \n",
-            size, cudaGetErrorString(err), free, total);
+            "Error allocating %llu bytes of device memory (%s)."
+            " Driver report %llu bytes free and %llu bytes total \n",
+            (unsigned long long)size, cudaGetErrorString(err), (unsigned long long)free, (unsigned long long)total);
 #endif
   }
   PyErr_Format(PyExc_MemoryError,
-               "Error allocating %zd bytes of device memory (%s).",
-               size, cudaGetErrorString(err));
+               "Error allocating %llu bytes of device memory (%s).",
+               (unsigned long long)size, cudaGetErrorString(err));
   return NULL;
 }
@@ -310,17 +310,17 @@ int device_free(void *ptr)
     }
     assert(i<TABLE_SIZE);
     fprintf(stderr,
-            "Error freeing device pointer %p (%s) of size %d. %zd byte already allocated."
-            " Driver report %zd bytes free and %zd bytes total \n",
+            "Error freeing device pointer %p (%s) of size %llu. %llu byte already allocated."
+            " Driver report %llu bytes free and %llu bytes total \n",
             ptr, cudaGetErrorString(err),
-            _alloc_size_table[i].size, _allocated_size, free, total);
+            (unsigned long long)_alloc_size_table[i].size, (unsigned long long)_allocated_size, (unsigned long long)free, (unsigned long long)total);
   }
 #else
   fprintf(stderr,
           "Error freeing device pointer %p (%s)."
-          " Driver report %zd bytes free and %zd bytes total \n",
+          " Driver report %llu bytes free and %llu bytes total \n",
           ptr,
-          cudaGetErrorString(err), free, total);
+          cudaGetErrorString(err), (unsigned long long)free, (unsigned long long)total);
 #endif
   if (NULL != PyErr_Occurred()){
     fprintf(stderr,
...
@@ -1765,10 +1765,6 @@ def dnn_pool(img, ws, stride=(1, 1), mode='max', pad=(0, 0)):
         bottom borders.
     pad_w is the number of zero-valued pixels added to each of the left
         and right borders.
-    nd
-        Number of dimensions of pooling, can be 2 or 3 for 2d or 3d pooling.
-        If set to 3 all other params (except mode) must have an extra
-        dimension to match. 3 is only available for cudnn v3.

 .. warning:: The cuDNN library only works with GPUs that have a compute
     capability of 3.0 or higher. This means that older GPUs will not
...
@@ -2478,8 +2478,11 @@ def local_gpu_allocempty(node):
     return False

+def typeInfer(node):
+    return typeConstructor
+
 optdb.register('gpu_scanOp_make_inplace',
-               scan_opt.ScanInplaceOptimizer(typeConstructor=typeConstructor,
+               scan_opt.ScanInplaceOptimizer(typeInfer=typeInfer,
                                              gpu_flag=True),
                75,
                'gpu',
...
@@ -279,8 +279,7 @@ def test_pooling():
             a = f1(data).__array__()
             b = f2(data).__array__()
-            assert numpy.allclose(a, b,
-                                  atol=numpy.finfo(numpy.float32).eps)
+            utt.assert_allclose(a, b)

     # Test the grad
     for shp in [(1, 1, 2, 2),
@@ -338,7 +337,7 @@ def test_pooling():
         assert any([isinstance(node.op, AveragePoolGrad)
                     for node in fc.maker.fgraph.toposort()])
         c_out = fc(data)
-        assert numpy.allclose(c_out, g_out)
+        utt.assert_allclose(c_out, g_out)
def test_pooling3d():
@@ -443,7 +442,7 @@ def test_pooling3d():
         fc = theano.function([x], theano.grad(out.sum(), x),
                              mode=mode_without_gpu)
         c_out = fc(data)
-        assert numpy.allclose(c_out, g_out)
+        utt.assert_allclose(c_out, g_out)
def test_pooling_opt():
...
@@ -1357,8 +1356,10 @@ def test_conv3d_bwd():
         # Compare the results of the two implementations
         res_ref = f_ref()
         res = f()
-        utt.assert_allclose(res_ref[0], res[0])
-        utt.assert_allclose(res_ref[1], res[1])
+        # Needed for big sizes with some seeds:
+        # raise rtol to make the test pass with more seeds.
+        utt.assert_allclose(res_ref[0], res[0], rtol=2e-5)
+        utt.assert_allclose(res_ref[1], res[1], rtol=2e-5)

     test_cases = get_conv3d_test_cases()
     for (i_shape, f_shape, subsample), border_mode, conv_mode in test_cases:
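The test changes above all move from `numpy.allclose` with a float32-eps *absolute* tolerance to `utt.assert_allclose` with a *relative* tolerance. A minimal stdlib sketch of why that matters, using `math.isclose` (the numbers are illustrative, not taken from the test suite):

```python
import math

# A float32-eps absolute tolerance (~1.19e-7) rejects values whose error
# is tiny *relative* to their magnitude once the values themselves are
# large -- exactly the case hit by big conv3d shapes with some seeds.
a, b = 1000.0, 1000.0001

assert not math.isclose(a, b, rel_tol=0.0, abs_tol=1.19e-7)  # absolute: fails
assert math.isclose(a, b, rel_tol=2e-5, abs_tol=0.0)         # relative: passes
```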
...
@@ -21,26 +21,30 @@ except ImportError:
     # This is for documentation not to depend on the availability of pygpu

 from .type import (GpuArrayType, GpuArrayVariable, GpuArrayConstant,
-                   GpuArraySharedVariable, gpuarray_shared_constructor)
+                   GpuArraySharedVariable, gpuarray_shared_constructor,
+                   reg_context)
 from . import opt, nerv


-def init_dev(dev):
+def init_dev(dev, name=None):
     if pygpu.gpuarray.api_version() != (-10000, 0):
         raise RuntimeError("Wrong API version for gpuarray:",
                            pygpu.gpuarray.api_version(),
                            "Make sure Theano and libgpuarray/pygpu "
                            "are in sync.")
     global pygpu_activated
-    context = pygpu.init(dev)
-    pygpu.set_default_context(context)
+    if dev not in init_dev.devmap:
+        init_dev.devmap[dev] = pygpu.init(dev)
+    context = init_dev.devmap[dev]
+    # This will map the context name to the real context object.
+    reg_context(name, context)
     pygpu_activated = True
     if config.print_active_device:
-        print("Using device %s: %s" % (dev, context.devname), file=sys.stderr)
-    # remember the active device
-    init_dev.device = dev
+        print("Mapped name %s to device %s: %s" % (name, dev, context.devname),
+              file=sys.stderr)

-init_dev.device = None
+# This maps things like 'cuda0' to the context object on that device.
+init_dev.devmap = {}
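`init_dev` now memoizes contexts in its `devmap` function attribute, so initializing the same device twice reuses a single context instead of creating a new one. A minimal sketch of the pattern (`make_ctx` stands in for `pygpu.init`, which we cannot call here):

```python
def init_dev(dev, name=None, make_ctx=object):
    # Memoize per device string: create a context only the first time a
    # device is seen; later calls return the cached object.
    if dev not in init_dev.devmap:
        init_dev.devmap[dev] = make_ctx()
    return init_dev.devmap[dev]

# Maps device strings like 'cuda0' to the context object on that device.
init_dev.devmap = {}

c1 = init_dev('cuda0')
c2 = init_dev('cuda0')
assert c1 is c2                     # same device -> same cached context
assert init_dev('cuda1') is not c1  # different device -> different context
```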
 if pygpu:
     try:
@@ -52,11 +56,21 @@ if pygpu:
         optdb.add_tags('gpuarray_opt', 'fast_run', 'fast_compile')
     elif (config.init_gpu_device.startswith('cuda') or
           config.init_gpu_device.startswith('opencl')):
+        if config.device != 'cpu':
+            raise ValueError('you must set device=cpu to use init_gpu_device.')
+        if config.contexts != '':
+            print("Using contexts will make init_gpu_device act like device and move all computations by default, which might not be what you want.")
         init_dev(config.init_gpu_device)
+    if config.contexts != '':
+        for n, d in (c.split('->') for c in config.contexts.split(';')):
+            init_dev(d.strip(), n.strip())
+        import theano.compile
+        theano.compile.shared_constructor(gpuarray_shared_constructor)
+        optdb.add_tags('gpuarray_opt', 'fast_run', 'fast_compile')
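The new `contexts` flag takes `;`-separated entries of the form `name->device`, each mapping a context name to a device string. A sketch of the parsing done by the loop above (the device strings here are illustrative):

```python
contexts = "dev0->cuda0; dev1->opencl0:0"

# Each ';'-separated entry is 'name->device'; surrounding whitespace is
# stripped from both halves, as in the loop in __init__.py.
mapping = {}
for n, d in (c.split('->') for c in contexts.split(';')):
    mapping[n.strip()] = d.strip()

assert mapping == {'dev0': 'cuda0', 'dev1': 'opencl0:0'}
```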
 from .basic_ops import (GpuAlloc, GpuContiguous, GpuEye, GpuFromHost,
                         GpuJoin, GpuReshape, GpuSplit, HostFromGpu)
-from .basic_ops import host_from_gpu, gpu_from_host
+from .basic_ops import host_from_gpu, GpuFromHost
 from .elemwise import GpuElemwise
 from .subtensor import (GpuSubtensor, GpuIncSubtensor,
                         GpuAdvancedIncSubtensor1)
@@ -67,5 +81,6 @@ else:
     if (config.init_gpu_device.startswith('cuda') or
         config.init_gpu_device.startswith('opencl') or
         config.device.startswith('opencl') or
-        config.device.startswith('cuda')):
+        config.device.startswith('cuda') or
+        config.contexts != ''):
         error("pygpu was configured but could not be imported", exc_info=True)
 import os.path

-from theano import Apply, config
+from theano import Apply, config, Op
 from theano.compile import optdb
-from theano.gof import local_optimizer, LocalOptGroup
+from theano.gof import LocalOptGroup
 from theano.tensor.basic import as_tensor_variable
-from theano.tensor.blas import Dot22, Gemv, Gemm, Ger
 from theano.tensor.opt import in2out
-from .basic_ops import HideC, as_gpuarray_variable, GpuAllocEmpty
+from .basic_ops import as_gpuarray_variable, infer_context_name
+from .opt_util import inplace_allocempty

 try:
     import pygpu
@@ -18,7 +19,7 @@ except ImportError as e:
     pass

-class BlasOp(HideC):
+class BlasOp(Op):
     def c_headers(self):
         return ['<blas_api.h>', '<numpy_compat.h>', '<gpuarray_helper.h>']
@@ -28,34 +29,27 @@ class BlasOp(HideC):
     def c_init_code(self):
         return ['import_pygpu__blas();']
-    def c_support_code(self):
-        return """
-PyGpuArrayObject *gpublas_try_copy(PyGpuArrayObject *out,
-                                   PyGpuArrayObject *y) {
-  if (out &&
-      GpuArray_CHKFLAGS(&out->ga, GA_CARRAY) &&
-      theano_size_check(out, PyGpuArray_NDIM(y),
-                        PyGpuArray_DIMS(y),
-                        y->ga.typecode)) {
-    if (pygpu_move(out, y)) {
-      Py_XDECREF(out);
-      return NULL;
-    }
-  } else {
-    Py_XDECREF(out);
-    out = pygpu_copy(y, GA_ANY_ORDER);
-  }
-  return out;
-}
-"""
-
-class GpuGemv(BlasOp, Gemv):
+class GpuGemv(BlasOp):
+    __props__ = ('inplace',)
+
+    def __init__(self, inplace=False):
+        self.inplace = inplace
+        if self.inplace:
+            self.destroy_map = {0: [0]}
+
     def make_node(self, y, alpha, A, x, beta):
-        Gemv.make_node(self, y, alpha, A, x, beta)
-        A = as_gpuarray_variable(A)
-        x = as_gpuarray_variable(x)
-        y = as_gpuarray_variable(y)
+        ctx_name = infer_context_name(y, A, x)
+        A = as_gpuarray_variable(A, ctx_name)
+        x = as_gpuarray_variable(x, ctx_name)
+        y = as_gpuarray_variable(y, ctx_name)
+        alpha = as_tensor_variable(alpha)
+        beta = as_tensor_variable(beta)
+        assert alpha.ndim == 0
+        assert beta.ndim == 0
+        assert A.ndim == 2
+        assert x.ndim == 1
+        assert y.ndim == 1
         assert A.dtype == x.dtype == y.dtype
         return Apply(self, [y, alpha, A, x, beta], [y.type()])
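`GpuGemv` (and, below, `GpuGemm` and `GpuGer`) now carries its own `inplace` flag and declares `destroy_map` when destructive, instead of inheriting this behavior from the CPU BLAS op. A stripped-down sketch of the pattern, with all Op machinery omitted:

```python
class InplaceOp:
    """Sketch only: __init__ records the flag and, when inplace,
    declares that output 0 overwrites (destroys) input 0."""
    __props__ = ('inplace',)

    def __init__(self, inplace=False):
        self.inplace = inplace
        if self.inplace:
            self.destroy_map = {0: [0]}

assert InplaceOp(inplace=True).destroy_map == {0: [0]}
assert not hasattr(InplaceOp(inplace=False), 'destroy_map')
```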
@@ -73,7 +67,7 @@ class GpuGemv(BlasOp, Gemv):
         if self.inplace:
             code = """
        if (%(y)s->ga.strides[0] <= 0) {
-          %(out)s = gpublas_try_copy(%(out)s, %(y)s);
+          %(out)s = theano_try_copy(%(out)s, %(y)s);
          if (%(out)s == NULL) {
            %(fail)s
          }
@@ -85,7 +79,7 @@ class GpuGemv(BlasOp, Gemv):
        """ % vars
         else:
             code = """
-        %(out)s = gpublas_try_copy(%(out)s, %(y)s);
+        %(out)s = theano_try_copy(%(out)s, %(y)s);
        if (%(out)s == NULL) {
          %(fail)s
        }
@@ -106,21 +100,33 @@ class GpuGemv(BlasOp, Gemv):
         return code

     def c_code_cache_version(self):
-        return (3,)
+        return (4,)

 gpugemv_no_inplace = GpuGemv(inplace=False)
 gpugemv_inplace = GpuGemv(inplace=True)
-class GpuGemm(BlasOp, Gemm):
+class GpuGemm(BlasOp):
+    __props__ = ('inplace',)
     _f16_ok = True

+    def __init__(self, inplace=False):
+        self.inplace = inplace
+        if self.inplace:
+            self.destroy_map = {0: [0]}
+
     def make_node(self, C, alpha, A, B, beta):
+        ctx_name = infer_context_name(C, A, B)
+        A = as_gpuarray_variable(A, ctx_name)
+        B = as_gpuarray_variable(B, ctx_name)
+        C = as_gpuarray_variable(C, ctx_name)
         alpha = as_tensor_variable(alpha)
         beta = as_tensor_variable(beta)
-        A = as_gpuarray_variable(A)
-        B = as_gpuarray_variable(B)
-        C = as_gpuarray_variable(C)
+        assert alpha.ndim == 0
+        assert beta.ndim == 0
+        assert A.ndim == 2
+        assert B.ndim == 2
+        assert C.ndim == 2
         assert A.dtype == B.dtype == C.dtype
         return Apply(self, [C, alpha, A, B, beta], [C.type()])
@@ -138,7 +144,7 @@ class GpuGemm(BlasOp, Gemm):
         if self.inplace:
             code = """
        if (!GpuArray_ISONESEGMENT(&%(C)s->ga)) {
-          %(out)s = gpublas_try_copy(%(out)s, %(C)s);
+          %(out)s = theano_try_copy(%(out)s, %(C)s);
          if (%(out)s == NULL) {
            %(fail)s
          }
@@ -150,7 +156,7 @@ class GpuGemm(BlasOp, Gemm):
        """ % vars
         else:
             code = """
-        %(out)s = gpublas_try_copy(%(out)s, %(C)s);
+        %(out)s = theano_try_copy(%(out)s, %(C)s);
        if (%(out)s == NULL) {
          %(fail)s
        }
@@ -171,25 +177,36 @@ class GpuGemm(BlasOp, Gemm):
         return code

     def c_code_cache_version(self):
-        return (4,)
+        return (5,)

 gpugemm_no_inplace = GpuGemm(inplace=False)
 gpugemm_inplace = GpuGemm(inplace=True)
-class GpuGer(BlasOp, Ger):
+class GpuGer(BlasOp):
+    __props__ = ('inplace',)
+
+    def __init__(self, inplace=False):
+        self.inplace = inplace
+        if self.inplace:
+            self.destroy_map = {0: [0]}
+
     def make_node(self, A, alpha, x, y):
-        Ger.make_node(self, A, alpha, x, y)
-        A = as_gpuarray_variable(A)
-        x = as_gpuarray_variable(x)
-        y = as_gpuarray_variable(y)
+        ctx_name = infer_context_name(A, x, y)
+        A = as_gpuarray_variable(A, ctx_name)
+        x = as_gpuarray_variable(x, ctx_name)
+        y = as_gpuarray_variable(y, ctx_name)
+        alpha = as_tensor_variable(alpha)
+        assert alpha.ndim == 0
+        assert A.ndim == 2
+        assert x.ndim == 1
+        assert y.ndim == 1
         assert A.dtype == x.dtype == y.dtype
         return Apply(self, [A, alpha, x, y], [A.type()])

     def perform(self, node, inp, out):
         A, alpha, x, y = inp
-        inplace = self.destructive
+        inplace = self.inplace
         if inplace and not A.flags.forc:
             inplace = False
         out[0][0] = blas.ger(alpha, x, y, A,
@@ -198,10 +215,10 @@ class GpuGer(BlasOp, Ger):
     def c_code(self, node, name, inp, out, sub):
         vars = dict(out=out[0], A=inp[0], alpha=inp[1], x=inp[2], y=inp[3],
                     fail=sub['fail'], name=name)
-        if self.destructive:
+        if self.inplace:
             code = """
        if (!GpuArray_ISONESEGMENT(&%(A)s->ga)) {
-          %(out)s = gpublas_try_copy(%(out)s, %(A)s);
+          %(out)s = theano_try_copy(%(out)s, %(A)s);
          if (%(out)s == NULL) {
            %(fail)s
          }
@@ -213,7 +230,7 @@ class GpuGer(BlasOp, Ger):
        """ % vars
         else:
             code = """
-        %(out)s = gpublas_try_copy(%(out)s, %(A)s);
+        %(out)s = theano_try_copy(%(out)s, %(A)s);
        if (%(out)s == NULL) {
          %(fail)s
        }
@@ -231,18 +248,22 @@ class GpuGer(BlasOp, Ger):
         return code

     def c_code_cache_version(self):
-        return (2,)
+        return (3,)
-gpuger_no_inplace = GpuGer(destructive=False)
-gpuger_inplace = GpuGer(destructive=True)
+gpuger_no_inplace = GpuGer(inplace=False)
+gpuger_inplace = GpuGer(inplace=True)


-class GpuDot22(BlasOp, Dot22):
+class GpuDot22(BlasOp):
+    __props__ = ()
+
     def make_node(self, x, y):
-        Dot22.make_node(self, x, y)
-        x = as_gpuarray_variable(x)
-        y = as_gpuarray_variable(y)
+        ctx_name = infer_context_name(x, y)
+        x = as_gpuarray_variable(x, ctx_name)
+        y = as_gpuarray_variable(y, ctx_name)
+        assert x.ndim == 2
+        assert y.ndim == 2
         assert x.dtype == y.dtype
         return Apply(self, [x, y], [x.type()])
@@ -268,7 +289,7 @@ class GpuDot22(BlasOp, Dot22):
         dims[1] = PyGpuArray_DIMS(%(B)s)[1];
         if (theano_prep_output(&%(out)s, 2, dims, %(typecode)s, GA_C_ORDER,
-                               pygpu_default_context())) {
+                               %(A)s->context)) {
           %(fail)s
         }
@@ -287,32 +308,24 @@ class GpuDot22(BlasOp, Dot22):
         return code

     def c_code_cache_version(self):
-        return (3,)
+        return (4,)

 gpu_dot22 = GpuDot22()
-@local_optimizer([gpugemv_no_inplace], inplace=True)
-def local_inplace_gpuagemv(node):
-    if node.op == gpugemv_no_inplace:
-        return [gpugemv_inplace(*node.inputs)]
+@inplace_allocempty(GpuGemv, 0)
+def local_inplace_gpuagemv(node, inputs):
+    return [gpugemv_inplace(*inputs)]


-@local_optimizer([gpugemm_no_inplace], inplace=True)
-def local_inplace_gpuagemm(node):
-    if node.op == gpugemm_no_inplace:
-        inputs = list(node.inputs)
-        C = inputs[0]
-        if (C.owner and isinstance(C.owner.op, GpuAllocEmpty) and
-                len(C.clients) > 1):
-            inputs[0] = C.owner.op(*C.owner.inputs)
-        return [gpugemm_inplace(*inputs)]
+@inplace_allocempty(GpuGemm, 0)
+def local_inplace_gpuagemm(node, inputs):
+    return [gpugemm_inplace(*inputs)]


-@local_optimizer([gpuger_no_inplace], inplace=True)
-def local_inplace_gpuager(node):
-    if node.op == gpuger_no_inplace:
-        return [gpuger_inplace(*node.inputs)]
+@inplace_allocempty(GpuGer, 0)
+def local_inplace_gpuager(node, inputs):
+    return [gpuger_inplace(*inputs)]

 gpuablas_opt_inplace = in2out(LocalOptGroup(local_inplace_gpuagemv,
                                             local_inplace_gpuagemm,
...
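`inplace_allocempty` replaces three hand-written `local_optimizer` bodies. Conceptually it is a decorator factory that fires only on out-of-place instances of the given op class and hands the node's inputs to the body. A toy approximation of that shape (not the real helper from `opt_util`; `FakeGemv` and the node object are stand-ins for illustration):

```python
from types import SimpleNamespace

def inplace_allocempty(op_class, idx):
    """Toy stand-in: fire only when the node's op is an out-of-place
    instance of op_class, then pass node.inputs to the decorated body."""
    def decorator(fn):
        def opt(node):
            if isinstance(node.op, op_class) and not node.op.inplace:
                return fn(node, list(node.inputs))
            return None
        return opt
    return decorator

class FakeGemv:
    def __init__(self, inplace=False):
        self.inplace = inplace

@inplace_allocempty(FakeGemv, 0)
def local_inplace(node, inputs):
    return [('inplace-gemv', inputs)]

node = SimpleNamespace(op=FakeGemv(False), inputs=['y', 'a', 'A', 'x', 'b'])
assert local_inplace(node) == [('inplace-gemv', ['y', 'a', 'A', 'x', 'b'])]
assert local_inplace(SimpleNamespace(op=FakeGemv(True), inputs=[])) is None
```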
 import copy
 import os

-import theano
-from theano import config, gof
+from theano import gof

 try:
     from pygpu import gpuarray
@@ -10,7 +9,8 @@ except ImportError:
     pass

 from .type import GpuArrayType
-from .basic_ops import as_gpuarray_variable, GpuKernelBase, Kernel
+from .basic_ops import (as_gpuarray_variable, GpuKernelBase, Kernel,
+                        infer_context_name)
 from theano.gof import utils
@@ -58,6 +58,9 @@ class GpuConv(GpuKernelBase, gof.Op):
     them.

     """
+    __props__ = ('border_mode', 'subsample', 'logical_img_hw',
+                 'logical_kern_hw', 'logical_kern_align_top', 'version',
+                 'verbose', 'kshp', 'imshp', 'max_threads_dim0')

     @staticmethod
     def logical_output_shape_2d(imshp, kshp, mode):
@@ -67,20 +70,13 @@ class GpuConv(GpuKernelBase, gof.Op):
             return imshp[0] + kshp[0] - 1, imshp[1] + kshp[1] - 1
         raise ValueError(mode)

-    def __init__(self, border_mode,
-                 subsample=(1, 1),
-                 logical_img_hw=None,
-                 logical_kern_hw=None,
+    def __init__(self, border_mode, subsample=(1, 1),
+                 logical_img_hw=None, logical_kern_hw=None,
                  logical_kern_align_top=True,
-                 version=-1,
-                 direction_hint=None,
-                 verbose=0,
-                 kshp=None,
-                 imshp=None,
+                 version=-1, direction_hint=None,
+                 verbose=0, kshp=None, imshp=None,
                  max_threads_dim0=None,
-                 nkern=None,
-                 bsize=None,
-                 fft_opt=True):
+                 nkern=None, bsize=None, fft_opt=True):
         self.border_mode = border_mode
         self.subsample = subsample
         if logical_img_hw is not None:
@@ -108,19 +104,6 @@ class GpuConv(GpuKernelBase, gof.Op):
         self.bsize = bsize
         self.fft_opt = fft_opt

-    def __eq__(self, other):
-        return type(self) == type(other) \
-            and self.border_mode == other.border_mode \
-            and self.subsample == other.subsample \
-            and self.logical_img_hw == other.logical_img_hw \
-            and self.logical_kern_hw == other.logical_kern_hw \
-            and self.logical_kern_align_top == other.logical_kern_align_top \
-            and self.version == other.version \
-            and self.verbose == other.verbose \
-            and self.kshp == other.kshp \
-            and self.imshp == other.imshp \
-            and self.max_threads_dim0 == other.max_threads_dim0
-
     def __setstate__(self, d):
         self.__dict__.update(d)
         if not hasattr(self, "imshp"):
@@ -136,32 +119,6 @@ class GpuConv(GpuKernelBase, gof.Op):
         if not hasattr(self, "fft_opt"):
             self.fft_opt = True

-    def __hash__(self):
-        # don't use hash(self.version) as hash(-1)==-2 and
-        # hash(-2)==-2 in python!
-        return hash(type(self)) \
-            ^ hash(self.border_mode) \
-            ^ hash(self.subsample) \
-            ^ hash(self.logical_img_hw) \
-            ^ hash(self.logical_kern_hw) \
-            ^ hash(self.logical_kern_align_top) \
-            ^ self.version \
-            ^ hash(self.verbose) \
-            ^ hash(self.kshp) \
-            ^ hash(self.imshp) \
-            ^ hash(self.max_threads_dim0)
-
-    def __str__(self):
-        return '%s{%s, %s, %s, %s, %s, %s, %s}' % (
-            self.__class__.__name__,
-            self.border_mode,
-            str(self.subsample),
-            str(self.logical_img_hw),
-            str(self.logical_kern_hw),
-            str(self.logical_kern_align_top),
-            str(self.imshp),
-            str(self.kshp))
-
     def make_node(self, img, kern):
         if img.dtype != "float32" or kern.dtype != "float32":
             raise NotImplementedError("GpuConv currently only work"
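The deleted `__eq__`/`__hash__`/`__str__` boilerplate is exactly what declaring `__props__` generates for an Op. A rough pure-Python approximation of that mechanism (not Theano's actual implementation) shows why listing the attributes once is equivalent:

```python
class PropsMixin:
    """Approximation of what __props__ buys: equality and hashing
    derived automatically from the listed attribute names."""
    __props__ = ()

    def _props(self):
        return tuple(getattr(self, p) for p in self.__props__)

    def __eq__(self, other):
        return type(self) == type(other) and self._props() == other._props()

    def __hash__(self):
        return hash((type(self), self._props()))

class Conv(PropsMixin):
    __props__ = ('border_mode', 'subsample')

    def __init__(self, border_mode, subsample):
        self.border_mode = border_mode
        self.subsample = subsample

assert Conv('valid', (1, 1)) == Conv('valid', (1, 1))
assert hash(Conv('valid', (1, 1))) == hash(Conv('valid', (1, 1)))
assert Conv('valid', (1, 1)) != Conv('full', (1, 1))
```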
@@ -170,13 +127,17 @@ class GpuConv(GpuKernelBase, gof.Op):
             raise TypeError('img must be 4D tensor')
         if kern.type.ndim != 4:
             raise TypeError('kern must be 4D tensor')
-        img = as_gpuarray_variable(img)
-        kern = as_gpuarray_variable(kern)
+        ctx_name = infer_context_name(img, kern)
+        img = as_gpuarray_variable(img, ctx_name)
+        kern = as_gpuarray_variable(kern, ctx_name)
         broadcastable = [img.type.broadcastable[0], kern.type.broadcastable[0],
                          False, False]
-        out = GpuArrayType(img.dtype, broadcastable)()
+        out = GpuArrayType(img.dtype, broadcastable, context_name=ctx_name)()
         return gof.Apply(self, [img, kern], [out])

+    def get_context(self, node):
+        return node.inputs[0].type.context
+
     def flops(self, inputs, outputs):
         """
         Useful with the hack in profilemode to print the MFlops.
@@ -202,22 +163,8 @@ class GpuConv(GpuKernelBase, gof.Op):
     def make_thunk(self, node, storage_map, compute_map, no_recycling):
         node_ = copy.copy(node)
         assert node.op is node_.op
-        if config.gpuarray.sync:
-            raise NotImplementedError("GpuConv do not implement gpuarray.sync Theano flag")
         if node_.op.max_threads_dim0 is None:
-            cuda = theano.sandbox.cuda
-            device_id = cuda.use.device_number
-            if device_id is None:
-                cuda.use("gpu",
-                         force=False,
-                         default_to_move_computation_to_gpu=False,
-                         move_shared_float32_to_gpu=False,
-                         enable_cuda=False,
-                         test_driver=True)
-                device_id = cuda.use.device_number
-            cuda_ndarray = theano.sandbox.cuda.cuda_ndarray.cuda_ndarray
-            prop = cuda_ndarray.device_properties(device_id)
-            node_.op.max_threads_dim0 = prop['maxThreadsDim0']
+            node_.op.max_threads_dim0 = node_.inputs[0].type.context.maxlsize
         return super(GpuConv, node_.op).make_thunk(node_, storage_map,
                                                    compute_map, no_recycling)
@@ -232,9 +179,11 @@ class GpuConv(GpuKernelBase, gof.Op):
     def c_code_cache_version(self):
         # raise this whenever modifying any of the support_code_files
-        return (0, 22)
+        return (0, 23)

     def c_code(self, node, nodename, inp, out_, sub):
+        if node.inputs[0].type.context.kind != "cuda":
+            raise NotImplementedError("GpuConv only works for cuda devices")
         img, kern = inp
         out, = out_
         dx = self.subsample[0]
@@ -302,7 +251,6 @@ class GpuConv(GpuKernelBase, gof.Op):
         """ % locals()
         code += "\n".join([open(os.path.join(os.path.split(__file__)[0], f)).read()
                            for f in ["conv_kernel.cu", "conv_full_kernel.cu"]])
-        kname = "conv_full_load_everything"
         gk = gpuarray.GpuKernel(code, k.name, k.params, **k.flags)
         bin = gk._binary
         bcode = ','.join(hex(ord(c)) for c in bin)
@@ -313,9 +261,12 @@ class GpuConv(GpuKernelBase, gof.Op):
         static const char conv_bcode[] = {%(bcode)s};
         static const char *conv_code = "%(code)s";
         """ % locals()
-        for k in kernels:
-            mod += "static GpuKernel " + k.name + '_' + name + ";\n"
-        mod += open(os.path.join(os.path.split(__file__)[0], "conv.cu")).read()
+        return mod
+
+    def c_support_code_struct(self, node, name):
+        mod = GpuKernelBase.c_support_code_struct(self, node, name)
+        with open(os.path.join(os.path.split(__file__)[0], "conv.cu")) as f:
+            mod += f.read()
         return mod

 @utils.memoize
...
@@ -46,7 +46,7 @@ for (int iter_m=0; iter_m < Os[0]; iter_m++) {
 //Must be the same size as a ptr. We can't use unsigned long as on Windows 64
 //bit, it is 32 bit.
-const uintptr_t COALESCED_ALIGN = 0xFFFFFFFFFFFFFF00; // zero-out the trailing bits of pointers
+const size_t COALESCED_ALIGN = 0xFFFFFFFFFFFFFF00; // zero-out the trailing bits of pointers

 __device__ void load_to_shared(float * dst, const float * src, const int thread_id, int nb_thread, const int N, const bool flipped=false){
     if (nb_thread < 64)
@@ -75,7 +75,7 @@ __device__ void load_to_shared(float * dst, const float * src, const int thread_
     if (thread_id < nb_thread)
     {
         const float * my_src_ptr = (const float *)(
-            ((uintptr_t)src) & COALESCED_ALIGN);
+            ((size_t)src) & COALESCED_ALIGN);
         my_src_ptr += thread_id;
         while (my_src_ptr < src + N)
         {
...
@@ -107,14 +107,14 @@ cudnnHandle_t APPLY_SPECIFIC(_handle);
 #section init_code_struct

 {
-  cuda_enter(pygpu_default_context()->ctx);
+  cuda_enter(CONTEXT->ctx);
   cudnnStatus_t err;
   APPLY_SPECIFIC(_handle) = NULL;
   if ((err = cudnnCreate(&APPLY_SPECIFIC(_handle))) != CUDNN_STATUS_SUCCESS) {
     PyErr_Format(PyExc_RuntimeError, "could not create cuDNN handle: %s",
                  cudnnGetErrorString(err));
-    cuda_exit(pygpu_default_context()->ctx);
+    cuda_exit(CONTEXT->ctx);
     FAIL;
   }
-  cuda_exit(pygpu_default_context()->ctx);
+  cuda_exit(CONTEXT->ctx);
 }
@@ -5,12 +5,12 @@ APPLY_SPECIFIC(conv_fwd)(PyGpuArrayObject *input, PyGpuArrayObject *kerns,
                          PyGpuArrayObject *om,
                          cudnnConvolutionDescriptor_t desc,
                          double alpha, double beta,
-                         PyGpuArrayObject **output) {
+                         PyGpuArrayObject **output,
+                         PyGpuContextObject *c) {
   cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
   float af = alpha, bf = beta;
   void *alpha_p;
   void *beta_p;
-  PyGpuContextObject *c = pygpu_default_context();

   if (PyGpuArray_DIMS(input)[1] != PyGpuArray_DIMS(kerns)[1]) {
     PyErr_SetString(PyExc_ValueError,
...
@@ -4,12 +4,12 @@ int
 APPLY_SPECIFIC(conv_gi)(PyGpuArrayObject *kerns, PyGpuArrayObject *output,
                         PyGpuArrayObject *im,
                         cudnnConvolutionDescriptor_t desc,
-                        double alpha, double beta, PyGpuArrayObject **input) {
+                        double alpha, double beta, PyGpuArrayObject **input,
+                        PyGpuContextObject *c) {
   cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
   float af = alpha, bf = beta;
   void *alpha_p;
   void *beta_p;
-  PyGpuContextObject *c = pygpu_default_context();

   if (PyGpuArray_DIMS(im)[1] != PyGpuArray_DIMS(kerns)[1]) {
     PyErr_SetString(PyExc_ValueError, "images and kernel must have the same "
...
@@ -4,12 +4,12 @@ int
 APPLY_SPECIFIC(conv_gw)(PyGpuArrayObject *input, PyGpuArrayObject *output,
                         PyGpuArrayObject *km,
                         cudnnConvolutionDescriptor_t desc,
-                        double alpha, double beta, PyGpuArrayObject **kerns) {
+                        double alpha, double beta, PyGpuArrayObject **kerns,
+                        PyGpuContextObject *c) {
   cudnnStatus_t err = CUDNN_STATUS_SUCCESS;
   float af = alpha, bf = beta;
   void *alpha_p;
   void *beta_p;
-  PyGpuContextObject *c = pygpu_default_context();
   if (PyGpuArray_DIMS(input)[1] != PyGpuArray_DIMS(km)[1]) {
     PyErr_SetString(PyExc_ValueError,
...
@@ -29,10 +29,10 @@ if (APPLY_SPECIFIC(output) != NULL) { cudnnDestroyTensorDescriptor(APPLY_SPECIFI
 int APPLY_SPECIFIC(dnn_pool)(PyGpuArrayObject *img,
                              cudnnPoolingDescriptor_t desc,
-                             PyGpuArrayObject **out) {
+                             PyGpuArrayObject **out,
+                             PyGpuContextObject *c) {
   cudnnStatus_t err;
   size_t dims[5];
-  PyGpuContextObject *c = pygpu_default_context();
   if (!GpuArray_IS_C_CONTIGUOUS(&img->ga)) {
     PyErr_SetString(PyExc_ValueError, "Only contiguous inputs are supported.");
...
@@ -53,9 +53,9 @@ int APPLY_SPECIFIC(dnn_pool_grad)(PyGpuArrayObject *inp,
                                   PyGpuArrayObject *out,
                                   PyGpuArrayObject *out_grad,
                                   cudnnPoolingDescriptor_t desc,
-                                  PyGpuArrayObject **inp_grad) {
+                                  PyGpuArrayObject **inp_grad,
+                                  PyGpuContextObject *c) {
   cudnnStatus_t err;
-  PyGpuContextObject *c = pygpu_default_context();
   if (!GpuArray_IS_C_CONTIGUOUS(&inp->ga)) {
     PyErr_SetString(PyExc_ValueError, "Only contiguous inputs are supported.");
@@ -81,7 +81,7 @@ int APPLY_SPECIFIC(dnn_pool_grad)(PyGpuArrayObject *inp,
   if (theano_prep_output(inp_grad, PyGpuArray_NDIM(inp),
                          PyGpuArray_DIMS(inp), inp->ga.typecode,
-                         GA_C_ORDER, pygpu_default_context()) != 0) {
+                         GA_C_ORDER, c) != 0) {
     return 1;
   }
...
@@ -34,9 +34,9 @@ if (APPLY_SPECIFIC(output) != NULL)
 #section support_code_struct
 int APPLY_SPECIFIC(softmax)(PyGpuArrayObject *x,
-                            PyGpuArrayObject **out) {
+                            PyGpuArrayObject **out,
+                            PyGpuContextObject *c) {
   cudnnStatus_t err;
-  PyGpuContextObject *c = pygpu_default_context();
   if (c_set_tensorNd(x, APPLY_SPECIFIC(input)) != 0)
     return 1;
...
@@ -45,9 +45,9 @@ if (APPLY_SPECIFIC(dx) != NULL)
 int APPLY_SPECIFIC(softmax_grad)(PyGpuArrayObject *dy,
                                  PyGpuArrayObject *sm,
-                                 PyGpuArrayObject **dx) {
+                                 PyGpuArrayObject **dx,
+                                 PyGpuContextObject *c) {
   cudnnStatus_t err;
-  PyGpuContextObject *c = pygpu_default_context();
   if (c_set_tensorNd(dy, APPLY_SPECIFIC(dy)) != 0)
     return 1;
...
@@ -2,7 +2,7 @@
 /* Why do we need this? */
 size_t dim = 2048 * 32;
-rand_buf = pygpu_empty(1, &dim, GA_UINT, GA_C_ORDER, pygpu_default_context(),
+rand_buf = pygpu_empty(1, &dim, GA_UINT, GA_C_ORDER, CONTEXT,
                        Py_None);
 if (rand_buf == NULL) {
   FAIL;
@@ -14,7 +14,8 @@ PyGpuArrayObject *rand_buf;
 int gemm16(PyGpuArrayObject *C, float alpha,
            PyGpuArrayObject *A, PyGpuArrayObject *B,
-           float beta, PyGpuArrayObject **out) {
+           float beta, PyGpuArrayObject **out,
+           PyGpuContextObject *c) {
   PyGpuArrayObject *_A = NULL;
   PyGpuArrayObject *_B = NULL;
   GpuKernel *gk;
...
@@ -10,7 +10,8 @@ try:
 except ImportError:
     pass

-from .basic_ops import as_gpuarray_variable, GpuKernelBase, Kernel
+from .basic_ops import (as_gpuarray_variable, GpuKernelBase, Kernel,
+                        infer_context_name)
 from .opt import register_opt as register_gpu_opt, op_lifter
 from .type import GpuArrayType
@@ -25,7 +26,7 @@ class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
         self.mode = mode

     def make_node(self, ten4, neib_shape, neib_step):
-        ten4 = as_gpuarray_variable(ten4)
+        ten4 = as_gpuarray_variable(ten4, infer_context_name(ten4))
         neib_shape = T.as_tensor_variable(neib_shape)
         neib_step = T.as_tensor_variable(neib_step)
@@ -37,7 +38,11 @@ class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
         return Apply(self, [ten4, neib_shape, neib_step],
                      [GpuArrayType(broadcastable=(False, False),
-                                   dtype=ten4.type.dtype)()])
+                                   dtype=ten4.type.dtype,
+                                   context_name=ten4.type.context_name)()])
+
+    def get_context(self, node):
+        return node.inputs[0].type.context

     def c_code_cache_version(self):
         return (11,)
@@ -56,7 +61,7 @@ class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
             kname = "k_multi_warp_less"
             k_var = "k_multi_warp_less_" + nodename
             code = """
-            //a version that use less register but don't work in all case.
+            // a version that uses less registers but doesn't work in all cases.
             KERNEL void %(kname)s(
                 const int nb_batch,
                 const int nb_stack,
@@ -233,6 +238,8 @@ class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
         return kernels

     def c_code(self, node, name, inp, out, sub):
+        if node.inputs[0].type.context.kind != 'cuda':
+            raise NotImplementedError("cuda only")
         dtype_ten4 = node.inputs[0].dtype
         dtype_neib_shape = node.inputs[1].dtype
         dtype_neib_step = node.inputs[2].dtype
@@ -243,6 +250,7 @@ class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
         ten4, neib_shape, neib_step = inp
         z, = out
         fail = sub['fail']
+        ctx = sub['context']
         mode = self.mode
         err_check = """
            if (err != GA_NO_ERROR) {
@@ -369,8 +377,7 @@ class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
         dims[0] = z_dim0;
         dims[1] = z_dim1;
         %(z)s = pygpu_empty(2, dims, %(typecode_z)s,
-                            GA_C_ORDER, pygpu_default_context(),
-                            Py_None);
+                            GA_C_ORDER, %(ctx)s, Py_None);
         if (!%(z)s)
         {
             PyErr_SetString(PyExc_MemoryError, "GpuImages2Neibs:"
@@ -453,7 +460,7 @@ class GpuImages2Neibs(GpuKernelBase, Images2Neibs, Op):
 @op_lifter([Images2Neibs])
-def use_gpu_images2neibs(node):
+def use_gpu_images2neibs(node, context_name):
     if node.op.mode in ['valid', 'ignore_borders', 'wrap_centered']:
         return GpuImages2Neibs(node.op.mode)
...
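The recurring change in the hunks above is that `as_gpuarray_variable` now receives an explicit context name, obtained from the op's inputs via `infer_context_name`, instead of every op silently using the default GPU context. A minimal, hypothetical sketch of that inference logic follows; `GpuVar` is a stand-in class, not Theano's actual variable type, and the real `infer_context_name` has more validation:

```python
# Hypothetical sketch of context-name inference: scan the inputs in
# order and return the context name of the first GPU-typed variable.
class GpuVar(object):
    def __init__(self, context_name):
        self.context_name = context_name


def infer_context_name(*variables):
    for v in variables:
        if isinstance(v, GpuVar):
            return v.context_name
    # No GPU variable among the inputs: fall back to the default context.
    return None


# The first GPU input determines the context the op will run in.
print(infer_context_name(GpuVar('dev0'), GpuVar('dev1')))
```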
@@ -8,10 +8,10 @@ from theano.gof import local_optimizer, COp
 from theano.scalar import as_scalar, constant

 from . import opt
-from .basic_ops import (as_gpuarray_variable, GpuAllocEmpty)
+from .basic_ops import (as_gpuarray_variable, GpuAllocEmpty,
+                        infer_context_name)
+from .type import gpu_context_type
 from .opt_util import alpha_merge, output_merge
-from .pycuda_helper import ensure_pycuda_context

 try:
     from nervanagpu.nervanagpu import GPUTensor, NervanaGPU
@@ -43,6 +43,7 @@ def ensure_float(val, name):
 class Gemm16(COp):
     __props__ = ('relu', 'inplace')
     _f16_ok = True
+    context_type = gpu_context_type

     KERN_NAMES = ('nn_128x128', 'nn_128x64', 'nn_128x32',
                   'nn_vec_128x128', 'nn_vec_128x64', 'nn_vec_128x32',
                   'tn_128x128', 'tn_128x64', 'tn_128x32',
@@ -61,10 +62,11 @@ class Gemm16(COp):
     def make_node(self, C, alpha, A, B, beta):
         if GPUTensor is None:
             raise RuntimeError("Can't use Gemm16: nervanagpu not found")
-        A = as_gpuarray_variable(A)
-        B = as_gpuarray_variable(B)
-        C = as_gpuarray_variable(C)
+        ctx_name = infer_context_name(C, A, B)
+        A = as_gpuarray_variable(A, ctx_name)
+        B = as_gpuarray_variable(B, ctx_name)
+        C = as_gpuarray_variable(C, ctx_name)
         alpha = ensure_float(alpha, 'alpha')
         beta = ensure_float(beta, 'beta')
@@ -73,27 +75,8 @@ class Gemm16(COp):
         return Apply(self, [C, alpha, A, B, beta], [C.type()])

-    def perform(self, node, inputs, outputs):
-        ensure_pycuda_context()
-        C, alpha, A, B, beta = inputs
-        # The nervana code does not support the case where both inputs
-        # are trans, so we need to copy one if them if that is the
-        # case. We copy the smaller one.
-        if A.flags.f_contiguous and B.flags.f_contiguous:
-            if A.size < B.size:
-                A = A.copy()
-            else:
-                B = B.copy()
-        inplace = self.inplace
-        if inplace and not C.flags.c_contiguous:
-            inplace = False
-        if not inplace:
-            C = C.copy()
-        At = to_gputensor(A)
-        Bt = to_gputensor(B)
-        Ct = to_gputensor(C)
-        nerv.dot(At, Bt, Ct, alpha=alpha, beta=beta, relu=False)
-        outputs[0][0] = C
+    def get_context(self, node):
+        return node.inputs[0].type.context

     def c_headers(self):
         return ['gpuarray/types.h', 'numpy_compat.h', 'gpuarray_helper.h',
@@ -145,7 +128,7 @@ if (GpuKernel_init(&k_%(name)s, c->ops, c->ctx, 1, &bcode, &sz,
         codel.append("memset(&k_{0}, 0, sizeof(GpuKernel));".format(name))
         codel.append("const char *bcode;")
         codel.append("size_t sz;")
-        codel.append("PyGpuContextObject *c = pygpu_default_context();")
+        codel.append("PyGpuContextObject *c = %s;" % (sub['context'],))
         codel.append("int types[13] = {GA_BUFFER, GA_BUFFER, GA_BUFFER, "
                      "GA_BUFFER, GA_INT, GA_INT, GA_INT, GA_INT, GA_INT, "
                      "GA_INT, GA_FLOAT, GA_FLOAT, GA_INT};")
@@ -162,7 +145,7 @@ if (GpuKernel_init(&k_%(name)s, c->ops, c->ctx, 1, &bcode, &sz,
 @opt.register_opt()
 @opt.op_lifter([tensor.Dot])
-def local_dot_to_gemm16(node):
+def local_dot_to_gemm16(node, ctx_name):
     if nerv is None:
         return
     A = node.inputs[0]
@@ -170,7 +153,7 @@ def local_dot_to_gemm16(node):
     if (A.ndim == 2 and B.ndim == 2 and
             A.dtype == 'float16' and B.dtype == 'float16'):
         fgraph = node.inputs[0].fgraph
-        C = GpuAllocEmpty(dtype='float16')(
+        C = GpuAllocEmpty(dtype='float16', context_name=ctx_name)(
             shape_i(A, 0, fgraph), shape_i(B, 1, fgraph))
         return Gemm16()(C, 1.0, A, B, 0.0)
...
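Ops whose generated code depends on a GPU context, such as `Gemm16` above, now declare a `context_type` and a `get_context` method so the framework can hand the right context to the compiled code at runtime. A toy sketch of that pattern, using stand-in `Fake*` classes rather than Theano's real node and type objects:

```python
# Sketch of the get_context pattern: an op reports which GPU context its
# generated code should run in, taken from its first input's type.
class FakeContext(object):
    def __init__(self, name):
        self.name = name


class FakeType(object):
    def __init__(self, context):
        self.context = context


class FakeVar(object):
    def __init__(self, context):
        self.type = FakeType(context)


class FakeNode(object):
    def __init__(self, inputs):
        self.inputs = inputs


class Gemm16Sketch(object):
    def get_context(self, node):
        # Mirrors the diff: the context comes from the first input.
        return node.inputs[0].type.context


ctx = FakeContext('cuda0')
node = FakeNode([FakeVar(ctx)])
assert Gemm16Sketch().get_context(node) is ctx
```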
@@ -10,7 +10,8 @@ try:
 except ImportError:
     pass

-from .basic_ops import (as_gpuarray_variable, GpuKernelBase, Kernel)
+from .basic_ops import (as_gpuarray_variable, GpuKernelBase, Kernel,
+                        infer_context_name)
 from .type import GpuArrayType
 from .kernel_codegen import (nvcc_kernel,
                              inline_softmax,
@@ -23,23 +24,26 @@ class GpuCrossentropySoftmaxArgmax1HotWithBias(GpuKernelBase, Op):
     Implement CrossentropySoftmaxArgmax1HotWithBias on the gpu.

     """
     nin = 3
     nout = 3
     __props__ = ()
     _f16_ok = True

     def make_node(self, x, b, y_idx):
-        # N.B. won't work when we don't cast y_idx to float anymore
-        x = as_gpuarray_variable(x)
-        b = as_gpuarray_variable(b)
-        y_idx = as_gpuarray_variable(y_idx)
+        ctx_name = infer_context_name(x, b, y_idx)
+        x = as_gpuarray_variable(x, ctx_name)
+        b = as_gpuarray_variable(b, ctx_name)
+        y_idx = as_gpuarray_variable(y_idx, ctx_name)
         nll = GpuArrayType(x.type.dtype,
-                           y_idx.type.broadcastable)()
+                           y_idx.type.broadcastable,
+                           context_name=ctx_name)()
         sm = x.type()
         am = y_idx.type()
         return Apply(self, [x, b, y_idx], [nll, sm, am])

+    def get_context(self, node):
+        return node.inputs[0].type.context
+
     def c_headers(self):
         return ['<numpy_compat.h>', '<gpuarray/types.h>']
@@ -144,6 +148,8 @@ class GpuCrossentropySoftmaxArgmax1HotWithBias(GpuKernelBase, Op):
                        flags=flags, objvar=k_var)]

     def c_code(self, node, nodename, inp, out, sub):
+        if node.inputs[0].type.context.kind != 'cuda':
+            raise NotImplementedError('cuda only')
         typecode_x = pygpu.gpuarray.dtype_to_typecode(node.inputs[0].dtype)
         typecode_b = pygpu.gpuarray.dtype_to_typecode(node.inputs[1].dtype)
         typecode_y_idx = pygpu.gpuarray.dtype_to_typecode(node.inputs[2].dtype)
@@ -163,6 +169,7 @@ class GpuCrossentropySoftmaxArgmax1HotWithBias(GpuKernelBase, Op):
         dtype_am = node.outputs[2].dtype
         classname = self.__class__.__name__
         fail = sub['fail']
+        ctx = sub['context']
         k_var = "k_xent_sm_1hot_bias_%(nodename)s" % locals()
         err_check = """
            if (err != GA_NO_ERROR) {
@@ -214,9 +221,8 @@ class GpuCrossentropySoftmaxArgmax1HotWithBias(GpuKernelBase, Op):
         {
             Py_XDECREF(%(nll)s);
             %(nll)s = pygpu_empty(1, PyGpuArray_DIMS(%(y_idx)s),
-                                  %(typecode_x)s,
-                                  GA_C_ORDER,
-                                  pygpu_default_context(), Py_None);
+                                  %(typecode_x)s, GA_C_ORDER, %(ctx)s,
+                                  Py_None);
             if (!%(nll)s) {
                 %(fail)s
             }
@@ -229,9 +235,8 @@ class GpuCrossentropySoftmaxArgmax1HotWithBias(GpuKernelBase, Op):
         {
             Py_XDECREF(%(sm)s);
             %(sm)s = pygpu_empty(2, PyGpuArray_DIMS(%(x)s),
-                                 %(typecode_b)s,
-                                 GA_C_ORDER,
-                                 pygpu_default_context(), Py_None);
+                                 %(typecode_b)s, GA_C_ORDER,
+                                 %(ctx)s, Py_None);
             if(!%(sm)s)
             {
                 PyErr_SetString(PyExc_MemoryError,
@@ -246,9 +251,8 @@ class GpuCrossentropySoftmaxArgmax1HotWithBias(GpuKernelBase, Op):
         {
             Py_XDECREF(%(am)s);
             %(am)s = pygpu_empty(1, PyGpuArray_DIMS(%(y_idx)s),
-                                 %(typecode_y_idx)s,
-                                 GA_C_ORDER,
-                                 pygpu_default_context(), Py_None);
+                                 %(typecode_y_idx)s, GA_C_ORDER,
+                                 %(ctx)s, Py_None);
             if(!%(am)s)
             {
                 PyErr_SetString(PyExc_MemoryError,
@@ -306,18 +310,21 @@ class GpuCrossentropySoftmax1HotWithBiasDx(GpuKernelBase, Op):
     Gradient wrt x of the CrossentropySoftmax1Hot Op.

     """
     nin = 3
     nout = 1
     __props__ = ()
     _f16_ok = True

     def make_node(self, dnll, sm, y_idx):
-        dnll = as_gpuarray_variable(dnll)
-        sm = as_gpuarray_variable(sm)
-        y_idx = as_gpuarray_variable(y_idx)
+        ctx_name = infer_context_name(dnll, sm, y_idx)
+        dnll = as_gpuarray_variable(dnll, ctx_name)
+        sm = as_gpuarray_variable(sm, ctx_name)
+        y_idx = as_gpuarray_variable(y_idx, ctx_name)
         return Apply(self, [dnll, sm, y_idx], [sm.type()])

+    def get_context(self, node):
+        return node.inputs[0].type.context
+
     def c_code_cache_version(self):
         return (11,)
@@ -325,6 +332,8 @@ class GpuCrossentropySoftmax1HotWithBiasDx(GpuKernelBase, Op):
         return ['<numpy_compat.h>', '<gpuarray/types.h>']

     def c_code(self, node, nodename, inp, out, sub):
+        if node.inputs[0].type.context.kind != 'cuda':
+            raise NotImplementedError("cuda only")
         typecode_dx = pygpu.gpuarray.dtype_to_typecode(node.outputs[0].dtype)
         itemsize_dnll = numpy.dtype(node.inputs[0].dtype).itemsize
         itemsize_sm = numpy.dtype(node.inputs[1].dtype).itemsize
@@ -338,6 +347,7 @@ class GpuCrossentropySoftmax1HotWithBiasDx(GpuKernelBase, Op):
         dnll, sm, y_idx = inp
         dx, = out
         fail = sub['fail']
+        ctx = sub['context']
         k_var = "kCrossEntropySoftmax1HotWithBiasDx_" + nodename
         err_check = """
            if (err != GA_NO_ERROR) {
@@ -403,9 +413,8 @@ class GpuCrossentropySoftmax1HotWithBiasDx(GpuKernelBase, Op):
         {
             Py_XDECREF(%(dx)s);
             %(dx)s = pygpu_empty(2, PyGpuArray_DIMS(%(sm)s),
-                                 %(typecode_dx)s,
-                                 GA_C_ORDER,
-                                 pygpu_default_context(), Py_None);
+                                 %(typecode_dx)s, GA_C_ORDER,
+                                 %(ctx)s, Py_None);
             if (!%(dx)s) {
                 %(fail)s
             }
@@ -512,14 +521,16 @@ class GpuSoftmax(GpuKernelBase, Op):
     Implement Softmax on the gpu.

     """
     __props__ = ()
     _f16_ok = True

     def make_node(self, x):
-        x = as_gpuarray_variable(x)
+        x = as_gpuarray_variable(x, infer_context_name(x))
         return Apply(self, [x], [x.type()])

+    def get_context(self, node):
+        return node.inputs[0].type.context
+
     def infer_shape(self, node, shape):
         return shape
@@ -530,6 +541,8 @@ class GpuSoftmax(GpuKernelBase, Op):
         return ['<numpy_compat.h>', '<gpuarray/types.h>']

     def c_code(self, node, nodename, inp, out, sub):
+        if node.inputs[0].type.context.kind != 'cuda':
+            raise NotImplementedError("cuda only")
         dtype_x = node.inputs[0].dtype
         work_x = work_dtype(dtype_x)
         dtype_z = node.outputs[0].dtype
@@ -539,6 +552,7 @@ class GpuSoftmax(GpuKernelBase, Op):
         x, = inp
         z, = out
         fail = sub['fail']
+        ctx = sub['context']
         err_check = """
            if (err != GA_NO_ERROR) {
                PyErr_Format(PyExc_RuntimeError, fmt_str, msg);
@@ -568,9 +582,8 @@ class GpuSoftmax(GpuKernelBase, Op):
         {
             Py_XDECREF(%(z)s);
             %(z)s = pygpu_empty(2, PyGpuArray_DIMS(%(x)s),
-                                %(typecode)s,
-                                GA_C_ORDER,
-                                pygpu_default_context(), Py_None);
+                                %(typecode)s, GA_C_ORDER,
+                                %(ctx)s, Py_None);
             if (!%(z)s) {
                 %(fail)s
             }
@@ -698,22 +711,25 @@ class GpuSoftmax(GpuKernelBase, Op):
 gpu_softmax = GpuSoftmax()


-class GpuSoftmaxWithBias (GpuKernelBase, Op):
+class GpuSoftmaxWithBias(GpuKernelBase, Op):
     """
     Implement SoftmaxWithBias on the gpu.

     """
     nin = 2
     nout = 1
     __props__ = ()
     _f16_ok = True

     def make_node(self, x, b):
-        x = as_gpuarray_variable(x)
-        b = as_gpuarray_variable(b)
+        ctx_name = infer_context_name(x, b)
+        x = as_gpuarray_variable(x, ctx_name)
+        b = as_gpuarray_variable(b, ctx_name)
         return Apply(self, [x, b], [x.type()])

+    def get_context(self, node):
+        return node.inputs[0].type.context
+
     def infer_shape(self, node, shape):
         return [shape[0]]
@@ -724,6 +740,8 @@ class GpuSoftmaxWithBias (GpuKernelBase, Op):
         return ['<numpy_compat.h>', '<gpuarray/types.h>']

     def c_code(self, node, nodename, inp, out, sub):
+        if node.inputs[0].type.context.kind != 'cuda':
+            raise NotImplementedError('cuda only')
         dtype_x = node.inputs[0].dtype
         dtype_b = node.inputs[1].dtype
         dtype_z = node.outputs[0].dtype
@@ -735,6 +753,7 @@ class GpuSoftmaxWithBias (GpuKernelBase, Op):
         x, b = inp
         z, = out
         fail = sub['fail']
+        ctx = sub['context']
         err_check = """
            if (err != GA_NO_ERROR) {
                PyErr_Format(PyExc_RuntimeError, fmt_str, msg);
@@ -777,9 +796,8 @@ class GpuSoftmaxWithBias (GpuKernelBase, Op):
         {
             Py_XDECREF(%(z)s);
             %(z)s = pygpu_empty(2, PyGpuArray_DIMS(%(x)s),
-                                %(typecode)s,
-                                GA_C_ORDER,
-                                pygpu_default_context(), Py_None);
+                                %(typecode)s, GA_C_ORDER,
+                                %(ctx)s, Py_None);
             if (!%(z)s) {
                 %(fail)s
             }
...
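The `c_code` changes above all follow the same mechanics: the framework supplies a C expression for the context in `sub['context']`, and the op splices it into its C template with %-formatting, replacing the old hard-coded `pygpu_default_context()` call. A simplified, self-contained illustration of that string substitution (no GPU or Theano involved; the values are placeholder strings):

```python
# Simplified view of how c_code splices the context into a C template.
# 'ctx' stands for sub['context']; in real generated code it is a C
# expression naming a PyGpuContextObject*, here just a placeholder.
sub = {'fail': 'return 1;', 'context': 'my_context'}
ctx = sub['context']
fail = sub['fail']

template = """
%(z)s = pygpu_empty(2, dims, typecode, GA_C_ORDER, %(ctx)s, Py_None);
if (!%(z)s) { %(fail)s }
"""

# Substitute the per-apply names into the template, as c_code does.
code = template % {'z': 'output_var', 'ctx': ctx, 'fail': fail}
print(code)
```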
@@ -294,7 +294,7 @@ def inplace_allocempty(op, idx):
     function can be as simple as:

         def maker(node, inputs):
-            return node.op.__class__(inplace=True)(*inputs)
+            return [node.op.__class__(inplace=True)(*inputs)]

     Parameters
     ----------
@@ -320,7 +320,8 @@ def inplace_allocempty(op, idx):
             if (alloc.owner and
                     isinstance(alloc.owner.op, GpuAllocEmpty) and
                     len(alloc.clients) > 1):
-                alloc_op = GpuAllocEmpty(alloc.owner.op.dtype)
+                alloc_op = GpuAllocEmpty(alloc.owner.op.dtype,
+                                         alloc.owner.op.context_name)
                 inputs[idx] = alloc_op(*alloc.owner.inputs)
             return maker(node, inputs)
         return opt
...
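The docstring fix in `inplace_allocempty` reflects the contract that a local optimizer's `maker` must return a *list* of replacement variables, not a bare variable, so the machinery can match them against the node's outputs. A toy sketch of the corrected contract, with stand-in classes instead of Theano ops:

```python
# Toy sketch: an optimizer "maker" must return a list of outputs so the
# replacement machinery can zip them against the node's outputs.
class InplaceOp(object):
    def __init__(self, inplace=False):
        self.inplace = inplace

    def __call__(self, *inputs):
        # Stand-in for building a new Apply node; records the flag.
        return ('result', self.inplace, inputs)


class FakeNode(object):
    op = InplaceOp(inplace=False)


def maker(node, inputs):
    # Corrected form from the diff: wrap the result in a list.
    return [node.op.__class__(inplace=True)(*inputs)]


out = maker(FakeNode(), ('a', 'b'))
assert isinstance(out, list) and out[0][1] is True
```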
-try:
-    from pycuda.driver import Context
-    if not hasattr(Context, 'attach'):
-        raise ImportError('too old')
-except ImportError:
-    Context = None
-
-pycuda_initialized = False
-pycuda_context = None
-
-
-def ensure_pycuda_context():
-    global pycuda_context, pycuda_initialized
-    if not pycuda_initialized:
-        if Context is None:
-            raise RuntimeError("PyCUDA not found or too old.")
-        else:
-            pycuda_context = Context.attach()
-            import atexit
-            atexit.register(pycuda_context.detach)
-            pycuda_initialized = True
-    return pycuda_context
 from __future__ import print_function
+import copy
 import os
-import copy
 import numpy
 import theano
-from theano import tensor, gof, config
-from theano.gof.utils import MethodNotDefined
+from theano import tensor, gof
 from six.moves import StringIO
 from theano.tensor.subtensor import IncSubtensor, Subtensor, get_idx_list
 import theano.tensor.inplace
@@ -19,7 +18,8 @@ except ImportError:
     pass

 from .type import GpuArrayType
-from .basic_ops import (as_gpuarray_variable, HideC, GpuKernelBase, Kernel)
+from .basic_ops import (as_gpuarray_variable, HideC, GpuKernelBase, Kernel,
+                        infer_context_name)
 from .elemwise import GpuElemwise
@@ -27,10 +27,12 @@ class GpuSubtensor(HideC, Subtensor):
     _f16_ok = True

     def make_node(self, x, *inputs):
+        ctx_name = infer_context_name(x)
         rval = tensor.Subtensor.make_node(self, x, *inputs)
         otype = GpuArrayType(dtype=rval.outputs[0].type.dtype,
-                             broadcastable=rval.outputs[0].type.broadcastable)
-        x = as_gpuarray_variable(x)
+                             broadcastable=rval.outputs[0].type.broadcastable,
+                             context_name=ctx_name)
+        x = as_gpuarray_variable(x, ctx_name)
         return gof.Apply(self, [x] + rval.inputs[1:], [otype()])

     def perform(self, node, inputs, out_):
@@ -191,14 +193,18 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
         return self.iadd_node.op.gpu_kernels(self.iadd_node, subname)

     def make_node(self, x, y, *inputs):
-        x = as_gpuarray_variable(x)
-        y = as_gpuarray_variable(y)
+        ctx_name = infer_context_name(x, y)
+        x = as_gpuarray_variable(x, ctx_name)
+        y = as_gpuarray_variable(y, ctx_name)
         rval = tensor.IncSubtensor.make_node(self, x, y, *inputs)
         op = copy.copy(self)
         ret = gof.Apply(op, [x, y] + rval.inputs[2:], [x.type()])
         op.create_iadd_node(ret)
         return ret

+    def get_context(self, node):
+        return node.outputs[0].type.context
+
     def create_iadd_node(self, node):
         # We store a iadd_node in the op that contain the info needed
         # for the inplace add.
@@ -210,7 +216,7 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
         iadd_node = gop(xview, y).owner
         self.iadd_node = iadd_node

-    def perform(self, node, inputs, out_):
+    def perform(self, node, inputs, out_, ctx):
         out, = out_
         x, y = inputs[:2]
         indices = list(reversed(inputs[2:]))
...@@ -321,7 +327,7 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor): ...@@ -321,7 +327,7 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
%(view_ndim)s, %(view_ndim)s,
dims, dims,
xview_strides, xview_strides,
pygpu_default_context(), %(x)s->context,
1, 1,
(PyObject *)%(x)s, (PyObject *)%(x)s,
(PyObject *)&PyGpuArrayType); (PyObject *)&PyGpuArrayType);
...@@ -355,10 +361,10 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor): ...@@ -355,10 +361,10 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
""" """
return """GpuArray_setarray(&%(view)s->ga, &%(source)s->ga)""" % locals() return """GpuArray_setarray(&%(view)s->ga, &%(source)s->ga)""" % locals()
def c_support_code_apply(self, node, nodename): def c_support_code_struct(self, node, nodename):
gop = self.iadd_node.op gop = self.iadd_node.op
sub_name = nodename + "_add_to_zview" sub_name = nodename + "_add_to_zview"
ret = gop.c_support_code_apply(self.iadd_node, sub_name) ret = gop.c_support_code_struct(self.iadd_node, sub_name)
ret += """ ret += """
PyGpuArrayObject* inc_sub_iadd_%(nodename)s(PyGpuArrayObject* dst, PyGpuArrayObject* inc_sub_iadd_%(nodename)s(PyGpuArrayObject* dst,
PyGpuArrayObject* src){ PyGpuArrayObject* src){
...@@ -366,10 +372,11 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor): ...@@ -366,10 +372,11 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):
""" % locals() """ % locals()
inputs = ["dst", "src"] inputs = ["dst", "src"]
outputs = ["ret"] outputs = ["ret"]
sub = {"fail": "return NULL;"} sub = {"fail": "return NULL;", "context": "dst->context"}
ret += gop.c_code(self.iadd_node, sub_name, inputs, outputs, sub) ret += gop.c_code(self.iadd_node, sub_name, inputs, outputs, sub)
ret += """ ret += """
return dst; return ret;
} }
""" """
return ret return ret
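The support code above is generated by filling `%(name)s` placeholders in a C source template with `% locals()`. A minimal standalone sketch of that mechanism (the helper name and the `"node_42"` value are illustrative, not part of the real op):

```python
def make_support_code(nodename):
    # Sketch of the %-template mechanism used by c_support_code_struct:
    # placeholders like %(nodename)s in the C source are filled from locals().
    return """
PyGpuArrayObject* inc_sub_iadd_%(nodename)s(PyGpuArrayObject* dst,
                                            PyGpuArrayObject* src);
""" % locals()

code = make_support_code("node_42")
# The generated declaration carries the node-specific suffix, so support
# code from several applies of the op can coexist in one C file.
assert "inc_sub_iadd_node_42" in code
```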
@@ -399,7 +406,8 @@ class GpuIncSubtensor(GpuKernelBase, IncSubtensor):

 class GpuAdvancedSubtensor1(HideC, tensor.AdvancedSubtensor1):
     def make_node(self, x, ilist):
-        x_ = as_gpuarray_variable(x)
+        ctx_name = infer_context_name(x, ilist)
+        x_ = as_gpuarray_variable(x, ctx_name)
         ilist__ = tensor.as_tensor_variable(ilist)
         if ilist__.type.dtype[:3] not in ('int', 'uin'):
@@ -407,7 +415,7 @@ class GpuAdvancedSubtensor1(HideC, tensor.AdvancedSubtensor1):
         if ilist__.type.dtype != 'int64':
             ilist__ = tensor.cast(ilist__, 'int64')
-        ilist_ = as_gpuarray_variable(ilist__)
+        ilist_ = as_gpuarray_variable(ilist__, ctx_name)
         if ilist_.type.dtype != 'int64':
             raise TypeError('index must be int64')
@@ -419,6 +427,7 @@ class GpuAdvancedSubtensor1(HideC, tensor.AdvancedSubtensor1):
         bcast = ilist_.broadcastable + x_.broadcastable[1:]
         return gof.Apply(self, [x_, ilist_],
                          [GpuArrayType(dtype=x.dtype,
+                                       context_name=ctx_name,
                                        broadcastable=bcast)()])
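`make_node` above normalizes the index dtype: any dtype string that does not start with `'int'` or `'uin'` is rejected, and narrower integer types are upcast to `int64`. A self-contained sketch of that rule (the function name is illustrative, not Theano API):

```python
def normalize_index_dtype(dtype):
    # Mirrors the make_node check: only integer index dtypes are allowed
    # ('uin' catches 'uint8', 'uint16', ...), and all are widened to int64.
    if dtype[:3] not in ('int', 'uin'):
        raise TypeError('index must be integers')
    return 'int64' if dtype != 'int64' else dtype

assert normalize_index_dtype('int32') == 'int64'
assert normalize_index_dtype('uint8') == 'int64'
assert normalize_index_dtype('int64') == 'int64'
```

A float dtype such as `'float32'` raises `TypeError`, matching the behaviour the op relies on.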
     def perform(self, node, inp, out_):
@@ -475,8 +484,9 @@ class GpuAdvancedIncSubtensor1(HideC, tensor.AdvancedIncSubtensor1):
     """
     def make_node(self, x, y, ilist):
-        x_ = as_gpuarray_variable(x)
-        y_ = as_gpuarray_variable(y)
+        ctx_name = infer_context_name(x, y)
+        x_ = as_gpuarray_variable(x, ctx_name)
+        y_ = as_gpuarray_variable(y, ctx_name)
         ilist_ = tensor.as_tensor_variable(ilist)
         assert x_.type.dtype == y_.type.dtype
@@ -567,16 +577,16 @@ class GpuAdvancedIncSubtensor1_dev20(GpuKernelBase, GpuAdvancedIncSubtensor1):
     only avail on compute capability 2.0 and more recent.
     """
     _f16_ok = True

     def make_node(self, x, y, ilist):
         """It defer from GpuAdvancedIncSubtensor1 in that it make sure
         the index are of type long.
         """
-        x_ = as_gpuarray_variable(x)
-        y_ = as_gpuarray_variable(y)
-        ilist_ = as_gpuarray_variable(ilist)
+        ctx_name = infer_context_name(x, y, ilist)
+        x_ = as_gpuarray_variable(x, ctx_name)
+        y_ = as_gpuarray_variable(y, ctx_name)
+        ilist_ = as_gpuarray_variable(ilist, ctx_name)
         assert x_.type.dtype == y_.type.dtype
         assert x_.type.ndim >= y_.type.ndim
@@ -599,32 +609,30 @@ class GpuAdvancedIncSubtensor1_dev20(GpuKernelBase, GpuAdvancedIncSubtensor1):
         return gof.Apply(self, [x_, y_, ilist_], [x_.type()])
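The `make_node` implementations above first derive one context name shared by all the op's GPU inputs before converting them. A simplified, self-contained sketch of what such inference could look like (an illustration only, not the real `infer_context_name`):

```python
def infer_context_name_sketch(*context_names):
    # Pick the context name shared by all GPU variables; None entries
    # (host tensors that will be transferred) do not constrain the choice.
    names = set(n for n in context_names if n is not None)
    if len(names) > 1:
        raise ValueError("inputs live on different contexts: %r" % sorted(names))
    return names.pop() if names else None

assert infer_context_name_sketch('dev0', None, 'dev0') == 'dev0'
assert infer_context_name_sketch(None, None) is None
```

Mixing variables from two distinct contexts raises, which is the safe behaviour: an op cannot run with operands on different devices.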
+    def get_context(self, node):
+        return node.outputs[0].type.context
+
+    def perform(self, node, inp, out, ctx):
+        return super(GpuAdvancedIncSubtensor1_dev20, self).perform(node, inp, out)
+
     def c_code_cache_version(self):
         return (6,)

     def c_headers(self):
-        if pygpu.get_default_context().kind == 'opencl':
-            raise MethodNotDefined('cuda only')
-        return ['cuda.h', '<numpy_compat.h>', '<gpuarray_helper.h>',
+        return ['<numpy_compat.h>', '<gpuarray_helper.h>',
                 '<gpuarray/types.h>']

     def c_header_dirs(self):
-        if pygpu.get_default_context().kind == 'opencl':
-            raise MethodNotDefined('cuda only')
-        cuda_root = config.cuda.root
-        res = [os.path.dirname(__file__)]
-        if cuda_root:
-            res.append(os.path.join(cuda_root, 'include'))
-        return res
+        return [os.path.dirname(__file__)]

     def c_code(self, node, name, inputs, outputs, sub):
-        active_device_no = theano.sandbox.cuda.active_device_number()
-        device_properties = theano.sandbox.cuda.device_properties
-        compute_capability = device_properties(active_device_no)['major']
-        if ((self.set_instead_of_inc) or
-            (node.inputs[0].ndim != node.inputs[1].ndim) or
-            (node.inputs[0].ndim != 2) or
-            (compute_capability < 2)):
+        ctx = self.get_context(node)
+        if ctx.kind != 'cuda':
+            raise NotImplementedError("cuda only")
+        if (self.set_instead_of_inc or
+                node.inputs[0].ndim != node.inputs[1].ndim or
+                node.inputs[0].ndim != 2 or
+                ctx.bin_id[-2] < '2'):
             raise NotImplementedError("This case does not have C code yet.")
         x = inputs[0]
@@ -754,7 +762,7 @@ __device__ ga_half atomicAdd(ga_half *addr, ga_half val) {
         return [Kernel(code=code, name=kname, params=params,
                        flags=flags, objvar=k_var)]
-    def c_support_code_apply(self, node, nodename):
+    def c_support_code_struct(self, node, nodename):
         dtype_x = node.inputs[0].dtype
         dtype_y = node.inputs[1].dtype
         dtype_ind = node.inputs[2].dtype
@@ -765,7 +773,7 @@ __device__ ga_half atomicAdd(ga_half *addr, ga_half val) {
         itemsize_out = numpy.dtype(dtype_out).itemsize
         k_var = "k_vector_add_fast_" + nodename
-        return super(GpuAdvancedIncSubtensor1_dev20, self).c_support_code_apply(node, nodename) + """
+        return super(GpuAdvancedIncSubtensor1_dev20, self).c_support_code_struct(node, nodename) + """
        int GpuArray_vector_add_fast(PyGpuArrayObject* py_self,
                                     PyGpuArrayObject* py_other,
                                     PyGpuArrayObject *indices_arr)
from nose.plugins.skip import SkipTest

import theano.sandbox.gpuarray

if theano.sandbox.gpuarray.pygpu is None:
    raise SkipTest("pygpu not installed")

if not theano.sandbox.gpuarray.pygpu_activated:
    import theano.sandbox.cuda as cuda_ndarray
    if cuda_ndarray.cuda_available:
        cuda_ndarray.use('gpu', default_to_move_computation_to_gpu=False,
                         move_shared_float32_to_gpu=False,
                         enable_cuda=False)
    theano.sandbox.gpuarray.init_dev('cuda')

if not theano.sandbox.gpuarray.pygpu_activated:
    raise SkipTest("pygpu disabled")

test_ctx_name = None

if theano.config.mode == 'FAST_COMPILE':
    mode_with_gpu = theano.compile.mode.get_mode('FAST_RUN').including('gpuarray').excluding('gpu')
    mode_without_gpu = theano.compile.mode.get_mode('FAST_RUN').excluding('gpuarray')
else:
    mode_with_gpu = theano.compile.mode.get_default_mode().including('gpuarray').excluding('gpu')
    mode_without_gpu = theano.compile.mode.get_default_mode().excluding('gpuarray')