Commit f6351e1a authored by James Bergstra

merge w conflict in cuda_ndarray.cu

......@@ -8,6 +8,15 @@ syntax: glob
*.so
*.sw?
*~
*.aux
*.log
*.nav
*.out
*.pdf
*.snm
*.toc
*.vrb
.noseids
Theano.egg-info
\#*\#
build
......
......@@ -5,6 +5,81 @@
Release Notes
=============
Theano 0.3.1 (2011-02-21)
=========================
Deprecation:
* The theano shared variable attribute `value` is deprecated, use `get_value()` or `set_value()`!
See http://deeplearning.net/software/theano/tutorial/aliasing.html
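As a toy illustration of why the accessors are safer than raw attribute access, a hypothetical stand-in (not Theano's actual implementation) can show the copy-on-read behavior that `get_value()` provides by default:

```python
import copy

class Shared(object):
    """Toy stand-in for a Theano shared variable (illustration only)."""
    def __init__(self, value):
        self._value = value

    def get_value(self, borrow=False):
        # By default, return a copy so callers cannot alias internal storage.
        return self._value if borrow else copy.deepcopy(self._value)

    def set_value(self, new_value, borrow=False):
        self._value = new_value if borrow else copy.deepcopy(new_value)

s = Shared([1, 2, 3])
v = s.get_value()
v[0] = 99                # mutating the returned copy...
print(s.get_value())     # ...leaves the shared storage untouched: [1, 2, 3]
```

The `borrow` flag here mimics the aliasing contract described on the linked page: only an explicit `borrow=True` exposes the internal storage.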
Bugs fixed:
* The random number generator in theano/sandbox/rng_mrg.py did not always return the same sequence of numbers on the CPU and GPU.
* In some cases, there was a (possibly large) fraction of non-random garbage in the returned sequence.
* In Python mode (not the default mode), when the input of an elemwise operation was an empty ndarray, we did not return an empty ndarray.
* Scan cached the number of steps. This caused no problem on its own, because each call to Scan refreshed the cached value.
The problem arose when calling ScanGrad, which used the cached number of steps without refreshing it.
To be affected by this bug, one would have to compile two graphs, one containing a Scan and the other the corresponding GradScan,
call the first function to cache the number of steps, and then call the second function with a different number of steps.
* In GpuConv, errors in conv_patch_stack_reduce when the entire kernel doesn't fit into shared memory.
The error was not found before because its impact was less than the relative tolerance of 1e-3. Now the relative tolerance is 1e-5.
Crash fixed:
* Removed an exception that made Theano crash when taking the gradient of DimShuffle in some particular cases.
* Compilation crash for GpuElemwise with tensors with a high number of dimensions (~6 or more).
* Disabled a C code generator that made gcc crash on complex types.
* Crash in optimization when an Op has no input.
* Output shape is now computed correctly for matrix-vector multiplication on the GPU.
* In Scan, when using numbers as inputs instead of symbolic variables.
* In GradScan, when there is only 1 input in the Scan.
* In GpuSum, bug in the calculation of n_blocks for the 10 pattern (sum over the rows of a matrix).
* Some segfaults at exit with GPU code.
Optimization:
* New SpecifyShape op that allows passing more shape information in the graph.
* Sped up gemv by working around scipy's gemv slowness when the matrix is in C order (the default).
* Removed joins of only 1 element.
* During optimization, consider one more case in get_constant_value.
GPU:
* cuda_shared.value = X now works inplace!
* cuda_shared_var.set_value(new_ndarray) will overwrite the old value inplace in the most common case.
* Allow creating a CudaNdarraySharedVariable from a CudaNdarray.
* New init_gpu_device Theano flag.
* Fuse GpuElemwise more often (in the case where there are so many inputs that fusing them all would exceed the 256-byte limit on parameters to a GPU function).
* A CPU join of only 1 element is now moved to the GPU.
New features:
* tensor.reshape now makes dimensions of length 1 broadcastable.
* tensor.prod now implements the gradient.
* DebugMode now warns if an Op declared itself as returning a view of the input but did not do so.
* This behaviour is a problem because it can block other Ops from operating inplace on the same inputs, which can lower memory reuse.
* Sparse.structured_dot now works when both matrices are sparse.
* Sparse type is now supported by the shape op, and the ShapeFeature optimizer works correctly with them.
* New 3D convolution ops, with CPU and GPU implementations.
* New colors in pydotprint.
Documentation:
* Documented lib.amdlibm and (new) init_gpu_device config variables.
* A new page (written for 0.3, but an error hid it on the web page) on the memory aliasing contract of Theano.
* Revision to the Windows installation instructions.
* The cuda documentation is now generated on the web server.
* Better documentation of .theanorc and its sections.
Unit tests:
* Stop usage of deprecated functions or syntax in the unit tests.
* Better testing of GPU convolution nets.
* Make more tests able to use different random seeds.
* Tests of sparse now use default mode, not a hard-coded one.
* Remove some tests of unimplemented features.
Other:
* The name of the compiledir now includes the Python version, to make life easier for people with many Python versions.
* Added theano.tensor.std as a shortcut to sqrt(var(input=input, axis=axis)).
* Whitespace, tabulation and indentation clean-up in the code.
* Better detection of memory sharing between variables.
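The `theano.tensor.std` shortcut above wraps the standard identity std = sqrt(var). The NumPy counterparts obey the same relation, which serves as a quick sanity check (NumPy is used here only for illustration):

```python
import numpy

x = numpy.array([[1., 2.], [3., 4.]])
# std(x, axis) == sqrt(var(x, axis)), the identity theano.tensor.std wraps
assert numpy.allclose(numpy.std(x, axis=0),
                      numpy.sqrt(numpy.var(x, axis=0)))
print(numpy.std(x, axis=0))  # [1. 1.]
```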
Theano 0.3 (2010-11-23)
=======================
......
Modifications in the trunk since the last release
In trunk since 0.3.1 release
----------------------------
GPU:
* Move to the GPU fused elemwise ops that have dtypes other than float32 in them
(except float64), if the inputs and outputs are float32.
* This allows moving elemwise comparisons to the GPU if we cast the result to float32 afterwards.
Theano 0.3.1 (2011-02-21)
----------------------------
Theano 0.4.0rc4 (2011-06-13)
--------------------------------------------------
Deprecation:
* The theano shared variable attribute `value` is deprecated, use `get_value()` or `set_value()`!
See http://deeplearning.net/software/theano/tutorial/aliasing.html
* tag.shape attribute deprecated (#633)
* CudaNdarray_new_null is deprecated in favour of CudaNdarray_New
* Dividing integers with / is deprecated: use // for integer division, or
cast one of the integers to a float type if you want a float result (you may
also change this behavior with config.int_division).
* Removed (already deprecated) sandbox/compile module
* Removed (already deprecated) incsubtensor and setsubtensor functions,
inc_subtensor and set_subtensor are to be used instead.
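The integer-division deprecation above can be made concrete with plain Python (the `config.int_division` flag itself is Theano-specific; this sketch only shows the two behaviors being distinguished):

```python
# `//` is explicit integer (floor) division; casting an operand gives a float
print(7 // 2)        # 3
print(float(7) / 2)  # 3.5
print(-7 // 2)       # floor division rounds toward -inf: -4
```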
Bugs fixed:
* The random number generator in theano/sandbox/rng_mrg.py did not always return the same sequence of numbers on the CPU and GPU.
* In some cases, there was a (possibly large) fraction of non-random garbage in the returned sequence.
* In Python mode (not the default mode), when the input of an elemwise operation was an empty ndarray, we did not return an empty ndarray.
* Scan cached the number of steps. This caused no problem on its own, because each call to Scan refreshed the cached value.
The problem arose when calling ScanGrad, which used the cached number of steps without refreshing it.
To be affected by this bug, one would have to compile two graphs, one containing a Scan and the other the corresponding GradScan,
call the first function to cache the number of steps, and then call the second function with a different number of steps.
* In GpuConv, errors in conv_patch_stack_reduce when the entire kernel doesn't fit into shared memory.
The error was not found before because its impact was less than the relative tolerance of 1e-3. Now the relative tolerance is 1e-5.
* CudaNdarray.__{iadd,idiv}__ now return an error when the operation is not implemented.
* THEANO_FLAGS='optimizer=None' now works as expected
* Fixed memory leak in error handling on GPU-to-host copy
* Fix relating specifically to Python 2.7 on Mac OS X
* infer_shape can now handle Python longs
* Trying to compute x % y with one or more arguments being complex now
raises an error.
* The output of random samples computed with uniform(..., dtype=...) is
guaranteed to be of the specified dtype instead of potentially being of a
higher-precision dtype.
* The perform() method of DownsampleFactorMax did not give the right result
when reusing output storage. This happened only when using the Theano flag
'linker=c|py_nogc' or manually specifying the mode as 'c|py_nogc'.
Crash fixed:
* Removed an exception that made Theano crash when taking the gradient of DimShuffle in some particular cases.
* Compilation crash for GpuElemwise with tensors with a high number of dimensions (~6 or more).
* Disabled a C code generator that made gcc crash on complex types.
* Crash in optimization when an Op has no input.
* Output shape is now computed correctly for matrix-vector multiplication on the GPU.
* In Scan, when using numbers as inputs instead of symbolic variables.
* In GradScan, when there is only 1 input in the Scan.
* In GpuSum, bug in the calculation of n_blocks for the 10 pattern (sum over the rows of a matrix).
* Some segfaults at exit with GPU code.
* Worked around a bug in gcc 4.3.0 that made the compilation of 2d convolution
crash.
* Some optimizations crashed when the "ShapeOpt" optimization was disabled.
Optimization:
* New SpecifyShape op that allows passing more shape information in the graph.
* Sped up gemv by working around scipy's gemv slowness when the matrix is in C order (the default).
* Removed joins of only 1 element.
* During optimization, consider one more case in get_constant_value.
* Optimized all cases of a subtensor followed by a subtensor.
GPU:
* cuda_shared.value = X now works inplace!
* cuda_shared_var.set_value(new_ndarray) will overwrite the old value inplace in the most common case.
* Allow creating a CudaNdarraySharedVariable from a CudaNdarray.
* New init_gpu_device Theano flag.
* Fuse GpuElemwise more often (in the case where there are so many inputs that fusing them all would exceed the 256-byte limit on parameters to a GPU function).
* A CPU join of only 1 element is now moved to the GPU.
* Move to the GPU fused elemwise ops that have dtypes other than float32 in them
(except float64), if the inputs and outputs are float32.
* This allows moving elemwise comparisons to the GPU if we cast the result to
float32 afterwards.
* Implemented CudaNdarray.ndim to have the same interface as ndarray.
* Fixed slowdown caused by multiple chained views on CudaNdarray objects
* CudaNdarray_alloc_contiguous changed so as to never try to free
memory on a view: new "base" property
* Safer decref behaviour in CudaNdarray in case of failed allocations
* New GPU implementation of tensor.basic.outer
* Multinomial random variates now available on GPU
New features:
* tensor.reshape now makes dimensions of length 1 broadcastable.
* tensor.prod now implements the gradient.
* DebugMode now warns if an Op declared itself as returning a view of the input but did not do so.
* This behaviour is a problem because it can block other Ops from operating inplace on the same inputs, which can lower memory reuse.
* Sparse.structured_dot now works when both matrices are sparse.
* Sparse type is now supported by the shape op, and the ShapeFeature optimizer works correctly with them.
* New 3D convolution ops, with CPU and GPU implementations.
* New colors in pydotprint.
* ProfileMode
* profiles the Scan overhead
* simple hook system to add profilers
* reordered the output to go from more general to more specific
* DebugMode now checks Ops with different patterns of preallocated memory,
configured by config.DebugMode.check_preallocated_output.
* var[vector of indices] now works (grad works recursively, the direct grad
works inplace, GPU works).
* Limitation: works only on the outermost dimension.
* New way to test the graph as we build it, which allows easily finding the
source of shape mismatch errors:
`http://deeplearning.net/software/theano/tutorial/debug_faq.html#interactive-debugger`__
* cuda.root inferred if nvcc is on the path, otherwise defaults to
/usr/local/cuda
* Better graph printing for graphs involving a scan subgraph
* Casting behavior can be controlled through config.cast_policy,
new (experimental) mode.
* Smarter C module cache, avoiding erroneous usage of the wrong C
implementation when some options change, and avoiding recompiling the
same module multiple times in some situations.
* The "theano-cache clear" command now clears the cache more thoroughly.
* More extensive linear algebra ops (CPU only) that wrap scipy.linalg
now available in the sandbox.
* CUDA devices 4 - 16 should now be available if present.
* infer_shape support for the View op, better infer_shape support in Scan
* infer_shape supported in all cases of subtensor
* tensor.grad now gives an error by default when computing the gradient
wrt a node that is disconnected from the cost (not in the graph, or
no continuous path from that op to the cost).
* New tensor.isnan and isinf functions.
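The new `tensor.isnan` and `tensor.isinf` functions operate elementwise, like their NumPy namesakes. A NumPy sketch of the expected semantics:

```python
import numpy

x = numpy.array([0.0, numpy.inf, -numpy.inf, numpy.nan])
print(numpy.isnan(x))  # [False False False  True]
print(numpy.isinf(x))  # [False  True  True False]
```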
Documentation:
* Documented lib.amdlibm and (new) init_gpu_device config variables.
* A new page (written for 0.3, but an error hid it on the web page) on the memory aliasing contract of Theano.
* Revision to the Windows installation instructions.
* The cuda documentation is now generated on the web server.
* Better documentation of .theanorc and its sections.
* Better commenting of cuda_ndarray.cu
* Fixes in the scan documentation: add missing declarations/print statements
* Better error message on failed __getitem__
* Updated documentation on profile mode
* Better documentation of testing on Windows
* Better documentation of the 'run_individual_tests' script
Unit tests:
* Stop usage of deprecated functions or syntax in the unit tests.
* Better testing of GPU convolution nets.
* Make more tests able to use different random seeds.
* Tests of sparse now use default mode, not a hard-coded one.
* Remove some tests of unimplemented features.
* More strict float comparison by default
* Reused the tensor subtensor tests for GPU tensors (more GPU tests)
* Tests that check for aliased function inputs and assure appropriate copying
(#374)
* Better test of copies in CudaNdarray
* New tests relating to the new base pointer requirements
* Better scripts to run tests individually or in batches
* Some tests are now run whenever cuda is available and not just when it has
been enabled before
* Tests display less pointless warnings.
Other:
* The name of the compiledir now includes the Python version, to make life easier for people with many Python versions.
* Added theano.tensor.std as a shortcut to sqrt(var(input=input, axis=axis)).
* Whitespace, tabulation and indentation clean-up in the code.
* Better detection of memory sharing between variables.
* Correctly set the broadcast flag to True in the output variable of
a Reshape op when we receive an int 1 in the new shape.
* pydotprint: high contrast mode is now the default, option to print
more compact node names.
* pydotprint: now truncates labels that are too long.
* More compact printing (ignore leading "Composite" in op names)
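Both broadcast-flag entries above (`tensor.reshape` and the Reshape-op fix) concern dimensions of length 1, which broadcast against larger dimensions. A NumPy sketch of the rule Theano follows:

```python
import numpy

row = numpy.arange(3).reshape(1, 3)  # length-1 leading dim broadcasts
col = numpy.arange(2).reshape(2, 1)  # length-1 trailing dim broadcasts
print((row + col).shape)  # (2, 3)
```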
......@@ -188,10 +188,10 @@ def execs_timeit_2vector(exprs, fname=None):
pylab.subplots_adjust(wspace=0.25, hspace=0.25)
#legend=[]
#plot = fig.add_subplot(1,len(exprs),idx)
speedup = [t["numpy"].min()/t["numexpr"].min() for t in time]
pylab.semilogx(nb_calls, speedup, linewidth=1.0, linestyle='--', color='r')
speedup = [t["numpy"].min()/t["theano"].min() for t in time]
pylab.semilogx(nb_calls, speedup, linewidth=1.0, color='b')
pylab.grid(True)
if (idx == 2) or (idx == 3):
......
#!/usr/bin/env python
import logging
import os
import sys

from theano import config
from theano.gof.cc import get_module_cache

_logger = logging.getLogger('theano.bin.theano-cache')
_logger.setLevel(logging.WARN)

if len(sys.argv) == 1:
    print config.compiledir
elif sys.argv[1] == 'clear':
    # We skip the refresh on module cache creation because the refresh will
    # be done when calling clear afterwards.
    cache = get_module_cache(init_args=dict(do_refresh=False))
    cache.clear(unversioned_min_age=-1, clear_base_files=True,
                delete_if_problem=True)
    # Print a warning if some cached modules were not removed, so that the
    # user knows to delete them manually to fully clear the cache.
    items = [item for item in sorted(os.listdir(cache.dirname))
             if item.startswith('tmp')]
    if items:
        _logger.warning('There remain elements in the cache dir that you may '
                        'need to erase manually. The cache dir is:\n %s' %
                        config.compiledir)
        _logger.debug('Remaining elements (%s): %s' %
                      (len(items), ', '.join(items)))
else:
    print 'command "%s" not recognized' % sys.argv[1]
    print 'Type "theano-cache" to print the cache location'
......
......@@ -51,9 +51,9 @@ copyright = '2008--2011, LISA lab'
# other places throughout the built documents.
#
# The short X.Y version.
version = '0.3.1'
version = '0.4'
# The full version, including alpha/beta/rc tags.
release = '0.3.1'
release = '0.4.0rc4'
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
......
.. _developer:
======================
==============================================
Theano Design and Implementation Documentation
======================
==============================================
.. toctree::
......
......@@ -7,7 +7,7 @@ Tensor
This file describes the design of theano.tensor.
Elemwise grad and R_op
=================
======================
Here's another straightforward example, though a bit more elaborate
than adding two numbers together. Let's say that you want to compute
......
......@@ -74,7 +74,8 @@ following methods:
A function Mode may allow ``output_storage`` elements to persist between
evaluations, or it may reset ``output_storage`` cells to hold a value of
``None``. It can also pre-allocate some memory for the Op to use.
This feature can allow ``perform`` to reuse memory between
calls, for example.
This method must be determined by the inputs. That is to say, if
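The ``output_storage`` contract described above can be sketched in plain Python. This is an illustrative toy, not Theano's actual code: a ``perform``-style function that doubles its input, reusing a preallocated buffer when the Mode preserved one:

```python
import numpy

def perform(inputs, output_storage):
    """Toy perform() that doubles its input, reusing preallocated output."""
    x, = inputs
    z = output_storage[0]   # a one-element cell, as in Theano's convention
    if z[0] is None or z[0].shape != x.shape:
        z[0] = numpy.empty_like(x)  # allocate only when the cell is unusable
    numpy.multiply(x, 2, out=z[0])

cell = [None]
perform((numpy.ones(3),), [cell])
buf = cell[0]
perform((numpy.zeros(3),), [cell])
print(cell[0])         # [0. 0. 0.]
print(cell[0] is buf)  # True: the buffer was reused across calls
```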
......
......@@ -108,7 +108,7 @@ case if ``borrow`` was True, the thunk would be allowed to reuse (or
The compile cache is based upon the C++ code of the graph to be compiled.
So, if you change compilation configuration variables, such as
:attr:`config.blas.ldflags`, you will need to manually remove your compile cache,
using ``Theano/bin/theano-compiledir clear``
using ``Theano/bin/theano-cache clear``
Theano also implements a lock mechanism that prevents
multiple compilations within the same compilation directory (to avoid
......
......@@ -32,16 +32,27 @@ default values.
reference to ``value`` (i.e. casting prohibited).
If ``strict`` is False, then casting may happen, but downcasting should
only be used in two situations:
* if ``allow_downcast`` is True
* if ``allow_downcast`` is ``None`` and the default behavior for this
type allows downcasting for the given ``value`` (this behavior is
type-dependent, you may decide what your own type does by default)
We need to define ``filter`` with three arguments. The second argument
must be called ``strict`` (Theano often calls it by keyword) and must
have a default value of ``False``. The third argument must be called
``allow_downcast`` and must have a default value of ``None``.
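A minimal sketch of the ``strict``/``allow_downcast`` logic described above, using NumPy dtypes. The function name and the float32 default are hypothetical, not Theano's implementation:

```python
import numpy

def filter_value(value, strict=False, allow_downcast=None, dtype='float32'):
    """Toy sketch of a Type.filter method (not Theano's actual code)."""
    arr = numpy.asarray(value)
    if arr.dtype == numpy.dtype(dtype):
        return arr
    if strict:
        # strict mode: the value must already be of the right dtype
        raise TypeError("strict mode forbids casting")
    if numpy.can_cast(arr.dtype, dtype) or allow_downcast:
        return arr.astype(dtype)
    raise TypeError("refusing to downcast %s to %s" % (arr.dtype, dtype))

print(filter_value([1, 2], allow_downcast=True).dtype)  # float32
```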
.. method:: filter_inplace(value, storage, strict=False, allow_downcast=None)
If filter_inplace is defined, it will be called instead of
filter(). This allows reusing previously allocated memory. As
of this writing, it is used only when we transfer new data to a
shared variable on the GPU.
``storage`` will be the old value, i.e. the old numpy array,
CudaNdarray, etc.
.. method:: is_valid_value(value)
Returns True iff the value is compatible with the Type. If
......
all: presentation.pdf
presentation.pdf: presentation.tex pics/f_optimized.png pics/logreg_pydotprint_prediction.png
# pics/f_unoptimized.png pics/logreg_pydotprint_predic.png pics/logreg_pydotprint_train.png
pdflatex presentation.tex
pics/f_optimized.png: simple_example.py
python simple_example.py
pics/logreg_pydotprint_prediction.png: logreg_example.py
python logreg_example.py
#pics/f_unoptimized.png: simple_example.py
# python simple_example.py
import numpy
import theano


class DoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, x):
        x = theano.tensor.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2

x = theano.tensor.matrix()
f = theano.function([x], DoubleOp()(x))
inp = numpy.random.rand(5, 5)
out = f(inp)
assert numpy.allclose(inp * 2, out)
print inp
print out
import numpy
import theano
import theano.tensor as T
rng = numpy.random
N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX), rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
training_steps = 10000
# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()
# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))  # Probability of having a one
prediction = p_1 > 0.5                 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum()     # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates={w: w-0.01*gw, b: b-0.01*gb},
        name="train")
predict = theano.function(inputs=[x], outputs=prediction,
        name="predict")

if any([n.op.__class__.__name__ == 'Gemv' for n in train.maker.env.toposort()]):
    print 'Used the cpu'
elif any([n.op.__class__.__name__ == 'GpuGemm' for n in train.maker.env.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.env.toposort()

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()

print "target values for D"
print D[1]
print "prediction on D"
print predict(D[0])

# Print the graph used in the slides
theano.printing.pydotprint(predict,
        outfile="pics/logreg_pydotprint_predic.png",
        var_with_name_simple=True)
theano.printing.pydotprint_variables(prediction,
        outfile="pics/logreg_pydotprint_prediction.png",
        var_with_name_simple=True)
theano.printing.pydotprint(train,
        outfile="pics/logreg_pydotprint_train.png",
        var_with_name_simple=True)