* In CudaNdarray.__{iadd,idiv}__, when it is not implemented, return the error.
* THEANO_FLAGS='optimizer=None' now works as expected
* In one case an AdvancedSubtensor1 could be converted to a GpuAdvancedIncSubtensor1 instead of a GpuAdvancedSubtensor1. It probably did not happen due to the order of optimizations, but that order is not guaranteed to be the same on all computers.
* Fixed a memory leak in the error handling of the GPU-to-host copy.
* Fix relating specifically to Python 2.7 on Mac OS X
* Derivative of set_subtensor was wrong.
* infer_shape can now handle Python longs
* Derivative of Alloc was wrong.
* Trying to compute x % y with one or more arguments being complex now
raises an error.
* The output of random samples computed with uniform(..., dtype=...) is
guaranteed to be of the specified dtype instead of potentially being of a
higher-precision dtype.
* The perform() method of DownsampleFactorMax did not give the right result when reusing output storage. This happens only if you use the Theano flag 'linker=c|py_nogc' or manually specify the mode to be 'c|py_nogc'.
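The dtype guarantee for uniform(..., dtype=...) mentioned above can be sketched in plain NumPy (this is an illustration of the contract, not Theano's actual implementation; the helper name is hypothetical):

```python
import numpy as np

def uniform_with_dtype(rng, size, dtype):
    """Draw uniform samples and guarantee the requested dtype,
    even though the underlying generator works in float64."""
    samples = rng.uniform(size=size)  # float64 internally
    return np.asarray(samples, dtype=dtype)

rng = np.random.RandomState(42)
out = uniform_with_dtype(rng, (3,), "float32")
print(out.dtype)  # float32
```

The point of the fix is exactly this cast: the caller always receives the dtype it asked for, never a higher-precision one.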
Crash fixed:
* Work around a bug in gcc 4.3.0 that makes the compilation of the 2d convolution crash.
* Some optimizations crashed when the "ShapeOpt" optimization was disabled.
* On an unusual Python 2.4.4 on Windows.
* When using a C cache copied from another location.
* On Windows 32 bits when setting a complex64 to 0.
* Compilation crash with CUDA 4.
* When wanting to copy the compilation cache from one computer to another.
  * This can be useful for using Theano on a computer without a compiler.
* GPU:
  * Compilation crash fixed under Ubuntu 11.04.
  * Compilation crash fixed with CUDA 4.0.

Optimization:
* Optimize all cases of a subtensor followed by a subtensor.
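The subtensor-followed-by-subtensor optimization collapses two consecutive slices into a single one. A rough NumPy sketch of the rewrite it performs (the composed slice below is worked out by hand, not produced by Theano):

```python
import numpy as np

x = np.arange(10)

# Two chained subtensors ...
chained = x[2:8][1:4]

# ... are equivalent to one composed subtensor,
# which is what the optimization emits instead.
collapsed = x[3:6]

assert np.array_equal(chained, collapsed)
print(chained)  # [3 4 5]
```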
GPU:
* Move to the GPU fused elemwise ops that have dtypes other than float32 in them (except float64), if the inputs and outputs are float32.
  * This allows moving elemwise comparisons to the GPU if we cast the result to float32 afterwards.
* Implemented CudaNdarray.ndim to have the same interface as ndarray.
* Fixed slowdown caused by multiple chained views on CudaNdarray objects
* CudaNdarray_alloc_contiguous changed so as to never try to free
memory on a view: new "base" property
* Safer decref behaviour in CudaNdarray in case of failed allocations
* New GPU implementation of tensor.basic.outer
* Multinomial random variates now available on GPU
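The new "base" property on CudaNdarray mirrors NumPy's ndarray.base: a view keeps a reference to the array that owns the memory, so deallocation logic can tell views from owners and never frees memory through a view. In NumPy terms (an analogy, not CudaNdarray itself):

```python
import numpy as np

a = np.arange(6, dtype="float32")
v = a[2:]  # a view: shares a's memory

# The view does not own its memory; its "base" is the owning array.
assert v.base is a
assert a.base is None

# Writing through the view is visible in the base array.
v[0] = 99.0
print(a[2])  # 99.0
```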
Sandbox:
* MRG random generator now implements the same casting behavior as the regular random generator.

New features (not enabled by default):
* DebugMode now checks Ops with different patterns of preallocated memory, configured by config.DebugMode.check_preallocated_output.
* New way to test the graph as we build it. Allows to easily find the source of shape mismatch errors.

New features:
* ProfileMode
  * Profiles the scan overhead.
  * Simple hook system to add profilers.
  * Reordered the output to go from more general to more specific.
* New Linkers (Theano flags linker={vm,cvm})
  * The new linkers allow lazy evaluation of the new ifelse op, meaning we compute only the true or false branch depending on the condition. This can speed up some types of computation.
  * Uses a new profiling system (that currently tracks less stuff).
  * The vm is implemented in Python, so it can help debugging in some cases.
  * The cvm is implemented in C, so it lowers Theano's overhead.
  * In the future, the default will be the cvm.
* var[vector of index] now works (grad works recursively, the direct grad works inplace, GPU works).
  * Limitation: works only on the outermost dimension.
* Some new, not yet well tested sparse ops: theano.sparse.sandbox.{SpSum, Diag, SquareDiagonal, ColScaleCSC, RowScaleCSC, Remove0, EnsureSortedIndices, ConvolutionIndices}
* cuda.root inferred if nvcc is on the path, otherwise defaults to
/usr/local/cuda
* Better graph printing for graphs involving a scan subgraph
* Casting behavior can be controlled through config.cast_policy, a new (experimental) mode.
* Smarter C module cache, avoiding erroneous usage of the wrong C
implementation when some options change, and avoiding recompiling the
same module multiple times in some situations.
* The "theano-cache clear" command now clears the cache more thoroughly.
* More extensive linear algebra ops (CPU only) that wrap scipy.linalg
now available in the sandbox.
* CUDA devices 4 - 16 should now be available if present.
* infer_shape support for the View op, better infer_shape support in Scan
* infer_shape supported in all cases of subtensor
* tensor.grad now gives an error by default when computing the gradient
wrt a node that is disconnected from the cost (not in the graph, or
no continuous path from that op to the cost).
* New tensor.isnan and isinf functions.
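Two of the features above have direct NumPy analogues, shown here as a sketch: indexing with a vector of indices (which, per the note above, works only on the outermost dimension) and the new isnan/isinf functions, which behave like their NumPy counterparts:

```python
import numpy as np

x = np.arange(12.0).reshape(3, 4)

# var[vector of index]: select rows 0 and 2 (outermost dimension only).
rows = x[[0, 2]]
assert rows.shape == (2, 4)

# tensor.isnan / tensor.isinf mirror the NumPy functions.
y = np.array([1.0, np.nan, np.inf])
print(np.isnan(y))  # [False  True False]
print(np.isinf(y))  # [False False  True]
```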
Documentation:
* Better commenting of cuda_ndarray.cu
* Fixes in the scan documentation: add missing declarations/print statements
* How to compute the `Jacobian, Hessian, Jacobian times a vector, Hessian times a vector <http://deeplearning.net/software/theano/tutorial/gradients.html>`_.
* Better error message on failed __getitem__
* Slides for a 3-hour class with exercises, given at the HPCS2011 Conference in Montreal.
* Updated documentation on profile mode
* Better documentation of testing on Windows
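The tutorial chapter linked above covers Jacobian and Hessian products; as a rough numerical sketch (finite differences in plain NumPy, not Theano's symbolic scan-based method), a Jacobian-times-vector product J(x)·v can be approximated without ever forming J:

```python
import numpy as np

def f(x):
    # Example vector-valued function; its Jacobian is diag(2*x).
    return x ** 2

def jacobian_times_vector(f, x, v, eps=1e-6):
    """Approximate J(x) @ v with a forward finite difference."""
    return (f(x + eps * v) - f(x)) / eps

x = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, 1.0])

approx = jacobian_times_vector(f, x, v)
exact = 2 * x * v  # diag(2x) @ v
assert np.allclose(approx, exact, atol=1e-4)
```

Theano's symbolic approach computes the same quantity exactly rather than approximately; this sketch only shows what "Jacobian times a vector" means.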
Others:
* Better documentation of the 'run_individual_tests' script
* Logger name renamed to be consistent.
* Logger function simplified and made more consistent.
Unit tests:
* More strict float comparison by default
* Fixed an error being transformed into an unrelated error when the compute_test_value Theano flag is used.
* Reuse the subtensor tests of tensor for GPU tensors (more GPU tests)
* Compilation cache enhancements.
* Tests that check for aliased function inputs and assure appropriate copying
* Made compatible with NumPy 1.6 and SciPy 0.9
(#374)
* Fixed tests for when NumPy has a new dtype that Theano does not support.
* Better test of copies in CudaNdarray
* Fixed some tests when SciPy is not available.
* New tests relating to the new base pointer requirements
* Don't compile anything when Theano is imported. Compile support code when we compile the first C code.
* Better scripts to run tests individually or in batches
* Some tests are now run whenever CUDA is available, not just when it has been enabled before.
* Python 2.4 fixes:
  * Fixed the file theano/misc/check_blas.py.
  * For Python 2.4.4 on Windows, replaced float("inf") with numpy.inf.
* Tests display less pointless warnings.
* Removed useless inputs to a scan node.
  * Beautification mostly, making the graph more readable. Such inputs would appear as a consequence of other optimizations.
Other:
* Correctly set the broadcast flag to True in the output var of a Reshape op when we receive an int 1 in the new shape.
* pydotprint: high contrast mode is now the default, with an option to print more compact node names.
* pydotprint: now truncates labels that are too long.
* More compact printing (ignore leading "Composite" in op names)

Core:
* There is a new mechanism that lets an Op permit one of its inputs to be aliased to another destroyed input. This will generally result in incorrect calculation, so it should be used with care! The right way to use it is when the caller can guarantee that even if these two inputs look aliased, they actually will never overlap. This mechanism can be used, for example, by a new alternative approach to implementing Scan. If an op has an attribute called "destroyhandler_tolerate_aliased", then this is what's going on. IncSubtensor is thus far the only Op to use this mechanism.
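The Reshape fix above concerns broadcastability: when the new shape contains a literal 1, that dimension is now marked broadcastable. In NumPy terms (an analogy for the flag's meaning, not Theano code), a length-1 dimension is exactly one that broadcasts against others:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)  # the literal 1 makes dim 1 length-1
row = np.arange(4).reshape(1, 4)

# A (3, 1) array broadcasts against a (1, 4) array to give (3, 4);
# this is what the broadcast flag on the Reshape output encodes.
result = col + row
assert result.shape == (3, 4)
print(result[2, 3])  # 5
```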
* manually compiled numpy and ATLAS with 2 threads
* goto 1.26 with 1, 2, 4 and 8 threads.
* goto2 1.13 compiled with multiple threads enabled.