Commit caefd498 authored by notoraptor

Prepare final release 0.9.0.

Parent 127ae529
...@@ -222,3 +222,5 @@ Vivek Kulkarni <viveksck@gmail.com> Vivek Kulkarni <vvkulkarni@cs.stonybrook.edu
Wei Li <kuantkid@gmail.com> kuantkid <kuantkid@gmail.com>
Yoshua Bengio <bengioy@iro.umontreal.ca> bengioy@bengio-mac.local <bengioy@bengio-mac.local>
Ziye Fan <fanziye.cis@gmail.com> FanZiye(t13m) <fanziye.cis@gmail.com>
Zhouhan LIN <lin.zhouhan@gmail.com> hantek <lin.zhouhan@gmail.com>
Zhouhan LIN <lin.zhouhan@gmail.com> Zhouhan LIN <hantek@Zhouhans-MacBook-Pro.local>
...@@ -5,6 +5,358 @@
Old Release Notes
=================
Theano 0.9.0rc4 (13th of March, 2017)
=====================================
This release extends 0.9.0rc3 and announces the upcoming final release 0.9.0.
Highlights (since 0.9.0rc3):
- Documentation updates
- DebugMode fixes, cache cleanup fixes and other small fixes
- New GPU back-end:
- Fixed offset error in GpuIncSubtensor
- Fixed indexing error in GpuAdvancedSubtensor for more than 2 dimensions
A total of 5 people contributed to this release since 0.9.0rc3 and 123 since 0.8.0, see the lists below.
Committers since 0.9.0rc3:
- Frederic Bastien
- Pascal Lamblin
- Arnaud Bergeron
- Cesar Laurent
- Martin Drawitsch
Theano 0.9.0rc3 (6th of March, 2017)
====================================
This release extends 0.9.0rc2 and announces the upcoming final release 0.9.0.
Highlights (since 0.9.0rc2):
- Graph clean up and faster compilation
- New Theano flag conv.assert_shape to check user-provided shapes at runtime (for debugging)
- Fix overflow in pooling
- Warn if taking softmax over broadcastable dimension
- Removed old files not used anymore
- Test fixes and crash fixes
- New GPU back-end:
- Removed warp-synchronous programming, to get good results with newer CUDA drivers
A total of 5 people contributed to this release since 0.9.0rc2 and 122 since 0.8.0, see the lists below.
Committers since 0.9.0rc2:
- Frederic Bastien
- Arnaud Bergeron
- Pascal Lamblin
- Florian Bordes
- Jan Schlüter
Theano 0.9.0rc2 (27th of February, 2017)
========================================
This release extends 0.9.0rc1 and announces the upcoming final release 0.9.0.
Highlights (since 0.9.0rc1):
- Fixed dnn conv grad issues
- Allowed pooling of empty batch
- Use of 64-bit indexing in sparse ops to allow matrices with more than 2\ :sup:`31`\ -1 elements.
- Removed old benchmark directory
- Crash fixes, bug fixes, warnings improvements, and documentation update
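The 64-bit indexing item above is about the hard ceiling of signed 32-bit indices: 2^31 - 1 elements. As a small stdlib-only sketch (plain Python with `ctypes`, not Theano code), this shows why a 32-bit index wraps past that point while a 64-bit index does not:

```python
import ctypes

# Largest element count addressable with a signed 32-bit index.
INT32_MAX = 2**31 - 1
print(INT32_MAX)  # 2147483647

# One element past the limit wraps to a negative value in 32-bit C arithmetic.
wrapped = ctypes.c_int32(INT32_MAX + 1).value
print(wrapped)  # -2147483648

# A 64-bit index holds the same count without wrapping.
ok = ctypes.c_int64(INT32_MAX + 1).value
print(ok)  # 2147483648
```

This is the same overflow a sparse op would hit if it kept 32-bit indices internally, which is what motivates the switch.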
A total of 9 people contributed to this release since 0.9.0rc1 and 121 since 0.8.0, see the lists below.
Committers since 0.9.0rc1:
- Frederic Bastien
- Pascal Lamblin
- Steven Bocco
- Simon Lefrancois
- Lucas Beyer
- Michael Harradon
- Rebecca N. Palmer
- David Bau
- Micah Bojrab
Theano 0.9.0rc1 (20th of February, 2017)
========================================
This release extends the 0.9.0beta1 and announces the upcoming final release 0.9.
Highlights (since 0.9.0beta1):
- Better integration of Theano+libgpuarray packages into conda distribution
- Better handling of Windows end-lines into C codes
- Better compatibility with NumPy 1.12
- Faster scan optimizations
- Fixed broadcast checking in scan
- Bug fixes related to merge optimizer and shape inference
- Many other bug fixes and improvements
- Updated documentation
- New GPU back-end:
- Value of a shared variable is now set inplace
A total of 26 people contributed to this release since 0.9.0beta1 and 117 since 0.8.0, see the list at the bottom.
Interface changes:
- In MRG, replaced method `multinomial_wo_replacement()` with new method `choice()`
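The `choice()` method above samples without replacement. The Theano/MRG call itself is not reproduced here; as a hedged illustration of the sampling semantics only, stdlib `random.sample` behaves the same way, in that a drawn index can never repeat:

```python
import random

random.seed(0)
# Draw 3 of 10 indices without replacement: duplicates are impossible.
picks = random.sample(range(10), k=3)
print(picks)
assert len(picks) == len(set(picks))  # every index is distinct
```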
Convolution updates:
- Implement conv2d_transpose convenience function
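`conv2d_transpose` is the transpose of the forward convolution, and the easiest way to reason about it is the output-shape arithmetic. The helper names below are hypothetical (this is not Theano's API), a sketch of the usual size formulas assuming "valid"-style borders and no output padding:

```python
def conv_output_len(n, k, stride=1):
    # Forward "valid" convolution: how many windows of size k fit in n.
    return (n - k) // stride + 1

def conv_transpose_output_len(n, k, stride=1):
    # Transposed convolution inverts the forward size formula.
    return (n - 1) * stride + k

# A 7-long input convolved with a 3-tap kernel at stride 2 gives 3 outputs...
assert conv_output_len(7, 3, 2) == 3
# ...and the transpose maps those 3 values back onto a 7-long output.
assert conv_transpose_output_len(3, 3, 2) == 7
```

The same arithmetic holds per spatial dimension for 2D inputs, which is why the convenience function only needs the forward parameters to infer the upsampled shape.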
GPU:
- GPUMultinomialFromUniform op now supports multiple dtypes
New features:
- OpFromGraph now allows gradient overriding for every input
- Added Abstract Ops for batch normalization that use cuDNN when available and pure Theano CPU/GPU alternatives otherwise
- Added new Theano flag cuda.enabled
- Added new Theano flag print_global_stats to print some global statistics (time spent) at the end
Others:
- Split op now has C code for CPU and GPU
- "theano-cache list" now includes compilation times
Committers since 0.9.0beta1:
- Frederic Bastien
- Benjamin Scellier
- khaotik
- Steven Bocco
- Arnaud Bergeron
- Pascal Lamblin
- Gijs van Tulder
- Reyhane Askari
- Chinnadhurai Sankar
- Vincent Dumoulin
- Alexander Matyasko
- Cesar Laurent
- Nicolas Ballas
- affanv14
- Faruk Ahmed
- Anton Chechetka
- Alexandre de Brebisson
- Amjad Almahairi
- Dimitar Dimitrov
- Fuchai
- Jan Schlüter
- Jonas Degrave
- Mathieu Germain
- Rebecca N. Palmer
- Simon Lefrancois
- valtron
Theano 0.9.0beta1 (24th of January, 2017)
=========================================
This release contains a lot of bug fixes, improvements and new features, to prepare the upcoming release candidate.
Highlights:
- Many computation and compilation speed up
- More numerical stability by default for some graphs
- Jenkins (gpu tests run on PR in addition to daily buildbot)
- Better handling of corner cases for theano functions and graph optimizations
- More graph optimization (faster execution and smaller graph, so more readable)
- Less C code compilation
- Better Python 3.5 support
- Better numpy 1.12 support
- Support newer Mac and Windows versions
- Conda packages for Mac, Linux and Windows
- Theano scripts now work on Windows
- scan with checkpoint (trade off between speed and memory usage, useful for long sequences)
- Added a bool dtype
- New GPU back-end:
- float16 storage
- Better mapping between theano device number and nvidia-smi number, using the PCI bus ID of graphic cards
- More pooling support on GPU when cuDNN isn't there
- ignore_border=False is now implemented for pooling
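The scan-with-checkpoint item in the highlights above trades recomputation for memory: keep only every k-th state, and rebuild the in-between states on demand. A toy pure-Python sketch of that idea (hypothetical names, not Theano's `scan` API):

```python
def step(x):
    # Some per-step recurrence; here just a toy update.
    return 2 * x + 1

def run_with_checkpoints(x0, n_steps, every):
    """Run n_steps of `step`, storing only every `every`-th state."""
    saved = {0: x0}
    x = x0
    for i in range(1, n_steps + 1):
        x = step(x)
        if i % every == 0:
            saved[i] = x
    return x, saved

def state_at(saved, every, i):
    """Recover state i from the nearest earlier checkpoint by recomputing."""
    base = (i // every) * every
    x = saved[base]
    for _ in range(i - base):
        x = step(x)
    return x

final, saved = run_with_checkpoints(1, 10, every=4)
# Only states 0, 4 and 8 are stored, yet any intermediate is recoverable.
assert state_at(saved, 4, 7) == 255
```

Backpropagation through a long scan needs exactly these intermediate states, so storing 1/k of them and recomputing the rest is the speed/memory trade-off the release notes describe.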
A total of 111 people contributed to this release since 0.8.0, see the list at the bottom.
Interface changes:
- New pooling interface
- Pooling parameters can change at run time
- When converting an empty list/tuple, we now use the floatX dtype
- The MRG random generator now tries to infer the broadcast pattern of its output
- Move softsign out of sandbox to theano.tensor.nnet.softsign
- Roll makes the shift be modulo the size of the axis we roll on
- Merge CumsumOp/CumprodOp into CumOp
- round() defaults to the same as NumPy: half_to_even
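The half_to_even default above is NumPy's banker's rounding, which Python's built-in `round` also uses: exact halves go to the nearest even integer, so rounding errors don't accumulate upward:

```python
# Python's round() already uses half-to-even (banker's) rounding.
print(round(0.5))  # 0
print(round(1.5))  # 2
print(round(2.5))  # 2  (not 3: ties go to the even neighbour)
print(round(3.5))  # 4
```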
Convolution updates:
- Multi-cores convolution and pooling on CPU
- New abstract 3d convolution interface similar to the 2d convolution interface
- Dilated convolution
GPU:
- cuDNN: support version 5.1 and wrap batch normalization (2d and 3d) and RNN functions
- Multi-GPU synchronous updates (via platoon, using NCCL)
- GpuAdvancedSubtensor in new back-end
- Gemv (matrix-vector product) speed-up for special shapes
- Support for MaxAndArgMax for some axis combinations
- Support for solve (using cusolver), erfinv and erfcinv
- cublas gemv workaround when we reduce on an axis with a dimension of size 0
- Warn user that some cuDNN algorithms may produce unexpected results in certain environments
for convolution backward filter operations
New features:
- Add gradient of solve, tensorinv (CPU), tensorsolve (CPU), searchsorted (CPU)
- Add Multinomial Without Replacement
- conv3d2d supports full and half modes
- Add DownsampleFactorMaxGradGrad.grad
- Allow partial evaluation of compiled function
- More Rop support
- Indexing supports ellipsis: a[..., 3], a[1, ..., 3]
- Added theano.tensor.{tensor5,dtensor5, ...}
- compiledir_format supports device
- Added new Theano flag cmodule.age_thresh_use
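Ellipsis indexing, mentioned in the list above, is plain Python syntax: `a[..., 3]` hands the tuple `(Ellipsis, 3)` to `__getitem__`, and the tensor then expands the ellipsis to as many full slices as its rank requires. A tiny stand-in class (illustrative only, not Theano code) makes the mechanics visible:

```python
class IndexEcho:
    # Echoes back whatever index expression Python constructs.
    def __getitem__(self, key):
        return key

a = IndexEcho()
print(a[..., 3])     # (Ellipsis, 3)
print(a[1, ..., 3])  # (1, Ellipsis, 3)
```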
Others:
- Speed up argmax only on GPU (without also needing the max)
- A few infrequent bug fixes
- More stack traces in error messages
- Speed up cholesky grad
- log(sum(exp(...))) now gets stability-optimized
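The log(sum(exp(...))) optimization above rewrites the naive expression into the numerically stable form log(sum(exp(x))) = max(x) + log(sum(exp(x - max(x)))). A plain-Python illustration of the rewrite (stdlib `math` only, not the actual Theano graph optimization):

```python
import math

def logsumexp(xs):
    # Stable rewrite: shift by the max so no exp() can overflow.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

xs = [1000.0, 1000.0]
# The naive form fails here: math.exp(1000.0) raises OverflowError.
print(logsumexp(xs))  # ≈ 1000.693 (= 1000 + log 2)
```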
Other more detailed changes:
- Allow more than one output to be a destructive inplace
- Add flag profiling.ignore_first_call, useful to profile the new gpu back-end
- Doc/error message fixes/updates
- More support of negative axis
- Added the keepdims parameter to the norm function
- Crash fixes
- Make scan gradient more deterministic
- Add support for space in path on Windows
- Remove ProfileMode (use Theano flag profile=True instead)
Committers since 0.8.0:
- Frederic Bastien
- Arnaud Bergeron
- Pascal Lamblin
- Ramana Subramanyam
- Simon Lefrancois
- Steven Bocco
- Gijs van Tulder
- Cesar Laurent
- Chiheb Trabelsi
- Chinnadhurai Sankar
- Mohammad Pezeshki
- Reyhane Askari
- Alexander Matyasko
- Alexandre de Brebisson
- Nan Rosemary Ke
- Pierre Luc Carrier
- Mathieu Germain
- Olivier Mastropietro
- khaotik
- Saizheng Zhang
- Thomas George
- Iulian Vlad Serban
- Benjamin Scellier
- Francesco Visin
- Caglar
- Harm de Vries
- Samira Shabanian
- Jakub Sygnowski
- Samira Ebrahimi Kahou
- Mikhail Korobov
- Faruk Ahmed
- Fei Wang
- Jan Schlüter
- Kv Manohar
- Jesse Livezey
- Kelvin Xu
- Matt Graham
- Ruslana Makovetsky
- Sina Honari
- Bryn Keller
- Ciyong Chen
- Nicolas Ballas
- Vitaliy Kurlin
- Zhouhan LIN
- Gokula Krishnan
- Kumar Krishna Agrawal
- Ozan Çağlayan
- Vincent Michalski
- Ray Donnelly
- Tim Cooijmans
- Vincent Dumoulin
- happygds
- mockingjamie
- Amjad Almahairi
- Christos Tsirigotis
- Ilya Kulikov
- RadhikaG
- Taesup (TS) Kim
- Ying Zhang
- Karthik Karanth
- Kirill Bobyrev
- Yang Zhang
- Yaroslav Ganin
- Liwei Cai
- Morgan Stuart
- Tim Gasper
- Xavier Bouthillier
- p
- texot
- Andrés Gottlieb
- Ben Poole
- Bhavishya Pohani
- Carl Thomé
- Evelyn Mitchell
- Fei Zhan
- Fábio Perez
- Gennadiy Tupitsin
- Gilles Louppe
- Greg Ciccarelli
- He
- Huan Zhang
- Jonas Degrave
- Kaixhin
- Kevin Keraudren
- Maltimore
- Marc-Alexandre Cote
- Marco
- Marius F. Killinger
- Maxim Kochurov
- Neil
- Nizar Assaf
- Rithesh Kumar
- Rizky Luthfianto
- Robin Millette
- Roman Ring
- Sander Dieleman
- Sebastin Santy
- Shawn Tan
- Wazeer Zulfikar
- Wojciech Głogowski
- Yann N. Dauphin
- gw0 [http://gw.tnode.com/]
- hexahedria
- hsintone
- jakirkham
- joncrall
- root
- superantichrist
- tillahoffmann
- wazeerzulfikar
- you-n-g
Theano 0.8.2 (21st of April, 2016)
==================================
......
...@@ -3,334 +3,253 @@ Release Notes
=============
Theano 0.9.0 (20th of March, 2017)
==================================

This is a final release of Theano, version ``0.9.0``, with a lot of
new features, interface changes, improvements and bug fixes.

We recommend that everybody update to this version.

Highlights (since 0.8.0):

- Better Python 3.5 support
- Better numpy 1.12 support
- Conda packages for Mac, Linux and Windows
- Support newer Mac and Windows versions
- More Windows integration:

  - Theano scripts (``theano-cache`` and ``theano-nose``) now work on Windows
  - Better support for Windows end-lines in C code
  - Support for spaces in paths on Windows

- Scan improvements:

  - More scan optimizations, with faster compilation and gradient computation
  - Support for checkpoint in scan (trade-off between speed and memory usage, useful for long sequences)
  - Fixed broadcast checking in scan

- Graph improvements:

  - More numerical stability by default for some graphs
  - Better handling of corner cases for theano functions and graph optimizations
  - More graph optimizations, with faster toposort, compilation and execution
  - Smaller and more readable graphs

- New GPU back-end:

  - Removed warp-synchronous programming to get good results with newer CUDA drivers
  - More pooling support on GPU when cuDNN isn't available
  - Full support of the ignore_border option for pooling
  - Inplace storage for shared variables
  - float16 storage
  - Using the PCI bus ID of graphic cards for a better mapping between theano device numbers and nvidia-smi numbers
  - Added useful stats for GPU in profile mode
  - Added documentation for GPU float16 ops
  - Fixed offset error in ``GpuIncSubtensor``

- Less C code compilation
- Added support for bool dtype
- Updated and more complete documentation
- Bug fixes related to merge optimizer and shape inference
- Bug fixes related to Debug mode
- Lots of other bug fixes, crash fixes and warning improvements

A total of 12 people contributed to this release since 0.9.0rc4 and 125 since 0.8.0, see the lists below.

Interface changes:

- Merged duplicated diagonal functions into two ops: ``ExtractDiag`` (extract a diagonal to a vector)
  and ``AllocDiag`` (set a vector as a diagonal of an empty array)
- Merged ``CumsumOp/CumprodOp`` into ``CumOp``
- Changed the grad() method to L_op in many ops that need the outputs to compute the gradient
- In the MRG module:

  - Replaced method ``multinomial_wo_replacement()`` with new method ``choice()``
  - The random generator now tries to infer the broadcast pattern of its output

- New pooling interface
- Pooling parameters can change at run time
- Moved ``softsign`` out of sandbox to ``theano.tensor.nnet.softsign``
- Using floatX dtype when converting empty list/tuple
- ``Roll`` makes the shift be modulo the size of the axis we roll on
- ``round()`` defaults to the same as NumPy: half_to_even

Convolution updates:

- Support of full and half modes for 2D and 3D convolutions
- Allowed pooling of empty batch
- Implemented the ``conv2d_transpose`` convenience function
- Multi-core convolution and pooling on CPU
- New abstract 3d convolution interface similar to the 2d convolution interface
- Dilated convolution

GPU:

- cuDNN: support version 5.1 and wrap batch normalization (2d and 3d) and RNN functions
- Multi-GPU synchronous updates (via platoon, using NCCL)
- Gemv (matrix-vector product) speed-up for special shapes
- cublas gemv workaround when we reduce on an axis with a dimension of size 0
- Warn the user that some cuDNN algorithms may produce unexpected results in certain environments
  for convolution backward filter operations
- ``GPUMultinomialFromUniform`` op now supports multiple dtypes
- Support for ``MaxAndArgMax`` for some axis combinations
- Support for solve (using cusolver), erfinv and erfcinv
- Implemented ``GpuAdvancedSubtensor``

New features:

- Added scalar and elemwise ops for the modified Bessel functions of order 0 and 1 from scipy.special
- ``OpFromGraph`` now allows gradient overriding for every input
- Added abstract ops for batch normalization that use cuDNN when available, and pure Theano CPU/GPU alternatives otherwise
- Added gradient of solve, tensorinv (CPU), tensorsolve (CPU), searchsorted (CPU), DownsampleFactorMaxGradGrad (CPU)
- Added Multinomial Without Replacement
- Allowed partial evaluation of compiled functions
- More Rop support
- Indexing supports ellipsis: ``a[..., 3]``, ``a[1, ..., 3]``
- Added ``theano.tensor.{tensor5,dtensor5, ...}``
- compiledir_format supports device
- Extended Theano flag ``dnn.enabled`` with new option ``no_check`` to help speed up cuDNN importation
- Added new Theano flag ``conv.assert_shape`` to check user-provided shapes at runtime (for debugging)
- Added new Theano flag ``cmodule.age_thresh_use``
- Added new Theano flag ``cuda.enabled``
- Added new Theano flag ``nvcc.cudafe`` to enable faster compilation and import with the old CUDA back-end
- Added new Theano flag ``print_global_stats`` to print some global statistics (time spent) at the end
- Added new Theano flag ``profiling.ignore_first_call``, useful to profile the new gpu back-end
- Removed ProfileMode (use Theano flag ``profile=True`` instead)

Others:

- Split op now has C code for CPU and GPU
- ``theano-cache list`` now includes compilation times
- Speed up argmax only on GPU (without also needing the max)
- More stack traces in error messages
- Speed up cholesky grad
- ``log(sum(exp(...)))`` now gets stability-optimized

Other more detailed changes:

- Added Jenkins (gpu tests run on pull requests in addition to daily buildbot)
- Removed old benchmark directory and other old files not used anymore
- Use of 64-bit indexing in sparse ops to allow matrices with more than 2\ :sup:`31`\ -1 elements
- Allowed more than one output to be a destructive inplace
- More support of negative axis
- Added the keepdims parameter to the norm function
- Made scan gradient more deterministic

Committers since 0.9.0rc4:

- Frederic Bastien
- Zhouhan LIN
- Tegan Maharaj
- Arnaud Bergeron
- Matt Graham
- Saizheng Zhang
- affanv14
- Chiheb Trabelsi
- Pascal Lamblin
- Cesar Laurent
- Reyhane Askari
- Aarni Koskela

Committers since 0.8.0:

- Frederic Bastien
- Arnaud Bergeron
- Pascal Lamblin
- Steven Bocco
- Ramana Subramanyam
- Simon Lefrancois
- Gijs van Tulder
- Benjamin Scellier
- khaotik
- Chiheb Trabelsi
- Cesar Laurent
- Chinnadhurai Sankar
- Reyhane Askari
- Mohammad Pezeshki
- Alexander Matyasko
- Alexandre de Brebisson
- Saizheng Zhang
- Mathieu Germain
- Nan Rosemary Ke
- Pierre Luc Carrier
- Olivier Mastropietro
- Thomas George
- Zhouhan LIN
- Iulian Vlad Serban
- Matt Graham
- Francesco Visin
- Caglar
- Faruk Ahmed
- Harm de Vries
- Samira Shabanian
- Vincent Dumoulin
- Nicolas Ballas
- affanv14
- Jakub Sygnowski
- Jan Schlüter
- Samira Ebrahimi Kahou
- Mikhail Korobov
- Fei Wang
- Kv Manohar
- Tegan Maharaj
- Jesse Livezey
- Kelvin Xu
- Ruslana Makovetsky
- Sina Honari
- Bryn Keller
- Ciyong Chen
- Vitaliy Kurlin
- Gokula Krishnan
- Kumar Krishna Agrawal
- Ozan Çağlayan
- Vincent Michalski
- Amjad Almahairi
- Ray Donnelly
- Tim Cooijmans
- happygds
- mockingjamie
- Christos Tsirigotis
- Florian Bordes
- Ilya Kulikov
- RadhikaG
- Taesup (TS) Kim
- Ying Zhang
- Anton Chechetka
- Karthik Karanth
- Kirill Bobyrev
- Rebecca N. Palmer
- Yang Zhang
- Yaroslav Ganin
- Jonas Degrave
- Liwei Cai
- Lucas Beyer
- Michael Harradon
- Morgan Stuart
- Tim Gasper
- Xavier Bouthillier
- p
- texot
- Aarni Koskela
- Andrés Gottlieb
- Ben Poole
- Bhavishya Pohani
- Carl Thomé
- David Bau
- Dimitar Dimitrov
- Evelyn Mitchell
- Fei Zhan
- Fuchai
- Fábio Perez
- Gennadiy Tupitsin
- Gilles Louppe
- Greg Ciccarelli
- He
- Huan Zhang
- Kaixhin
- Kevin Keraudren
- Maltimore
- Marc-Alexandre Cote
- Marco
- Marius F. Killinger
- Martin Drawitsch
- Maxim Kochurov
- Micah Bojrab
- Neil
- Nizar Assaf
- Rithesh Kumar
...@@ -351,5 +270,6 @@ Committers since 0.8.0:
- root
- superantichrist
- tillahoffmann
- valtron
- wazeerzulfikar
- you-n-g
...@@ -15,115 +15,147 @@ git shortlog -sn rel-0.8.0.. ...@@ -15,115 +15,147 @@ git shortlog -sn rel-0.8.0..
TODO: better Theano conv doc TODO: better Theano conv doc
# NB: Following notes are related to final release 0.9.0.
Highlights: Highlights:
- Better integration of Theano+libgpuarray packages into conda distribution
- Better handling of Windows end-lines into C codes
- Better compatibility with NumPy 1.12
- Faster scan optimizations
- Fixed broadcast checking in scan
- Bug fixes related to merge optimizer and shape inference
- many other bug fixes and improvements
- Updated documentation
- Many computation and compilation speed up
- More numerical stability by default for some graph
- Jenkins (gpu tests run on PR in addition to daily buildbot)
- Better handling of corner cases for theano functions and graph optimizations
- More graph optimization (faster execution and smaller graph, so more readable)
- Less c code compilation
- Better Python 3.5 support - Better Python 3.5 support
- Better numpy 1.12 support - Better numpy 1.12 support
- Support newer Mac and Windows version
- Conda packages for Mac, Linux and Windows - Conda packages for Mac, Linux and Windows
- Theano scripts now works on Windows - Support newer Mac and Windows versions
- scan with checkpoint (trade off between speed and memory usage, useful for long sequences) - More Windows integration:
- Added a bool dtype
- Theano scripts (``theano-cache`` and ``theano-nose``) now works on Windows
- Better support for Windows end-lines into C codes
- Support for space in paths on Windows
- Scan improvements:
- More scan optimizations, with faster compilation and gradient computation
- Support for checkpoint in scan (trade off between speed and memory usage, useful for long sequences)
- Fixed broadcast checking in scan
- Graphs improvements:
- More numerical stability by default for some graphs
- Better handling of corner cases for theano functions and graph optimizations
- More graph optimizations with faster toposort, compilation and execution
- smaller and more readable graph
- Less C code compilation
- Added support for bool dtype
- Updated and more complete documentation
- Bug fixes related to merge optimizer and shape inference
- Bug fixes related to Debug mode
- Lot of other bug fixes, crashes fixes and warning improvements
- New GPU back-end: - New GPU back-end:
- Fixed offset error in GpuIncSubtensor - Removed warp-synchronous programming to get good results with newer CUDA drivers
- Fixed indexing error in GpuAdvancedSubtensor for more than 2 dimensions - More pooling support on GPU when cuDNN isn't available
- Value of a shared variable is now set inplace - Full support of ignore_border option for pooling
- Inplace storage for shared variables
- float16 storage - float16 storage
- better mapping between theano device number and nvidia-smi number, using the PCI bus ID of graphic cards - Using PCI bus ID of graphic cards for a better mapping between theano device number and nvidia-smi number
- More pooling support on GPU when cuDNN isn't there - Added useful stats for GPU in profile mode
- ignore_border=False is now implemented for pooling - Added documentation for GPU float16 ops
- Removed warp-synchronous programming - Fixed offset error in GpuIncSubtensor
Interface changes:

- Merged duplicated diagonal functions into two ops: ExtractDiag (extract a diagonal to a vector),
  and AllocDiag (set a vector as a diagonal of an empty array)
- Merged CumsumOp/CumprodOp into CumOp
- Changed grad() method to L_op in many ops that need the outputs to compute gradient
- In MRG module:

    - Replaced method ``multinomial_wo_replacement()`` with new method ``choice()``
    - Random generator now tries to infer the broadcast pattern of its output

- New pooling interface
- Pooling parameters can change at run time
- Moved softsign out of sandbox to theano.tensor.nnet.softsign
- Using floatX dtype when converting empty list/tuple
- Roll makes the shift modulo the size of the axis we roll on
- round() defaults to the same as NumPy: half_to_even
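Two of the behaviour changes above can be illustrated with plain Python (hypothetical helper names; a sketch of the semantics, not Theano's implementation):

```python
# Sketch of two semantics noted above (hypothetical pure-Python helpers).

# Roll: the shift is taken modulo the size of the rolled axis.
def roll(seq, shift):
    shift %= len(seq)
    return list(seq[-shift:]) + list(seq[:-shift]) if shift else list(seq)

print(roll([1, 2, 3, 4], 5))  # same result as a shift of 1: [4, 1, 2, 3]

# round: half-to-even ("banker's rounding"), the NumPy and Python 3 default.
print([round(v) for v in (0.5, 1.5, 2.5, 3.5)])  # [0, 2, 2, 4]
```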
Convolution updates:

- Allowed pooling of empty batch
- Implemented ``conv2d_transpose`` convenience function
- Multi-core convolution and pooling on CPU
- New abstract 3d convolution interface similar to the 2d convolution interface
- Dilated convolution
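For reference, dilated convolution reads kernel taps spaced by the dilation rate; a minimal 1-d sketch (hypothetical helper, correlation orientation with no kernel flip, which the 2d op generalizes per axis):

```python
# 1-d sketch of dilated convolution: a dilation rate d reads input taps
# d elements apart, as if d-1 zeros were inserted between kernel weights.
# (Hypothetical helper; correlation orientation, no kernel flip.)
def dilated_conv1d(x, w, dilation=1):
    span = (len(w) - 1) * dilation + 1          # receptive field size
    return [sum(w[k] * x[i + k * dilation] for k in range(len(w)))
            for i in range(len(x) - span + 1)]

print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))  # [4, 6, 8]
```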
GPU:

- cuDNN: support version 5.1 and wrap batch normalization (2d and 3d) and RNN functions
- Multiple-GPU, synchronous update (via platoon, using NCCL)
- Gemv (matrix-vector product) speed up for special shapes
- cublas gemv workaround when we reduce on an axis with a dimension size of 0
- Warn user that some cuDNN algorithms may produce unexpected results in certain environments
  for convolution backward filter operations
- GPUMultinomialFromUniform op now supports multiple dtypes
- Support for MaxAndArgMax for some axis combinations
- Support for solve (using cusolver), erfinv and erfcinv
- Implemented GpuAdvancedSubtensor
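For context on the Gemv item above: Gemv is the BLAS matrix-vector product ``y = alpha*A@x + beta*y``; a pure-Python reference of the operation (illustrative only, the speed-up concerns the tuned GPU kernel):

```python
# Pure-Python reference for the BLAS Gemv operation, y = alpha*A@x + beta*y.
# (Illustrative only; not the tuned GPU kernel the release note refers to.)
def gemv(alpha, A, x, beta, y):
    return [alpha * sum(a * b for a, b in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]

print(gemv(1.0, [[1, 2], [3, 4]], [1, 1], 0.0, [0, 0]))  # [3.0, 7.0]
```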
New features:

- Added scalar and elemwise ops for modified Bessel functions of order 0 and 1 from scipy.special
- OpFromGraph now allows gradient overriding for every input
- Added Abstract Ops for batch normalization that use cuDNN when available and pure Theano CPU/GPU alternatives otherwise
- Added gradient of solve, tensorinv (CPU), tensorsolve (CPU), searchsorted (CPU), DownsampleFactorMaxGradGrad (CPU)
- Added Multinomial Without Replacement
- conv3d2d supports full and half mode (REMOVE?)
- Allowed partial evaluation of compiled function
- More Rop support
- Indexing supports ellipsis: ``a[..., 3]``, ``a[1, ..., 3]``
- Added theano.tensor.{tensor5,dtensor5, ...}
- compiledir_format supports device
- Extended Theano flag dnn.enabled with new option ``no_check`` to help speed up cuDNN importation
- Added new Theano flag conv.assert_shape to check user-provided shapes at runtime (for debugging)
- Added new Theano flag cuda.enabled
- Added new Theano flag print_global_stats to print some global statistics (time spent) at the end
- Added new Theano flag cmodule.age_thresh_use
- Added new Theano flag nvcc.cudafe to enable faster compilation and import with old CUDA back-end
- Added new Theano flag profiling.ignore_first_call, useful to profile the new gpu back-end
- Removed ProfileMode (use Theano flag ``profile=True`` instead)
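The abstract batch-normalization ops listed above wrap the standard transform: normalize to zero mean and unit variance, then scale and shift. A minimal sketch of that transform (illustrative only; the Theano ops additionally handle gradients and cuDNN dispatch):

```python
import math

# Standard batch-normalization transform: normalize to zero mean / unit
# variance, then scale by gamma and shift by beta. (Minimal sketch; the
# Theano ops additionally provide gradients and cuDNN dispatch.)
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

out = batch_norm([1.0, 2.0, 3.0])
print(out)  # zero mean, roughly unit variance
```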
Others:

- Split op now has C code for CPU and GPU
- ``theano-cache list`` now includes compilation times
- Speed up argmax only on GPU (without also needing the max)
- More stack trace in error messages
- Speed up cholesky grad
- ``log(sum(exp(...)))`` now gets stability optimized
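The ``log(sum(exp(...)))`` rewrite above is the classic log-sum-exp stabilization; a sketch of the transformed computation:

```python
import math

# Stabilized log-sum-exp: factoring out the max keeps exp() in range, so
# inputs like 1000.0 no longer overflow as they would in the naive form.
def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

print(logsumexp([1000.0, 1000.0]))  # 1000 + log(2); the naive form overflows
```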
Other more detailed changes:

- Added Jenkins (gpu tests run on pull requests in addition to daily buildbot)
- Removed old benchmark directory and other old files not used anymore
- Use of 64-bit indexing in sparse ops to allow matrices with more than 2\ :sup:`31`\ -1 elements
- Allowed more than one output to be a destructive inplace
- Doc/error message fixes/updates
- More support for negative axis
- Added the keepdims parameter to the norm function
- Crash fixes
- Made scan gradient more deterministic
- Added support for spaces in paths on Windows
ALL THE PRs BELOW HAVE BEEN CHECKED

* https://github.com/Theano/Theano/pull/5715
* https://github.com/Theano/Theano/pull/5502
* https://github.com/Theano/Theano/pull/5533
* https://github.com/Theano/Theano/pull/5660
* https://github.com/Theano/Theano/pull/5682
* https://github.com/Theano/Theano/pull/5704
* https://github.com/Theano/Theano/pull/5687
* https://github.com/Theano/Theano/pull/5455
* https://github.com/Theano/Theano/pull/5667
* https://github.com/Theano/Theano/pull/5554
* https://github.com/Theano/Theano/pull/5486
* https://github.com/Theano/Theano/pull/5567
* https://github.com/Theano/Theano/pull/5615
* https://github.com/Theano/Theano/pull/5672
* https://github.com/Theano/Theano/pull/5524
* https://github.com/Theano/Theano/pull/5693
* https://github.com/Theano/Theano/pull/5702
* https://github.com/Theano/Theano/pull/5697
@@ -74,7 +74,7 @@ copyright = '2008--2017, LISA lab'
 # The short X.Y version.
 version = '0.9'
 # The full version, including alpha/beta/rc tags.
-release = '0.9.0rc4'
+release = '0.9.0'
 # There are two options for replacing |today|: either, you set today to some
 # non-false value, then it is used:

@@ -21,6 +21,8 @@ learning/machine learning <https://mila.umontreal.ca/en/cours/>`_ classes).
 News
 ====
+* 2017/03/20: Release of Theano 0.9.0. Everybody is encouraged to update.
 * 2017/03/13: Release of Theano 0.9.0rc4, with crash fixes and bug fixes.
 * 2017/03/06: Release of Theano 0.9.0rc3, with crash fixes, bug fixes and improvements.

@@ -33,7 +33,7 @@ Edit ``setup.py`` to contain the newest version number ::
 cd Theano-0.X
 vi setup.py # Edit the MAJOR, MINOR, MICRO and SUFFIX
-``conf.py`` in the ``doc/`` directory should be updated in the following ways:
+``Theano/doc/conf.py`` should be updated in the following ways:
 * Change the ``version`` and ``release`` variables to new version number.
 * Change the upper copyright year to the current year if necessary.

@@ -165,7 +165,7 @@ Note: There is no short term plan to support multi-node computation.
 Theano Vision State
 ===================
-Here is the state of that vision as of March 13th, 2017 (after Theano 0.9.0rc4):
+Here is the state of that vision as of March 20th, 2017 (after Theano 0.9.0):
 * We support tensors using the `numpy.ndarray` object and we support many operations on them.
 * We support sparse types by using the `scipy.{csc,csr,bsr}_matrix` object and support some operations on them.

@@ -53,7 +53,7 @@ PLATFORMS = ["Windows", "Linux", "Solaris", "Mac OS-X", "Unix"]
 MAJOR = 0
 MINOR = 9
 MICRO = 0
-SUFFIX = "rc4"  # Should be blank except for rc's, betas, etc.
+SUFFIX = ""  # Should be blank except for rc's, betas, etc.
 ISRELEASED = False
 VERSION = '%d.%d.%d%s' % (MAJOR, MINOR, MICRO, SUFFIX)