Commit 7a0b8177 authored by James Bergstra

merge nc

......@@ -5,6 +5,141 @@
Release Notes
=============
Theano 0.4.0 (2011-06-13)
=========================
Change in output memory storage for Ops:
If you implemented custom Ops, with either C or Python implementation,
this will concern you.
The contract for memory storage of Ops has been changed. In particular,
it is no longer guaranteed that output memory buffers are either empty,
or allocated by a previous execution of the same Op.
Right now, here is the situation:
* For Python implementation (perform), what is inside output_storage
may have been allocated from outside the perform() function, for
instance by another node (e.g., Scan) or the Mode. If that was the
case, the memory can be assumed to be C-contiguous (for the moment).
* For C implementations (c_code), nothing has changed yet.
In a future version, the content of the output storage, both for Python and C
versions, will either be NULL, or have the following guarantees:
* It will be a Python object of the appropriate Type (for a Tensor variable,
a numpy.ndarray, for a GPU variable, a CudaNdarray, for instance)
* It will have the correct number of dimensions, and correct dtype
However, its shape and memory layout (strides) will not be guaranteed.
When that change is made, the config flag DebugMode.check_preallocated_output
will help you find implementations that are not up-to-date.
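For Op authors, a minimal sketch of the defensive pattern this implies in
perform() is shown below; the Op (an elementwise doubling) is hypothetical
and only illustrates reusing the preallocated buffer when it matches::

    import numpy

    def perform(self, node, inputs, output_storage):
        x, = inputs
        out = output_storage[0]
        z = out[0]
        # Reuse the preallocated buffer only if it has the right shape
        # and dtype; otherwise allocate a fresh array.
        if z is None or z.shape != x.shape or z.dtype != x.dtype:
            z = numpy.empty(x.shape, dtype=x.dtype)
        z[...] = 2 * x
        out[0] = z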
Deprecation:
* tag.shape attribute deprecated (#633)
* CudaNdarray_new_null is deprecated in favour of CudaNdarray_New
* Dividing integers with / is deprecated: use // for integer division, or
cast one of the integers to a float type if you want a float result (you may
also change this behavior with config.int_division); see the example after this list.
* Removed (already deprecated) sandbox/compile module
* Removed (already deprecated) incsubtensor and setsubtensor functions,
inc_subtensor and set_subtensor are to be used instead.
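A hedged illustration of the new convention (variable names are made up)::

    import theano.tensor as T

    x = T.iscalar('x')
    y = T.iscalar('y')
    q = x // y                    # integer division, unaffected by the change
    f = T.cast(x, 'float64') / y  # cast explicitly if you want a float result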
Bugs fixed:
* In CudaNdarray.__{iadd,idiv}__, when it is not implemented, return the error.
* THEANO_FLAGS='optimizer=None' now works as expected
* Fixed memory leak in error handling on GPU-to-host copy
* Fix relating specifically to Python 2.7 on Mac OS X
* infer_shape can now handle Python longs
* Trying to compute x % y with one or more arguments being complex now
raises an error.
* The output of random samples computed with uniform(..., dtype=...) is
guaranteed to be of the specified dtype instead of potentially being of a
higher-precision dtype.
* The perform() method of DownsampleFactorMax did not give the right result
when reusing output storage. This happens only if you use the Theano flags
'linker=c|py_nogc' or manually specify the mode to be 'c|py_nogc'.
Crash fixed:
* Work around a bug in gcc 4.3.0 that makes the compilation of the 2d convolution
crash.
* Some optimizations crashed when the "ShapeOpt" optimization was disabled.
Optimization:
* Optimize all subtensor followed by subtensor.
GPU:
* Move to the GPU fused elemwise ops that contain dtypes other than float32
(except float64), if the inputs and outputs are float32.
* This allows moving elemwise comparisons to the GPU if we cast the result to
float32 afterwards.
* Implemented CudaNdarray.ndim to have the same interface as ndarray.
* Fixed slowdown caused by multiple chained views on CudaNdarray objects
* CudaNdarray_alloc_contiguous changed so as to never try to free
memory on a view: new "base" property
* Safer decref behaviour in CudaNdarray in case of failed allocations
* New GPU implementation of tensor.basic.outer
* Multinomial random variates now available on GPU
New features:
* ProfileMode
* profile the scan overhead
* simple hook system to add profiler
* reordered the output to be in the order of more general to more specific
* DebugMode now checks Ops with different patterns of preallocated memory,
configured by config.DebugMode.check_preallocated_output.
* var[vector of indices] now works (grad works recursively, the direct grad
works inplace, gpu works)
* limitation: works only on the outermost dimensions.
* New way to test the graph as we build it. Allows easily finding the source
of shape mismatch errors:
`http://deeplearning.net/software/theano/tutorial/debug_faq.html#interactive-debugger`__
* cuda.root inferred if nvcc is on the path, otherwise defaults to
/usr/local/cuda
* Better graph printing for graphs involving a scan subgraph
* Casting behavior can be controlled through config.cast_policy,
new (experimental) mode.
* Smarter C module cache, avoiding erroneous usage of the wrong C
implementation when some options change, and avoiding recompiling the
same module multiple times in some situations.
* The "theano-cache clear" command now clears the cache more thoroughly.
* More extensive linear algebra ops (CPU only) that wrap scipy.linalg
now available in the sandbox.
* CUDA devices 4 - 16 should now be available if present.
* infer_shape support for the View op, better infer_shape support in Scan
* infer_shape supported in all cases of subtensor
* tensor.grad now gives an error by default when computing the gradient
wrt a node that is disconnected from the cost (not in the graph, or
no continuous path from that op to the cost).
* New tensor.isnan and isinf functions.
Documentation:
* Better commenting of cuda_ndarray.cu
* Fixes in the scan documentation: add missing declarations/print statements
* Better error message on failed __getitem__
* Updated documentation on profile mode
* Better documentation of testing on Windows
* Better documentation of the 'run_individual_tests' script
Unit tests:
* More strict float comparison by default
* Reuse the tensor subtensor tests for GPU tensors (more GPU tests)
* Tests that check for aliased function inputs and assure appropriate copying
(#374)
* Better test of copies in CudaNdarray
* New tests relating to the new base pointer requirements
* Better scripts to run tests individually or in batches
* Some tests are now run whenever cuda is available and not just when it has
been enabled before
* Tests display less pointless warnings.
Other:
* Correctly put the broadcast flag to True in the output var of
a Reshape op when we receive an int 1 in the new shape.
* pydotprint: high contrast mode is now the default, option to print
more compact node names.
* pydotprint: Now truncates labels that are too long.
* More compact printing (ignore leading "Composite" in op names)
Theano 0.3.1 (2011-02-21)
=========================
......
Modifications in the trunk since the last release
Modifications in the 0.4.1 release candidate 1 (28 July 2011)
Theano 0.4.0 (2011-06-27)
--------------------------------------------------
Deprecation (will be removed in Theano 0.5):
* The string mode (accepted only by theano.function()) FAST_RUN_NOGC. Use Mode(linker='c|py_nogc') instead.
* The string mode (accepted only by theano.function()) STABILIZE. Use Mode(optimizer='stabilize') instead.
* scan interface change:
* The use of `return_steps` for specifying how many entries of the output
scan should return has been deprecated (see the sketch after this list)
* The same thing can be done by applying a subtensor to the output
returned by scan to select a certain slice
* The inner function (that scan receives) should return its outputs and
updates following this order:
[outputs], [updates], [condition]. One can skip any of the three if not
used, but the order has to stay unchanged.
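A hedged sketch of the replacement pattern (variable names are made up; only
the use of a subtensor instead of `return_steps` is the point)::

    import theano
    import theano.tensor as T

    x0 = T.vector('x0')
    results, updates = theano.scan(fn=lambda prev: 2 * prev,
                                   outputs_info=x0,
                                   n_steps=10)
    last_three = results[-3:]   # instead of return_steps=3
    f = theano.function([x0], last_three, updates=updates)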
Change in output memory storage for Ops:
If you implemented custom Ops, with either C or Python implementation,
this will concern you.
The contract for memory storage of Ops has been changed. In particular,
it is no longer guaranteed that output memory buffers are either empty,
or allocated by a previous execution of the same Op.
Right now, here is the situation:
* For Python implementation (perform), what is inside output_storage
may have been allocated from outside the perform() function, for
instance by another node (e.g., Scan) or the Mode. If that was the
case, the memory can be assumed to be C-contiguous (for the moment).
* For C implementations (c_code), nothing has changed yet.
In a future version, the content of the output storage, both for Python and C
versions, will either be NULL, or have the following guarantees:
* It will be a Python object of the appropriate Type (for a Tensor variable,
a numpy.ndarray, for a GPU variable, a CudaNdarray, for instance)
* It will have the correct number of dimensions, and correct dtype
However, its shape and memory layout (strides) will not be guaranteed.
When that change is made, the config flag DebugMode.check_preallocated_output
will help you find implementations that are not up-to-date.
Deprecation:
Deprecated in 0.4.0:
* tag.shape attribute deprecated (#633)
* CudaNdarray_new_null is deprecated in favour of CudaNdarray_New
* Dividing integers with / is deprecated: use // for integer division, or
cast one of the integers to a float type if you want a float result (you may
also change this behavior with config.int_division).
* Removed (already deprecated) sandbox/compile module
* Removed (already deprecated) incsubtensor and setsubtensor functions,
inc_subtensor and set_subtensor are to be used instead.
New features:
* `R_op <http://deeplearning.net/software/theano/tutorial/gradients.html>`_ macro like theano.tensor.grad
* Not all tests are done yet (TODO)
* Added alias theano.tensor.bitwise_{and,or,xor,not}. They are the numpy names.
* Updates returned by Scan (you need to pass them to theano.function) are now a new Updates class.
That allows more checks and makes them easier to work with. The Updates class is a subclass of dict
* Scan can now work in a "do while" loop style.
* We scan until a condition is met.
* There is a minimum of 1 iteration (it can't do a "while do" style loop)
* The "Interactive Debugger" (compute_test_value theano flags)
* Now should work with all ops (even ones with only C code)
* In the past some errors were caught and re-raised as unrelated errors (ShapeMismatch replaced with NotImplemented). We don't do that anymore.
* The new Op.make_thunk function (introduced in 0.4.0) is now used by constant_folding and DebugMode
* Added A_TENSOR_VARIABLE.astype() as a way to cast. NumPy allows this syntax.
* New BLAS GER implementation.
* Insert GEMV more frequently.
* Added new ifelse(scalar condition, rval_if_true, rval_if_false) Op.
* This is a subset of the elemwise switch (tensor condition, rval_if_true, rval_if_false).
* With the new feature in the sandbox, only one of rval_if_true or rval_if_false will be evaluated.
Optimizations:
* Subtensor has C code
* {Inc,Set}Subtensor has C code
* ScalarFromTensor has C code
* dot(zeros,x) and dot(x,zeros)
* IncSubtensor(x, zeros, idx) -> x
* SetSubtensor(x, x[idx], idx) -> x (when x is a constant)
* subtensor(alloc,...) -> alloc
* Many new scan optimizations
* Lower scan execution overhead with a Cython implementation
* Removed scan double compilation (by using the new Op.make_thunk mechanism)
* Certain computations from the inner graph are now pushed out into the outer
graph. This means they are not recomputed at every step of scan.
* Different scan ops now get merged into a single op (if possible), reducing the
overhead and sharing computations between the two instances
GPU:
* PyCUDA/Theano bridge and `documentation <http://deeplearning.net/software/theano/tutorial/pycuda.html>`_.
* New function to easily convert pycuda GPUArray object to and from CudaNdarray object
* Fixed a bug when creating a view of a manually created CudaNdarray that is a view of a GPUArray.
* Removed a warning when nvcc is not available and the user did not request it.
* renamed config option cuda.nvccflags -> nvcc.flags
Bugs fixed:
* In CudaNdarray.__{iadd,idiv}__, when it is not implemented, return the error.
* THEANO_FLAGS='optimizer=None' now works as expected
* Fixed memory leak in error handling on GPU-to-host copy
* Fix relating specifically to Python 2.7 on Mac OS X
* infer_shape can now handle Python longs
* Trying to compute x % y with one or more arguments being complex now
raises an error.
* The output of random samples computed with uniform(..., dtype=...) is
guaranteed to be of the specified dtype instead of potentially being of a
higher-precision dtype.
* The perform() method of DownsampleFactorMax did not give the right result
when reusing output storage. This happens only if you use the Theano flags
'linker=c|py_nogc' or manually specify the mode to be 'c|py_nogc'.
* In one case an AdvancedSubtensor1 could be converted to a GpuAdvancedIncSubtensor1 instead of GpuAdvancedSubtensor1.
It probably didn't happen due to the order of optimizations, but that order is not guaranteed to be the same on all computers.
* Derivative of set_subtensor was wrong.
* Derivative of Alloc was wrong.
Crash fixed:
* Work around a bug in gcc 4.3.0 that makes the compilation of the 2d convolution
crash.
* Some optimizations crashed when the "ShapeOpt" optimization was disabled.
* On an unusual Python 2.4.4 on Windows
* When using a C cache copied from another location
* On Windows 32 bits when setting a complex64 to 0.
* Compilation crash with CUDA 4
* When wanting to copy the compilation cache from one computer to another
* This can be useful for using Theano on a computer without a compiler.
* GPU:
* Compilation crash fixed under Ubuntu 11.04
* Compilation crash fixed with CUDA 4.0
Optimization:
* Optimize all subtensor followed by subtensor.
GPU:
* Move to the GPU fused elemwise ops that contain dtypes other than float32
(except float64), if the inputs and outputs are float32.
* This allows moving elemwise comparisons to the GPU if we cast the result to
float32 afterwards.
* Implemented CudaNdarray.ndim to have the same interface as ndarray.
* Fixed slowdown caused by multiple chained views on CudaNdarray objects
* CudaNdarray_alloc_contiguous changed so as to never try to free
memory on a view: new "base" property
* Safer decref behaviour in CudaNdarray in case of failed allocations
* New GPU implementation of tensor.basic.outer
* Multinomial random variates now available on GPU
Sandbox:
New features:
* ProfileMode
* profile the scan overhead
* simple hook system to add profiler
* reordered the output to be in the order of more general to more specific
* DebugMode now checks Ops with different patterns of preallocated memory,
configured by config.DebugMode.check_preallocated_output.
* var[vector of indices] now works (grad works recursively, the direct grad
works inplace, gpu works)
* limitation: works only on the outermost dimensions.
* New way to test the graph as we build it. Allows easily finding the source
of shape mismatch errors:
`http://deeplearning.net/software/theano/tutorial/debug_faq.html#interactive-debugger`__
* cuda.root inferred if nvcc is on the path, otherwise defaults to
/usr/local/cuda
* Better graph printing for graphs involving a scan subgraph
* Casting behavior can be controlled through config.cast_policy,
new (experimental) mode.
* Smarter C module cache, avoiding erroneous usage of the wrong C
implementation when some options change, and avoiding recompiling the
same module multiple times in some situations.
* The "theano-cache clear" command now clears the cache more thoroughly.
* More extensive linear algebra ops (CPU only) that wrap scipy.linalg
now available in the sandbox.
* CUDA devices 4 - 16 should now be available if present.
* infer_shape support for the View op, better infer_shape support in Scan
* infer_shape supported in all cases of subtensor
* tensor.grad now gives an error by default when computing the gradient
wrt a node that is disconnected from the cost (not in the graph, or
no continuous path from that op to the cost).
* New tensor.isnan and isinf functions.
* MRG random generator now implements the same casting behavior as the regular random generator.
Sandbox New features(not enabled by default):
* New Linkers (theano flags linker={vm,cvm})
* The new linker allows lazy evaluation of the new ifelse op, meaning we compute only the true or false branch depending on the condition. This can speed up some types of computation.
* Uses a new profiling system (that currently tracks less stuff)
* The cvm is implemented in C, so it lowers Theano's overhead.
* The vm is implemented in python. So it can help debugging in some cases.
* In the future, the default will be the cvm.
* Some new not yet well tested sparse ops: theano.sparse.sandbox.{SpSum, Diag, SquareDiagonal, ColScaleCSC, RowScaleCSC, Remove0, EnsureSortedIndices, ConvolutionIndices}
Documentation:
* Better commenting of cuda_ndarray.cu
* Fixes in the scan documentation: add missing declarations/print statements
* Better error message on failed __getitem__
* Updated documentation on profile mode
* Better documentation of testing on Windows
* Better documentation of the 'run_individual_tests' script
Unit tests:
* More strict float comparison by default
* Reuse the tensor subtensor tests for GPU tensors (more GPU tests)
* Tests that check for aliased function inputs and assure appropriate copying
(#374)
* Better test of copies in CudaNdarray
* New tests relating to the new base pointer requirements
* Better scripts to run tests individually or in batches
* Some tests are now run whenever cuda is available and not just when it has
been enabled before
* Tests display less pointless warnings.
Other:
* Correctly put the broadcast flag to True in the output var of
a Reshape op when we receive an int 1 in the new shape.
* pydotprint: high contrast mode is now the default, option to print
more compact node names.
* pydotprint: Now truncates labels that are too long.
* More compact printing (ignore leading "Composite" in op names)
* How to compute the `Jacobian, Hessian, Jacobian times a vector, Hessian times a vector <http://deeplearning.net/software/theano/tutorial/gradients.html>`_.
* Slides for a 3 hour class with exercises that was given at the HPCS2011 Conference in Montreal.
Others:
* Logger name renamed to be consistent.
* Logger function simplified and made more consistent.
* Fixed errors being transformed into other, unrelated errors when the compute_test_value Theano flag is used.
* Compilation cache enhancements.
* Made compatible with NumPy 1.6 and SciPy 0.9
* Fixed tests when there are new dtypes in NumPy that are not supported by Theano.
* Fixed some tests when SciPy is not available.
* Don't compile anything when Theano is imported. Compile support code when we compile the first C code.
* Python 2.4 fix:
* Fix the file theano/misc/check_blas.py
* For python 2.4.4 on Windows, replaced float("inf") with numpy.inf.
* Removes useless inputs to a scan node
* Beautification mostly, making the graph more visible. Such inputs would appear as a consequence of other optimizations
Core:
* There is a new mechanism that lets an Op permit one of its
inputs to be aliased to another destroyed input. This will generally
result in incorrect calculation, so it should be used with care! The
right way to use it is when the caller can guarantee that even if
these two inputs look aliased, they actually will never overlap. This
mechanism can be used, for example, by a new alternative approach to
implementing Scan. If an op has an attribute called
"destroyhandler_tolerate_aliased" then this is what's going on.
IncSubtensor is thus far the only Op to use this mechanism.
.. _advanced_theano:
***************
Advanced Theano
***************
Conditions
----------
**IfElse**
- Build condition over symbolic variables.
- IfElse Op takes a boolean condition and two variables to compute as input.
- While Switch Op evaluates both 'output' variables, IfElse Op is lazy and only
evaluates one variable with respect to the condition.
**IfElse Example: Comparison with Switch**
.. code-block:: python
from theano import tensor as T
from theano.lazycond import ifelse
import theano, time, numpy
a,b = T.scalars('a','b')
x,y = T.matrices('x','y')
z_switch = T.switch(T.lt(a,b), T.mean(x), T.mean(y))
z_lazy = ifelse(T.lt(a,b), T.mean(x), T.mean(y))
f_switch = theano.function([a,b,x,y], z_switch,
mode=theano.Mode(linker='vm'))
f_lazyifelse = theano.function([a,b,x,y], z_lazy,
mode=theano.Mode(linker='vm'))
val1 = 0.
val2 = 1.
big_mat1 = numpy.ones((10000,1000))
big_mat2 = numpy.ones((10000,1000))
n_times = 10
tic = time.clock()
for i in xrange(n_times):
f_switch(val1, val2, big_mat1, big_mat2)
print 'time spent evaluating both values %f sec'%(time.clock()-tic)
tic = time.clock()
for i in xrange(n_times):
f_lazyifelse(val1, val2, big_mat1, big_mat2)
print 'time spent evaluating one value %f sec'%(time.clock()-tic)
The IfElse Op spends less time (about half) than Switch since it computes only
one variable instead of both.
>>> python ifelse_switch.py
time spent evaluating both values 0.6700 sec
time spent evaluating one value 0.3500 sec
Note that IfElse condition is a boolean while Switch condition is a tensor, so
Switch is more general.
It is actually important to use ``linker='vm'`` or ``linker='cvm'``,
otherwise IfElse will compute both variables and take the same computation
time as the Switch Op. The linker is not currently set to 'cvm' by default, but
it will be in the near future.
Loops
-----
**Scan**
- General form of **recurrence**, which can be used for looping.
- **Reduction** and **map** (loop over the leading dimensions) are special cases of Scan
- You 'scan' a function along some input sequence, producing an output at each time-step
- The function can see the **previous K time-steps** of your function
- ``sum()`` could be computed by scanning the z + x(i) function over a list, given an initial state of ``z=0``.
- Often a for-loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- The advantages of using ``scan`` over for loops:
- The number of iterations can be part of the symbolic graph
- Minimizes GPU transfers if GPU is involved
- Compute gradients through sequential steps
- Slightly faster than using a for loop in Python with a compiled Theano function
- Can lower the overall memory usage by detecting the actual amount of memory needed
**Scan Example: Computing pow(A,k)**
.. code-block:: python
import theano
import theano.tensor as T
k = T.iscalar("k"); A = T.vector("A")
def inner_fct(prior_result, A): return prior_result * A
# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
outputs_info=T.ones_like(A),
non_sequences=A, n_steps=k)
# Scan has provided us with A**1 through A**k. Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]
power = theano.function(inputs=[A,k], outputs=final_result,
updates=updates)
print power(range(10),2)
#[ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
**Scan Example: Calculating a Polynomial**
.. code-block:: python
import theano
import theano.tensor as T
import numpy
coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x"); max_coefficients_supported = 10000
# Generate the components of the polynomial
full_range=theano.tensor.arange(max_coefficients_supported)
components, updates = theano.scan(fn=lambda coeff, power, free_var:
coeff * (free_var ** power),
outputs_info=None,
sequences=[coefficients, full_range],
non_sequences=x)
polynomial = components.sum()
calculate_polynomial = theano.function(inputs=[coefficients, x],
outputs=polynomial)
test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
print calculate_polynomial(test_coeff, 3)
# 19.0
Exercise 4
-----------
- Run both examples
- Modify and execute the polynomial example to have the reduction done by scan
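One possible sketch of a solution to the second part (an assumption, not the
official answer): let scan accumulate the running sum itself instead of calling
``components.sum()`` afterwards.

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    coefficients = T.dvector("coefficients")
    x = T.dscalar("x")
    max_coefficients_supported = 10000
    full_range = T.arange(max_coefficients_supported)

    # The accumulated sum is carried through outputs_info; each step adds
    # one polynomial term to the previous value.
    partial_sums, updates = theano.scan(
        fn=lambda coeff, power, prior_value, free_var:
            prior_value + coeff * (free_var ** power),
        outputs_info=T.zeros_like(x),
        sequences=[coefficients, full_range],
        non_sequences=x)
    polynomial = partial_sums[-1]   # the sum after the last term
    calculate_polynomial = theano.function([coefficients, x], polynomial,
                                           updates=updates)

    print calculate_polynomial(numpy.asarray([1, 0, 2], dtype='float64'), 3.0)
    # 19.0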
Compilation pipeline
--------------------
.. image:: ../hpcs2011_tutorial/pics/pipeline.png
:width: 400 px
Inplace optimization
......@@ -113,7 +252,7 @@ Theano output:
- Try the Theano flag floatX=float32
"""
Exercise 5
-----------
- In the last exercises, do you see a speed up with the GPU?
......@@ -167,19 +306,19 @@ Elemwise{Composite{neg,{sub,{{scalar_sigmoid,GT},neg}}}} [@183160204] '' 2
>>> theano.printing.pydotprint_variables(prediction)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_prediction.png
:width: 800 px
All pydotprint* functions require graphviz and pydot
>>> theano.printing.pydotprint(predict)
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_predic.png
:width: 800 px
>>> theano.printing.pydotprint(train) # This is a small train example!
.. image:: ../hpcs2011_tutorial/pics/logreg_pydotprint_train.png
:width: 1500 px
......@@ -206,85 +345,6 @@ Debugging
- Few optimizations
- Run Python code (better error messages and can be debugged interactively in the Python debugger)
Known limitations
-----------------
......@@ -304,5 +364,3 @@ Known limitations
- Disabling a few optimizations can speed up compilation
- Usually too many nodes indicates a problem with the graph
- Lazy evaluation in a branch (We will try to merge this summer)
......@@ -13,13 +13,13 @@ on the afternoons of Aug 2, 3, 5, and 6 (but not Aug 4th).
Day 1
-----
* Show of hands - what is your background?
* Python & Numpy in a nutshell
* Theano basics
* Quick tour through Deep Learning Tutorials (think about projects)
.. :
day 1:
......@@ -38,35 +38,31 @@ Day 1
Day 2
-----
* Loop/Condition in Theano (10-20m)
* Propose/discuss projects
* Form groups and start projects!
Day 3
-----
* Advanced Theano (30 minutes)
* Debugging, profiling, compilation pipeline
* Projects / General hacking / code-sprinting.
Day 4
-----
* *You choose* (we can split the group)
* Extending Theano
* How to write an Op
* How to use pycuda code in Theano
* Projects / General hacking / code-sprinting.
Note - the schedule here is a guideline.
We can adapt it in response to developments in the hands-on work.
The point is for you to learn something about the practice of machine
learning.
......@@ -14,7 +14,7 @@ Theano graphs
Inputs and Outputs are lists of Theano variables
.. image:: ../hpcs2011_tutorial/pics/apply_node.png
:width: 500 px
Op contract
......
......@@ -21,9 +21,10 @@ What does it do?
It complements the Python numeric/scientific software stack (e.g. numpy, scipy,
scikits, matplotlib, PIL.)
Design and feature set has been driven by machine learning research
at the University of
Montreal (groups of Yoshua Bengio, Pascal Vincent, Douglas Eck).
The result is a very good library for doing research in deep
learning and neural network training, and a flexible framework for
many other models and algorithms in machine learning more generally.
......@@ -53,7 +54,7 @@ calculations on other data structures.
Contents
--------
The structured part of these lab sessions will be a walk-through of the following
material. Interleaved with this structured part will be blocks of time for
individual or group work. The idea is that you can try out Theano and get help
from gurus on hand if you get stuck.
......
......@@ -9,169 +9,226 @@ Introduction
Background Questionnaire
-----------------------
* Who has used Theano before?
* What did you do with it?
* Who has used Python? numpy? scipy? matplotlib?
* Who has used iPython?
* Who has used it as a distributed computing engine?
* Who has done C/C++ programming?
* Who has organized computation around a particular physical memory layout?
* Who has used a multidimensional array of >2 dimensions?
* Who has written a Python module in C before?
* Who has written a program to *generate* Python modules in C?
* Who has used a templating engine?
* Who has programmed a GPU before?
* Using OpenGL / shaders ?
* Using CUDA (runtime? / driver?)
* Using PyCUDA ?
* Using OpenCL / PyOpenCL ?
* Using cudamat / gnumpy ?
* Other?
* Who has used Cython?
Python in one slide
-------------------
Features:
* General-purpose high-level OO interpreted language
* Emphasizes code readability
* Comprehensive standard library
* Dynamic type and memory management
* Built-in types: int, float, str, list, dict, tuple, object
* List comprehension: ``[i+3 for i in range(10)]``
* Slow execution
* Popular in web-dev and scientific communities
Syntax sample:
.. code-block:: python
#######################
# PYTHON SYNTAX EXAMPLE
#######################
a = 1 # no type declaration required!
b = (1,2,3) # tuple of three int literals
c = [1,2,3] # list of three int literals
d = {'a': 5, b: None} # dictionary of two elements
# N.B. string literal, None
print d['a'] # square brackets index
# -> 5
print d[(1,2,3)] # new tuple == b, retrieves None
# -> None
print d[6]
# raises KeyError Exception
x, y, z = 10, 100, 100 # multiple assignment from tuple
x, y, z = b # unpacking a sequence
b_squared = [b_i**2 for b_i in b] # list comprehension
def foo(b, c=3): # function w default param c
return a + b + c # note scoping, indentation
foo(5) # calling a function
# -> 1 + 5 + 3 == 9 # N.B. scoping
foo(b=6, c=2) # calling with named args
# -> 1 + 6 + 2 == 9
print b[1:3] # slicing syntax
class Foo(object): # Defining a class
def __init__(self):
self.a = 5
def hello(self):
return self.a
f = Foo() # Creating a class instance
print f.hello() # Calling methods of objects
# -> 5
class Bar(Foo): # Defining a subclass
def __init__(self, a):
self.a = a
print Bar(99).hello() # Creating an instance of Bar
# -> 99
Numpy in one slide
------------------
* Python floats are full-fledged objects on the heap
* Not suitable for high-performance computing!
* Numpy provides a N-dimensional numeric array in Python
* Perfect for high-performance computing.
* Numpy provides:
* elementwise computations
* linear algebra, Fourier transforms
* pseudorandom numbers from many distributions
* Scipy provides lots more, including:
* more linear algebra
* solvers and optimization algorithms
* matlab-compatible I/O
* I/O and signal processing for images and audio
.. code-block:: python
##############################
# Properties of Numpy arrays
# that you really need to know
##############################
import numpy as np # import can rename
a = np.random.rand(3,4,5) # random generators
a32 = a.astype('float32') # arrays are strongly typed
a.ndim # int: 3
a.shape # tuple: (3,4,5)
a.size # int: 60
a.dtype # np.dtype object: 'float64'
a32.dtype # np.dtype object: 'float32'
Arrays can be combined with numeric operators, standard mathematical
functions. Numpy has great `documentation <http://docs.scipy.org/doc/numpy/reference/>`_.
Training an MNIST-ready classification neural network in pure numpy might look like this:
.. code-block:: python
#########################
# Numpy for Training a
# Neural Network on MNIST
#########################
x = np.load('data_x.npy')
y = np.load('data_y.npy')
w = np.random.normal(
loc=0,
scale=.1,
size=(784, 500))
b = np.zeros((500,))
v = np.zeros((500, 10))
c = np.zeros((10,))
lr = 0.01 # learning rate (value assumed, not given in the original slide)
batchsize = 100
for i in xrange(1000):
x_i = x[i*batchsize:(i+1)*batchsize]
y_i = y[i*batchsize:(i+1)*batchsize]
hidin = np.dot(x_i, w) + b
hidout = np.tanh(hidin)
outin = np.dot(hidout, v) + c
outout = (np.tanh(outin)+1)/2.0
g_outout = outout - y_i
err = 0.5 * np.sum(g_outout**2)
g_outin = g_outout * outout * (1.0 - outout)
g_hidout = np.dot(g_outin, v.T)
g_hidin = g_hidout * (1 - hidout**2)
b -= lr * np.sum(g_hidin, axis=0)
c -= lr * np.sum(g_outin, axis=0)
w -= lr * np.dot(x_i.T, g_hidin)
v -= lr * np.dot(hidout.T, g_outin)
What's missing?
---------------
* Non-lazy evaluation (required by Python) hurts performance
* Numpy is bound to the CPU
* Numpy lacks symbolic or automatic differentiation
Now let's have a look at the same algorithm in Theano, which runs 15 times faster
if you have a GPU (I'm skipping some dtype details which we'll come back to).
.. code-block:: python
#########################
# Theano for Training a
# Neural Network on MNIST
#########################
import theano as T
import theano.tensor as TT
......@@ -188,12 +245,13 @@ you have GPU (I'm skipping some dtype-details which we'll come back to):
c = T.shared(np.zeros(10))
# symbolic expression-building
hid = TT.tanh(TT.dot(sx, w) + b)
out = TT.tanh(TT.dot(hid, v) + c)
err = 0.5 * TT.sum((out - sy)**2)
gw, gb, gv, gc = TT.grad(err, [w,b,v,c])
# compile a fast training function
train = T.function([sx, sy], err,
updates={
w:w - lr * gw,
b:b - lr * gb,
......@@ -201,6 +259,7 @@ you have GPU (I'm skipping some dtype-details which we'll come back to):
c:c - lr * gc})
# now do the computations
batchsize = 100
for i in xrange(1000):
x_i = x[i*batchsize:(i+1)*batchsize]
y_i = y[i*batchsize:(i+1)*batchsize]
......@@ -210,60 +269,83 @@ you have GPU (I'm skipping some dtype-details which we'll come back to):
Theano in one slide
-------------------
* High-level domain-specific language tailored to numeric computation
* Compiles most common expressions to C for CPU and GPU.
* Limited expressivity means lots of opportunities for expression-level optimizations
* No function call -> global optimization
* Strongly typed -> compiles to machine instructions
* Array oriented -> parallelizable across cores
* Expression substitution optimizations automatically draw
on many backend technologies for best performance.
* FFTW, MKL, ATLAS, Scipy, Cython, CUDA
* Slower fallbacks always available
* Support for looping and branching in expressions
* Automatic differentiation
* It used to have no/poor support for internal looping and conditional
expressions, but these are now quite usable.
Project status
--------------
* Mature: theano has been developed and used since January 2008 (3.5 yrs old)
* Driven over 40 research papers in the last few years
* Good user documentation
* Active mailing list with participants from outside our lab
* Core technology for a funded Silicon-Valley startup
* Many contributors (some from outside our lab)
* Used to teach IFT6266 for two years
* Used for research at Google and Yahoo.
* Unofficial RPMs for Mandriva
* Downloads (January 2011 - June 8 2011):
* Pypi 780
* MLOSS: 483
* Assembla (`bleeding edge` repository): unknown
Why scripting for GPUs?
-----------------------
They *complement each other*:
* GPUs are everything that scripting/high level languages are not
* Highly parallel
* Very architecture-sensitive
* Built for maximum FP/memory throughput
* So hard to program that meta-programming is easier.
* CPU: largely restricted to control
* Optimized for sequential code and low latency (rather than high throughput)
* Tasks (1000/sec)
* Scripting fast enough
Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
......@@ -271,28 +353,41 @@ Best of both: scripted CPU invokes JIT-compiled kernels on GPU.
How Fast are GPUs?
------------------
* Theory
* Intel Core i7 980 XE (107Gf/s float64) 6 cores
* NVIDIA C2050 (515 Gf/s float64, 1Tf/s float32) 480 cores
* NVIDIA GTX580 (1.5Tf/s float32) 512 cores
* GPUs are faster, cheaper, more power-efficient
* Practice (our experience)
* Depends on algorithm and implementation!
* Reported speed improvements over CPU in lit. vary *widely* (.01x to 1000x)
* Matrix-matrix multiply speedup: usually about 10-20x.
* Convolution speedup: usually about 15x.
* Elemwise speedup: slower or up to 100x (depending on operation and layout)
* Sum: can be faster or slower depending on layout.
* Benchmarking is delicate work...
* How to control quality of implementation?
* How much time was spent optimizing CPU vs GPU code?
* Theano goes up to 100x faster on GPU because it uses only one CPU core
* Theano can be linked with multi-core capable BLAS (GEMM and GEMV)
* If you see speedup > 100x, the benchmark is probably not fair.
Software for Directly Programming a GPU
......@@ -300,15 +395,27 @@ Software for Directly Programming a GPU
Theano is a meta-programmer, so it doesn't really count.
* CUDA: C extension by NVIDIA
* Vendor-specific
* Numeric libraries (BLAS, RNG, FFT) maturing.
* OpenCL: multi-vendor version of CUDA
* More general, standardized
* Fewer libraries, less adoption.
* PyCUDA: python bindings to CUDA driver interface
* Python interface to CUDA
* Memory management of GPU objects
* Compilation of code for the low-level driver
* Makes it easy to do GPU meta-programming from within Python
* PyOpenCL: PyCUDA for PyOpenCL
......@@ -8,46 +8,46 @@ Theano
Pointers
--------
* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html
Description
-----------
* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation
* Theano computes derivatives of functions with one or many inputs.
* Speed and stability optimizations
* Gives the right answer for ``log(1+x)`` even if x is really tiny.
* Works on Linux, Mac and Windows
* Transparent use of a GPU
* float32 only for now (working on other data types)
* Doesn't work on Windows for now
* On GPU data-intensive calculations are typically between 6.5x and 44x faster. We've seen speedups up to 140x
* Extensive unit-testing and self-verification
* Detects and diagnoses many types of errors
* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives
* including specialized implementations in C/C++, NumPy, SciPy, and Matlab
* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Some sparse operations (CPU only)
* The project was started by James Bergstra and Olivier Breuleux
* For the past 1-2 years, I have replaced Olivier as lead contributor
Simple example
--------------
......@@ -59,15 +59,13 @@ Simple example
>>> print f([0,1,2]) # prints `array([0,2,1026])`
====================================================== =====================================================
Unoptimized graph Optimized graph
====================================================== =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
====================================================== =====================================================
Symbolic programming = *Paradigm shift*: people need to use it to understand it.
Exercise 1
-----------
......@@ -91,10 +89,10 @@ Real example
**Logistic Regression**
* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations
.. code-block:: python
......@@ -142,6 +140,19 @@ Real example
**Optimizations:**
Where are those optimizations applied?
* ``log(1+exp(x))``
* ``1 / (1 + T.exp(var))`` (sigmoid)
* ``log(1-sigmoid(var))`` (softplus, stabilisation)
* GEMV (matrix-vector multiply from BLAS)
* Loop fusion
.. code-block:: python
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))
......@@ -159,22 +170,14 @@ Real example
updates={w:w-0.1*gw, b:b-0.1*gb})
Theano flags
------------
Theano can be configured with flags. They can be defined in two ways:
* With an environment variable: ``THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"``
* With a configuration file that defaults to ``~/.theanorc``
Exercise 2
......@@ -261,57 +264,69 @@ Modify and execute the example to run on CPU with floatX=float32
GPU
---
* Only 32 bit floats are supported (being worked on)
* Only 1 GPU per process
* Use the Theano flag ``device=gpu`` to tell Theano to use the GPU device
* Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
* Shared variables with float32 dtype are by default moved to the GPU memory space
* Use the Theano flag ``floatX=float32``
* Be sure to use ``floatX`` (``theano.config.floatX``) in your code
* Cast inputs before putting them into a shared variable
* Cast "problem": int32 with float32 to float64
* A new casting mechanism is being developed
* Insert manual cast in your code or use [u]int{8,16}
* Insert manual cast around the mean operator (which involves a division by the length, which is an int64!)
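A hypothetical minimal snippet following these guidelines (file and variable
names are made up); the same code runs on CPU or GPU depending on the flags,
e.g. ``THEANO_FLAGS=device=gpu,floatX=float32 python this_file.py``:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    # Cast to floatX before creating the shared variable, so it can live
    # in GPU memory when device=gpu and floatX=float32.
    data = numpy.random.rand(1000, 1000).astype(theano.config.floatX)
    w = theano.shared(data, name='w')

    x = T.matrix('x')
    f = theano.function([x], T.dot(x, w))
    print f(numpy.random.rand(5, 1000).astype(theano.config.floatX)).shape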
Exercise 3
-----------
* Modify and execute the example of `Exercise 2`_ to run with floatX=float32 on GPU
* Time with: ``time python file.py``
Symbolic variables
------------------
* # Dimensions
* T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4
* Dtype
* T.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
* T.vector to floatX dtype
* floatX: configurable dtype that can be float32 or float64.
* Custom variable
* All are shortcuts to: ``T.tensor(dtype, broadcastable=[False]*nd)``
* Other dtype: uint[8,16,32,64], floatX
Creating symbolic variables: Broadcastability
* Remember what I said about broadcasting?
* How to add a row to all rows of a matrix?
* How to add a column to all columns of a matrix?
Details regarding symbolic broadcasting...
* Broadcastability must be specified when creating the variable
* The only shortcuts with broadcastable dimensions are: **T.row** and **T.col**
* For all others: ``T.tensor(dtype, broadcastable=([False or True])*nd)``
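A small illustration of the broadcastable shortcuts (the arrays are made up;
only the broadcasting pattern matters):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    m = T.matrix('m')
    r = T.row('r')   # broadcastable=(True, False): repeated along rows
    c = T.col('c')   # broadcastable=(False, True): repeated along columns
    add_row = theano.function([m, r], m + r)
    add_col = theano.function([m, c], m + c)

    mat = numpy.ones((3, 2), dtype=theano.config.floatX)
    print add_row(mat, numpy.asarray([[1, 2]], dtype=theano.config.floatX))
    print add_col(mat, numpy.asarray([[1], [2], [3]], dtype=theano.config.floatX))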
Differentiation details
......@@ -319,11 +334,15 @@ Differentiation details
>>> gw,gb = T.grad(cost, [w,b])
* T.grad works symbolically: takes and returns a Theano variable
* T.grad can be compared to a macro: it can be applied multiple times
* T.grad takes scalar costs only
* A simple recipe allows computing vector x Jacobian and vector x Hessian efficiently
* We are working on the missing optimizations to be able to compute efficiently the full Jacobian and Hessian and Jacobian x vector
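A sketch of the "simple recipe" (a standard trick, not code from these slides):
the vector x Jacobian product is just the gradient of ``sum(v * y)`` with
respect to ``x``.

.. code-block:: python

    import theano
    import theano.tensor as T

    x = T.vector('x')
    v = T.vector('v')
    y = x ** 2                    # some vector-valued expression of x
    vJ = T.grad(T.sum(v * y), x)  # vector x Jacobian, no full Jacobian built
    f = theano.function([x, v], vJ)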
......@@ -332,20 +351,20 @@ Benchmarks
Example:
* Multi-layer perceptron
* Convolutional Neural Networks
* Misc Elemwise operations
Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
* EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
* numexpr: similar to Theano, 'virtual machine' for elemwise expressions
**Multi-Layer Perceptron**:
60x784 matrix times 784x500 matrix, tanh, times 500x10 matrix, elemwise, then all in reverse for backpropagation
.. image:: ../hpcs2011_tutorial/pics/mlp.png
**Convolutional Network**:
......@@ -353,12 +372,12 @@ Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filter, elementwise
tanh, matrix multiply, softmax elementwise, then in reverse
.. image:: ../hpcs2011_tutorial/pics/conv.png
**Elemwise**
* All on CPU
* Solid blue: Theano
* Dashed Red: numexpr (without MKL)
.. image:: ../hpcs2011_tutorial/pics/multiple_graph.png
......@@ -51,9 +51,9 @@ copyright = '2008--2011, LISA lab'
# other places throughout the built documents.
#
# The short X.Y version.
version = '0.4'
version = '0.4.1'
# The full version, including alpha/beta/rc tags.
release = '0.4.0'
release = '0.4.1rc1'
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
......
......@@ -51,7 +51,7 @@ You will need to commit the previous changes, tag the resulting version, and
push that into the original repository. The syntax is something like the
following::
hg commit -m"modifications for 0.X release" setup.py doc/conf.py NEWS.txt
hg commit -m"modifications for 0.X release" setup.py doc/conf.py NEWS.txt HISTORY.txt theano/configdefaults.py doc/library/config.txt
hg tag 0.X
hg push
......
......@@ -245,7 +245,7 @@ import theano and print the config variable, as in:
.. attribute:: config.warn.ignore_bug_before
String value: 'None', 'all', '0.3', '0.4'
String value: 'None', 'all', '0.3', '0.4', '0.4.1'
Default: 'None'
......
......@@ -98,7 +98,7 @@ In order to compute the Jacobian of some function ``y`` with respect to some
parameter ``x``, we need to use ``scan``: we loop over the entries in ``y``
and compute the gradient of ``y[i]`` with respect to ``x`` (see the sketch
after the note below).
.. node::
.. note::
``scan`` is a generic op in Theano that allows writing all kinds of recurrent
equations in a symbolic manner. While in principle, creating
......
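A minimal sketch of that recipe (the function and names are only illustrative)::

    import theano
    import theano.tensor as T

    x = T.vector('x')
    y = x ** 2                                   # elementwise, so the Jacobian is diagonal

    # loop over the entries of y, taking the gradient of each one w.r.t. x
    J, updates = theano.scan(lambda i, y, x: T.grad(y[i], x),
                             sequences=T.arange(y.shape[0]),
                             non_sequences=[y, x])

    f = theano.function([x], J, updates=updates)
    # f([1., 2.]) -> [[2., 0.], [0., 4.]]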
......@@ -32,6 +32,7 @@ you out.
symbolic_graphs
modes
aliasing
loop
using_gpu
pycuda
shape_info
......
.. _tutloop:
====
Loop
====
You can use :ref:`Scan <lib_scan>` to express all types of loops in Theano. For now, all of its documentation is in the library section.
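For example, a minimal sketch of such a loop, computing ``A**k`` by repeated elementwise multiplication (the names are illustrative)::

    import theano
    import theano.tensor as T

    k = T.iscalar('k')
    A = T.vector('A')

    # start from ones_like(A) and multiply by A at every step
    result, updates = theano.scan(fn=lambda prior, A: prior * A,
                                  outputs_info=T.ones_like(A),
                                  non_sequences=A,
                                  n_steps=k)

    power = theano.function([A, k], result[-1], updates=updates)
    # power([2., 3.], 4) -> [16., 81.]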
......@@ -47,8 +47,8 @@ AUTHOR_EMAIL = "theano-dev@googlegroups.com"
PLATFORMS = ["Windows", "Linux", "Solaris", "Mac OS-X", "Unix"]
MAJOR = 0
MINOR = 4
MICRO = 0
SUFFIX = "" # Should be blank except for rc's, betas, etc.
MICRO = 1
SUFFIX = "rc1" # Should be blank except for rc's, betas, etc.
ISRELEASED = False
VERSION = '%d.%d.%d%s' % (MAJOR, MINOR, MICRO, SUFFIX)
......
......@@ -814,7 +814,7 @@ def insert_deepcopy(env, wrapped_inputs, wrapped_outputs):
assert len(wrapped_inputs) == len(env.inputs)
assert len(wrapped_outputs) == len(env.outputs)
reason = "insert_deepcopy"
updated_env_inputs = [env_i for i, env_i in zip(wrapped_inputs, env.inputs) if getattr(i, 'update', False)]
# We can't use env.inputs as it does not include Constant values.
......@@ -830,9 +830,11 @@ def insert_deepcopy(env, wrapped_inputs, wrapped_outputs):
# and not(wrapped_outputs[i].borrow and wrapped_outputs[j].borrow):
if env.outputs[j] in views_of_output_i:
if wrapped_outputs[i].borrow and wrapped_outputs[j].borrow:
env.change_input('output',i, view_op(env.outputs[i]))
env.change_input('output',i, view_op(env.outputs[i]),
reason=reason)
else:
env.change_input('output', i, deep_copy_op(env.outputs[i]))
env.change_input('output', i, deep_copy_op(env.outputs[i]),
reason=reason)
copied = True
break
......@@ -850,16 +852,20 @@ def insert_deepcopy(env, wrapped_inputs, wrapped_outputs):
if input_j in env.inputs:
j = env.inputs.index(input_j)
if wrapped_outputs[i].borrow and wrapped_inputs[j].borrow:
env.change_input('output',i, view_op(env.outputs[i]))
env.change_input('output',i, view_op(env.outputs[i]),
reason="insert_deepcopy")
break
else:
env.change_input('output', i, deep_copy_op(env.outputs[i]))
env.change_input('output', i, deep_copy_op(env.outputs[i]),
reason="insert_deepcopy")
break
elif wrapped_outputs[i].borrow:
env.change_input('output',i, view_op(env.outputs[i]))
env.change_input('output',i, view_op(env.outputs[i]),
reason="insert_deepcopy")
break
else:
env.change_input('output', i, deep_copy_op(env.outputs[i]))
env.change_input('output', i, deep_copy_op(env.outputs[i]),
reason="insert_deepcopy")
break
NODEFAULT = ['NODEFAULT']
......
......@@ -223,7 +223,7 @@ AddConfigVar('numpy.seterr_invalid',
###
AddConfigVar('warn.ignore_bug_before',
"If 'None', we warn about all Theano bugs found by default. If 'all', we don't warn about Theano bugs found by default. If a version, we print only the warnings relative to Theano bugs found after that version. Warning for specific bugs can be configured with specific [warn] flags.",
EnumStr('None', 'all', '0.3','0.4', allow_override=False),
EnumStr('None', 'all', '0.3','0.4', '0.4.1',allow_override=False),
in_c_key=False)
default_0_3 = True
......
......@@ -283,7 +283,10 @@ class TestComputeTestValue(unittest.TestCase):
n_steps=k)
assert False
except ValueError, e:
assert e.message.startswith("shape mismatch")
# The first message is for numpy before 1.6
# The second is a new message in numpy 1.6
assert (e.message.startswith("shape mismatch") or
e.message.startswith("operands could not be broadcast together with shapes"))
finally:
theano.config.compute_test_value = orig_compute_test_value
......
......@@ -84,11 +84,6 @@ class IfElseIfElseIf(PureOp):
class NotImplementedOp(PureOp):
class E(Exception): pass
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
def make_node(self, x):
return Apply(self, [x], [x.type()])
def make_thunk(self, node, storage_map, compute_map, no_recycling):
......
......@@ -65,7 +65,9 @@ def execute(execute=True, verbose=True):
t1=time.time()
if verbose and execute:
print
print 'this execution time took %.2fs'%(t1-t0)
print 'This execution took %.2fs'%(t1-t0)
print
print 'Try to run this script a few times. Experience shows that the first run is not as fast as the following ones. The difference is not big, but it is consistent.'
return t1-t0
......@@ -103,7 +105,7 @@ if __name__ == "__main__":
* manually compiled numpy and ATLAS with 2 threads
* goto 1.26 with 1, 2, 4 and 8 threads.
* goto2 1.13 compiled with multiple threads enabled.
Xeon Xeon Xeon Core2 i7 i7 Xeon Xeon
lib/nb threads E5345 E5430 E5450 E8500 930 950 X5560 X5550
......@@ -139,6 +141,8 @@ if __name__ == "__main__":
M2070/3.2 0.32s
GTX470/3.0 0.34s
GTX285/3.0 0.40s
C1060/3.2 0.46s
GTX550Ti/4.0 0.57s
GT220/3.2RC 3.80s
8500GT/3.0 10.68s
"""
......
......@@ -19,7 +19,7 @@ if not theano.misc.pycuda_init.pycuda_available:
if cuda_ndarray.cuda_available == False:
from nose.plugins.skip import SkipTest
raise SkipTest('Optional package cuda disabled')
raise SkipTest('Optional theano package cuda disabled')
import pycuda
import pycuda.driver as drv
......
......@@ -424,6 +424,9 @@ def pydotprint(fct, outfile=None,
file to which the name of the scan op is concatenated and
the index in the toposort of the scan.
This index can be printed in the graph with the option with_ids.
:param var_with_name_simple: If true and a variable has a name,
we print only the variable name.
Otherwise, we concatenate the type to the variable name.
In the graph, boxes are Apply nodes (the execution of an op) and ellipses are variables.
If variables have names, they are used as the text (if multiple variables share the same name, they will be merged in the graph).
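A minimal usage sketch (the compiled function here is just an example)::

    import theano
    import theano.tensor as T
    from theano.printing import pydotprint

    x = T.vector('x')
    f = theano.function([x], T.exp(x) + 1)

    # write the graph of f to a file, using only variable names where available
    pydotprint(f, outfile='f_graph.png', var_with_name_simple=True)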
......
"""
Optimizations to specialize gemm -> ger are not written.
The SciPy implementation is not written.
We need to call scipy.linalg.blas.[cf]blas.[sdcz]ger here in order not to lose speed against the old Outer op.
Here is the scipy signature: ger(alpha,x,y,incx=1,incy=1,a=0.0,overwrite_x=1,overwrite_y=1,overwrite_a=0)
http://www.scipy.org/doc/api_docs/SciPy.lib.blas.info.html
C implementation is not written.
Tests are not written.
"""
class GER(Op):
"""
General rank-1 update
A <- A + a * x * y'
For matrix A, vectors x, y, and scalar a.
"""
def __init__(self, inplace):
self.inplace = bool(inplace)
if self.inplace:
self.destroy_map = {0: [0]}
def __hash__(self):
return hash((type(self), self.inplace))
def __eq__(self, other):
return type(self) == type(other) and self.inplace == other.inplace
def make_node(self, *inputs):
inputs = map(as_tensor_variable, inputs)
A, a, x, y = inputs
nx = x.type.ndim
ny = y.type.ndim
if nx != 1: raise TypeError('non-vector arg0 to outer()', x)
if ny != 1: raise TypeError('non-vector arg1 to outer()', y)
if A.dtype != a.dtype:
raise TypeError('dtype mismatch', (A.dtype, a.dtype))
if A.dtype != x.dtype:
raise TypeError('dtype mismatch', (A.dtype, x.dtype))
if A.dtype != y.dtype:
raise TypeError('dtype mismatch', (A.dtype, y.dtype))
return Apply(self, inputs, [A.type()])
def perform(self, node, inp, out):
A, a, x, y = inp
if not self.inplace:
A = A.copy()
A += a * numpy.outer(x, y)
out[0][0] = A
# grad not needed because this is put in during optimization
def __str__(self):
return "GER"
......@@ -297,6 +297,14 @@ def scan( fn
loop are done (see ``theano.function`` for details about
possible values and their meaning).
:param profile:
Flag or string. If true, or different from the empty string, a
profile object will be created and attached to the inner graph of
scan. In case ``profile`` is True, the profile object will have the
name of the scan instance, otherwise it will have the passed string.
The profile object collects (and prints) information only when the inner
graph is run with the new cvm linker (with the default modes or other
linkers, this argument is useless). A short example follows below.
:rtype: tuple
:return: tuple of the form (outputs, updates); ``outputs`` is either a
......
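A minimal sketch of passing that argument (the loop itself is only an example)::

    import theano
    import theano.tensor as T

    x = T.vector('x')
    k = T.iscalar('k')

    # name the profile so its output is easy to find when using the cvm linker
    result, updates = theano.scan(fn=lambda prior, x: prior + x,
                                  outputs_info=T.zeros_like(x),
                                  non_sequences=x,
                                  n_steps=k,
                                  profile='repeated_add')

    f = theano.function([x, k], result[-1], updates=updates)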
......@@ -17,17 +17,16 @@ import numpy
import sys
import theano
from theano import tensor, scalar
from theano.tensor import opt, TensorType, get_constant_value
from theano import tensor
from theano.tensor import opt, get_constant_value
from theano import gof
from theano.compile import optdb
from theano.gof.opt import EquilibriumOptimizer
from theano import config
from theano.compile.function_module import deep_copy_op
import scan_op
import scan_utils
from scan_utils import clone, equal_computations, find_up, scan_args
from scan_utils import equal_computations, find_up, scan_args
from theano.gof.opt import pre_constant_merge, pre_greedy_local_optimizer
# Logging function for sending warning or info
......@@ -454,12 +453,12 @@ class ScanSaveMem(gof.Optimizer):
if i > op.n_mit_mot:
try:
length = shape_of[out][0]
except:
except Exception:
length = node.inputs[0] + init_l[i]
else:
try:
length = shape_of[out][0]
except:
except Exception:
length = out.shape[0]
cf_slice = tensor.basic.get_canonical_form_slice(
this_slice[0], length)
......@@ -556,7 +555,7 @@ class ScanSaveMem(gof.Optimizer):
else:
try:
length = shape_of[out][0]
except:
except Exception:
length = out.shape[0]
cf_slice = tensor.basic.get_canonical_form_slice(
this_slice[0],length)
......@@ -1124,5 +1123,3 @@ scan_seqopt.register('scanOp_merge_inouts',
3,
'fast_run',
'scan')
......@@ -158,7 +158,7 @@ def as_tensor_variable(x, name=None, ndim=None):
except TypeError:
try:
str_x = str(x)
except:
except Exception, e:
str_x = repr(x)
raise TypeError("Cannot convert %s to TensorType" % str_x, type(x))
......@@ -340,7 +340,7 @@ def constant_or_value(x, rtype, name=None, ndim=None, dtype=None):
else:
# leave the shape out of the type
return rtype(TensorType(dtype = x_.dtype, broadcastable = bcastable), x_, name=name)
except:
except Exception:
raise TypeError("Could not convert %s to TensorType" % x, type(x))
def constant(x, name=None, ndim=None, dtype=None):
......@@ -425,7 +425,7 @@ def get_constant_value(v):
try:
numpy.complex(data) #works for all numeric scalars
return data
except:
except Exception:
raise TypeError('v.data is non-numeric, non-scalar, or has more than one unique value', v)
if v.owner:
if isinstance(v.owner.op, Alloc):
......@@ -1361,7 +1361,7 @@ class TensorConstantSignature(tuple):
return False
try:
(t0, d0), (t1,d1) = self, other
except:
except Exception, e:
return False
#N.B. compare shape to ensure no broadcasting in ==
if t0 != t1 or d0.shape != d1.shape:
......@@ -1994,7 +1994,7 @@ def max(x, axis='DEFAULT'):
try:
const = get_constant_value(axis)
return CAReduce(scal.maximum,list(const))(x)
except:
except Exception:
return max_and_argmax(x,axis)[0]
@constructor
......@@ -2873,7 +2873,7 @@ def extract_constant(x):
'''
try:
x = get_constant_value(x)
except:
except Exception:
pass
if isinstance(x, scal.ScalarVariable):
if x.owner and isinstance(x.owner.op, ScalarFromTensor):
......@@ -4398,7 +4398,7 @@ class Reshape(Op):
', should be %i' % (len(shp), self.ndim), shp)
try:
out[0] = numpy.reshape(x, shp)
except:
except Exception, e:
raise ValueError('Cannot reshape input of shape %s to shape %s' % (x.shape,shp))
def grad(self, inp, grads):
x, shp = inp
......@@ -4593,7 +4593,7 @@ class ARange(Op):
try:
v = get_constant_value(var)
return numpy.all(v == value)
except:
except Exception:
pass
return False
......
......@@ -12,7 +12,7 @@ class ConvTransp3D(theano.Op):
return hash(type(self))
def c_code_cache_version(self):
return (1,)
return (2,)
def make_node(self, W, b, d, H, RShape = None):
"""
......@@ -266,11 +266,11 @@ class ConvTransp3D(theano.Op):
for (int i = 0; i < batchSize; i++) {
for (int r = 0; r < videoHeight; r++) {
const int frc = std::max(0.0, ceil(float(r-filterHeight+1)/float(dr)));
const int frc = (int)std::max(0.0f, ceilf(float(r-filterHeight+1)/float(dr)));
for (int c = 0; c < videoWidth; c++) {
const int fcc = std::max(0.0, ceil(float(c-filterWidth +1)/float(dc)));
const int fcc = (int)std::max(0.0f, ceilf(float(c-filterWidth +1)/float(dc)));
for (int t = 0; t < videoDur; t++) {
const int ftc = std::max(0.0, ceil(float(t-filterDur +1) /float(dt)));
const int ftc = (int)std::max(0.0f, ceilf(float(t-filterDur +1) /float(dt)));
long long Rpost = i * %(R)s->strides[0] + r * %(R)s->strides[1] + c * %(R)s->strides[2] + t * %(R)s->strides[3];
......
......@@ -117,7 +117,7 @@ def Rop(f, wrt, eval_points):
return rval
def Lop(f, wrt, eval_points, consider_constant=[], warn_type=False,
def Lop(f, wrt, eval_points, consider_constant=None, warn_type=False,
disconnected_inputs='raise'):
"""
Computes the L operation on `f` wrt to `wrt` evaluated at points given
......@@ -140,6 +140,8 @@ def Lop(f, wrt, eval_points, consider_constant=[], warn_type=False,
indices that specify both the position within a list and all
coordinates of the tensor element in the last
"""
if consider_constant is None:
consider_constant = []
if not isinstance(f, TensorVariable):
raise TypeError('In tensor.Lop(), cost argument should be a TensorVariable.', f)
......@@ -155,7 +157,6 @@ def Lop(f, wrt, eval_points, consider_constant=[], warn_type=False,
list(inputs) + list(consider_constant),
warn_type=warn_type)
# Note : If p is not in gmap there can be several reasons, among which
# is the fact that p might not be part of the computational graph. A
# simple example is that for a+b for e.g. a[0] is not part of the graph,
......@@ -196,7 +197,7 @@ def Lop(f, wrt, eval_points, consider_constant=[], warn_type=False,
# Gradient
#########################
def grad(cost, wrt, g_cost=None, consider_constant=[], warn_type=False,
def grad(cost, wrt, g_cost=None, consider_constant=None, warn_type=False,
disconnected_inputs='raise'):
"""
:type cost: Scalar (0-dimensional) `Variable`
......@@ -228,6 +229,9 @@ def grad(cost, wrt, g_cost=None, consider_constant=[], warn_type=False,
`theano.gradient.grad_sources_inputs``.
"""
if consider_constant is None:
consider_constant = []
if not isinstance(cost, TensorVariable):
raise TypeError('In tensor.grad(), cost argument should be a TensorVariable.', cost)
......
......@@ -7,6 +7,7 @@ import unittest
from nose.plugins.skip import SkipTest
import numpy
from numpy.testing import dec
from numpy.testing.noseclasses import KnownFailureTest
from theano.tensor import *
from theano.tensor import basic as tensor # for hidden symbols
......@@ -4736,6 +4737,22 @@ class test_arithmetic_cast(unittest.TestCase):
config.int_division == 'floatX'):
assert theano_dtype == config.floatX
continue
if (cfg == 'numpy+floatX' and
a_type == 'complex128' and
b_type == 'float32' and
combo == ('scalar', 'array') and
numpy.__version__ == '1.6.0' and
theano_dtype == 'complex128' and
numpy_dtypes == ['complex64',
'complex64']):
# In numpy 1.6.0 adding a complex128 with
# a float32 may result in a complex64. This
# may be a bug (investigation is currently
# in progress), so in the meantime we just
# mark this test as a known failure.
raise KnownFailureTest('Known issue with '
'numpy 1.6.0, see #761')
# In any other situation: something wrong is
# going on!
assert False
......
......@@ -82,7 +82,7 @@ class test_Broadcast(unittest.TestCase):
self.assertTrue((f(xv, yv) == zv).all())
#test CAReduce.infer_shape
#test Elemwise.infer_shape
#the Shape op doesn't implement c_code!
if isinstance(linker,gof.PerformLinker):
x = TensorType('float64', [(entry == 1) for entry in xsh])('x')
......@@ -111,7 +111,7 @@ class test_Broadcast(unittest.TestCase):
f(xv, yv)
self.assertTrue((xv == zv).all())
#test CAReduce.infer_shape
#test Elemwise.infer_shape
#the Shape op doesn't implement c_code!
if isinstance(linker,gof.PerformLinker):
x = TensorType('float64', [(entry == 1) for entry in xsh])('x')
......