Commit ccb85d8c authored by james@X40

merge
machinery builds a DAG (Directed Acyclic Graph) representing the
computation, a graph that theano can compile and optimize.
Automatic wrapping
------------------

All nodes in the graph must be instances of ``Apply`` or ``Result``, but
``<Op subclass>.make_node()`` typically wraps constants to satisfy those
constraints. For example, the :ref:`tensor.add` op instance is written
so that:

.. code-block:: python

    e = scalar('x') + 1

builds the following graph:

.. code-block:: python

    node = Apply(op=add,
                 inputs=[Result(type=float64_scalar, name='x'),
                         Constant(type=int64_scalar, data=1)],
                 outputs=[Result(type=float64_scalar)])
    e = node.outputs[0]

Graph Structures
================
'''C code is actually generated this way. Could be refreshed as developer documentation. Olivier to review. 20080904.'''
Here is a proposal on the interface to generate C code:
== What will be passed to C ==
For each ResultBase, the C code gets a variable called storage_<name> which contains a PyObject* pointing to a 1-element list (a sort of cell). That is the "channel" via which C and Python can communicate data. Of course, the C code will not manipulate that directly. At every execution of the C function, the PyObject* inside the storage is extracted and given the name py_<name> (its reference count is handled automatically).
== Extracting the data for use with C ==
In ResultBase, we have several methods to generate C code for particular purposes. They should return templated strings of C code (see below) but should not actually fill the template. The caller will fill it.
List of template variables you can use:
* '''%(name)s:''' Will be filled in by a mangled name representing this ResultBase.
* '''%(fail)s:''' This can be inserted in the code to make the current function fail. It will proceed to clean up everything that needs to be cleaned up. This cannot be used in any cleanup routine (and hence it is forbidden for a cleanup routine to fail!). If a code block uses %(fail)s, its corresponding cleanup block will be called first, so make sure that the cleanup can be done properly at any point where you use %(fail)s, even if you didn't allocate or INCREF everything yet.
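The template-filling contract above can be sketched in plain Python: the templates are ordinary %-style format strings, and the caller substitutes the mangled name and the failure snippet. The template strings and the name `V0` below are hypothetical illustrations, not the real compiler's output.

```python
# Sketch (hypothetical names): how a caller might fill a ResultBase's
# C-code templates.  %(name)s becomes the mangled variable name and
# %(fail)s becomes the failure/goto snippet for the current block.
declare_template = "double %(name)s;"               # e.g. from c_declare
extract_template = ("%(name)s = PyFloat_AsDouble(py_%(name)s);\n"
                    "if (PyErr_Occurred()) %(fail)s")  # e.g. from c_extract

filled_declare = declare_template % {"name": "V0"}
filled_extract = extract_template % {
    "name": "V0",
    "fail": "{failure = 1; goto label_1;}",
}

print(filled_declare)
print(filled_extract)
```

Note that the ResultBase only ever emits the template; which block number and label the %(fail)s expands to is decided later by whoever assembles the full function.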
List of methods in ResultBase:
'''c_declare:''' This method returns code that declares one or more variables ''without'' initializing them. These are the variables that all C code using this ResultBase will use to manipulate the data. The code should ''only'' declare variables and typedefs (no #defines, but a future extension might address this). Example: if we have a ResultBase representing a double, c_declare may simply return "double %(name)s;". ''All'' variables declared should contain the %(name)s template, but they may prefix or suffix it.
'''c_init:''' This method returns code that initializes (zeros/sets to NULL, typically) the variables declared in c_declare.
'''c_extract:''' This method should manipulate py_<name> to set the values of the variables declared by c_declare. For example, if we have a ResultBase representing a double, c_extract might return "%(name)s = PyFloat_AsDouble(py_%(name)s);" (plus error checking!). If something is wrong with the data provided from Python, c_extract should set an informative error message and insert %(fail)s.
'''c_sync:''' This method should adjust the py_<name> variable using the values of the variables declared by c_declare. For example, if we have a ResultBase representing a double, c_sync might return "Py_XDECREF(py_%(name)s); py_%(name)s = PyFloat_FromDouble(%(name)s);". The result will then be made accessible from Python. c_sync is not allowed to fail, even though it is not cleanup code per se.
'''c_cleanup:''' This method should clean up all the variables declared by c_declare.
Important notes:
* ''Either'' c_init or c_extract will be called. The former for temporary variables and outputs, the latter for inputs. If the former is used, py_<name> will be set to Py_None regardless of what is in storage_<name>.
* c_sync will only be called on the outputs, not on inputs or temporaries.
* c_cleanup will ''always'' be called. If c_sync decides to relay some data to Python (thus ousting it from the op's scope), it should NULL any pointers that c_cleanup is not allowed to free.
== Manipulating the data from C ==
The Op class has in turn several methods that generate C code. As for ResultBase, they should return templated strings of C code (see below) but should not actually fill the template. The caller will fill it.
List of template variables you can use:
* '''%(<variable_name>)s:''' See c_var_names. These will be substituted for mangled names.
* '''%(fail)s:''' This can be inserted in the code to make the current function fail. It will proceed to cleanup everything that needs to be cleaned up. This cannot be used in any cleanup routine (and hence it is forbidden for a cleanup routine to fail!). If a code block uses %(fail)s, its corresponding cleanup block will be called first, so make sure that the cleanup can be done properly at any point where you use %(fail)s, even if you didn't allocate or INCREF everything yet.
'''c_var_names''': This method should return two lists, one list of strings representing the input names and one list of strings representing the output names. The actual names might be mangled by the compiler. In the template strings returned by the next few methods, you can use the names defined here. For example, if op.c_var_names() returns [['x', 'y'], ['z']], then "%(x)s" in op's templates will be the same as "%(name)s" in op.inputs[0]'s templates. This means that all the variables declared by the inputs and outputs can easily be used in the op's templates.
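The correspondence between an op's variable names and its results' mangled names can be sketched as a substitution dictionary. The helper name and the mangled names (`V0`, `V1`, `V2`) below are hypothetical; the point is only that "%(x)s" in the op's templates and "%(name)s" in op.inputs[0]'s templates expand to the same string.

```python
# Sketch (hypothetical helper): build the template dict for an Op from
# c_var_names() and the mangled names of its inputs and outputs.
def make_sub_dict(var_names, mangled_inputs, mangled_outputs):
    input_names, output_names = var_names
    sub = {}
    for var, mangled in zip(input_names + output_names,
                            mangled_inputs + mangled_outputs):
        sub[var] = mangled
    return sub

# op.c_var_names() -> [['x', 'y'], ['z']]
sub = make_sub_dict([['x', 'y'], ['z']], ['V0', 'V1'], ['V2'])

# '%(x)s' in the op's templates now means the same as '%(name)s'
# in op.inputs[0]'s templates (both expand to 'V0').
code = "%(z)s = %(x)s + %(y)s;" % sub
```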
'''c_validate_update''': This method should return code that ensures that the inputs are valid for processing by this Op (checking shapes, bounds, etc.). If anything is invalid, it should set an informative error message and use %(fail)s. Then, it should prepare the outputs: for example, if the output is a tensor, allocate a tensor, resize it appropriately and place it in the appropriate variable (see c_var_names).
'''c_validate_update_cleanup''': This method should clean up any temporary storage used by c_validate_update. It is not forbidden to do it in c_validate_update itself, but this can come in handy.
'''c_code''': This is the meat of the Op that actually calculates the function. If an error occurs in the process, it may use %(fail)s. It should work in place on the variables declared by its inputs and outputs and rely on their c_sync routines to relay the results to Python.
'''c_code_cleanup''': This cleans up any temporary structures allocated by c_code.
'''c_is_simple (field)''': Class field. Defaults to False. It is basically a compiler hint that this class represents a builtin C type or a small struct, so we can optimize its access.
Important notes:
* There might be provisions in the future to skip the validate_update step if the Op can guarantee that the inputs are valid and the outputs are set up properly.
* It is not forbidden to just put the validate_update code in c_code. Some situations might require it, but it helps organization to segregate them.
== Failure ==
Besides cleanup code, all code has access to the %(fail)s template. For three code blocks, the generated C code will pretty much look like this:
{{{
int failure = 0;
{
<code1>
{
<code2>
{
<code3>
label3:
<cleanup3>
}
label2:
<cleanup2>
}
label1:
<cleanup1>
}
return failure;
}}}
And %(fail)s in the nth code block will take the value "{failure = n; goto label<n>;}". This means only the blocks executed up to the failure point are cleaned up and the return value indicates which block failed, which is handy for debugging.
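The nesting scheme can be sketched as a small generator. This is a hypothetical helper, not the actual compiler: it wraps each code block in braces, substitutes %(fail)s in the nth block with "{failure = n; goto label<n>;}", and emits cleanups in reverse order, so a failure only runs the cleanup of blocks that were actually entered.

```python
# Sketch (hypothetical helper) of the nested code/cleanup skeleton.
def emit_nested(blocks):
    """blocks: list of (code_template, cleanup_code) pairs."""
    out = ["int failure = 0;"]
    # opening: wrap each block in braces and fill its %(fail)s
    for n, (code, _) in enumerate(blocks, start=1):
        out.append("{")
        out.append(code % {"fail": "{failure = %i; goto label%i;}" % (n, n)})
    # closing: labels and cleanups run innermost-first
    for n in range(len(blocks), 0, -1):
        out.append("label%i:" % n)
        out.append(blocks[n - 1][1])
        out.append("}")
    out.append("return failure;")
    return "\n".join(out)

src = emit_nested([("<code1 uses %(fail)s>", "<cleanup1>"),
                   ("<code2>", "<cleanup2>")])
print(src)
```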
When compiling an Op, we want to sync the outputs so we can get the results from Python. In case of failure, we will not necessarily want to sync. Because of that, typical code will look like this:
{{{
int failure = 0;
<declare input>
<declare output>
{
<extract input>
{
<extract output>
{
<perform>
label3:
<clean up perform>
}
label2:
if (!failure)
<sync output>
<clean up output>
}
label1:
<clean up input>
}
return failure;
}}}
Furthermore, it is not necessary to extract the output because we mean to overwrite it anyway. In that case, <extract output> will be a no-op, but of course we may still need to clean up or sync what <perform> will put in the declared outputs.
== Example ResultBase ==
The following ResultBase represents a double (we only care about the C part).
{{{
class Double(ResultBase):
<snip>
def c_declare(self):
return "double %(name)s;"
def c_init(self):
return "%(name)s = 0.0;"
def c_extract(self):
return "%(name)s = PyFloat_AsDouble(py_%(name)s);"
def c_cleanup(self):
return "" # nothing to do
def c_sync(self):
return "Py_XDECREF(py_%(name)s); py_%(name)s = PyFloat_FromDouble(%(name)s);"
}}}
== Example Op ==
The following Op represents addition of two nonnegative doubles (we only care about the C part).
{{{
class Add(Op):
<snip>
def c_var_names(self):
return [['x', 'y'], ['z']]
def c_validate_update(self):
return "if (%(x)s < 0 || %(y)s < 0) %(fail)s" # fail if x or y is negative
def c_validate_update_cleanup(self):
return "" # nothing to do
def c_code(self):
return "%(z)s = %(x)s + %(y)s;"
def c_code_cleanup(self):
return "" # nothing to do
}}}
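Filling the templates from the two examples above by hand shows how the pieces combine into the body of the generated function. This is a sketch; the unmangled names x, y, z are chosen for readability and the template strings are copied from the examples.

```python
# Sketch: fill the Double and Add templates from the examples above.
double_declare = "double %(name)s;"
double_extract = "%(name)s = PyFloat_AsDouble(py_%(name)s);"
double_sync = ("Py_XDECREF(py_%(name)s); "
               "py_%(name)s = PyFloat_FromDouble(%(name)s);")
add_code = "%(z)s = %(x)s + %(y)s;"

lines = [double_declare % {"name": n} for n in ("x", "y", "z")]   # declare
lines += [double_extract % {"name": n} for n in ("x", "y")]       # extract inputs
lines.append(add_code % {"x": "x", "y": "y", "z": "z"})           # perform
lines.append(double_sync % {"name": "z"})                          # sync output
print("\n".join(lines))
```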
== Generating a C function ==
For the example Op, the generated C function will typically look like this:
{{{
int add(PyObject* storage_x, PyObject* storage_y, PyObject* storage_z) {
PyObject* py_x = PyList_GET_ITEM(storage_x, 0); Py_XINCREF(py_x); // automatic
PyObject* py_y = PyList_GET_ITEM(storage_y, 0); Py_XINCREF(py_y); // automatic
PyObject* py_z = Py_None; // we don't care what's currently in storage_z
int failure = 0;
double x; // x.c_declare
double y; // y.c_declare
double z; // z.c_declare
{
x = PyFloat_AsDouble(py_x); // x.c_extract
{
y = PyFloat_AsDouble(py_y); // y.c_extract
{
// we don't need to use z.c_extract
{
if (x < 0 || y < 0) { // add.validate_update
// This is automatically inserted in place of %(fail)s
failure = 4;
goto label_add_validate_update_cleanup;
}
{
z = x + y; // add.c_code
label_add_code_cleanup:
}
label_add_validate_update_cleanup:
}
label_z_sync_or_cleanup:
if (!failure) {
Py_XDECREF(py_z); // z.c_sync
py_z = PyFloat_FromDouble(z); // z.c_sync, the result is now available from Python!
PyList_SET_ITEM(storage_z, 0, py_z); // always done after _.c_sync
}
Py_XDECREF(py_z); // always done after _.c_cleanup
}
label_y_cleanup:
Py_XDECREF(py_y); // always done after _.c_cleanup
}
label_x_cleanup:
Py_XDECREF(py_x); // always done after _.c_cleanup
}
return failure;
}
}}}
== Generating a C struct ==
To accelerate processing a tad, a struct can be generated instead of a function. The struct will keep pointers to the storage from which to fetch inputs and into which to store outputs, and it will also store the fields declared by the c_declare methods of the outputs and temporaries.
Here is a sketch of the struct equivalent of the previous function:
{{{
struct add {
PyObject* storage_x;
PyObject* storage_y;
PyObject* storage_z;
double z; // z.c_declare
void init(PyObject* storage_x, PyObject* storage_y, PyObject* storage_z) {
<set the struct members of the same names>
<init the struct members corresponding to z>
}
void cleanup(void) {
<cleanup z>
}
void run(void) {
<same code as before minus z's cleanup>
}
add() { this->init(); }
~add() { this->cleanup(); }
};
}}}
Advantages of using a struct:
* Can be run several times even if we provide the storage only once.
* Output variables or temporary variables can reuse what they allocated the last time. This is not particularly useful with doubles (in fact it might be detrimental), but if z was a large tensor it might be interesting to recycle the memory over thousands of runs of the Op.
No struct members will be made if a result's c_is_simple field is True. They will be allocated on the stack instead.
'''Stale specification page. Upgrade this to provide useful developer doc. 2008.09.04'''
== Definitions ==
The elementwise compiler takes inputs {{{(in0, in1, in2, ...)}}}, outputs {{{(out0, out1, out2, ...)}}}, broadcast modes {{{(mod0, mod1, mod2, ...)}}} where each mode corresponds to an output as well as {{{order}}} which determines if we broadcast/accumulate over the first or last dimensions (the looping order, basically, but some operations are only valid for one particular order!).
The broadcast mode serves to calculate the rank of the corresponding output and how to map each input element to an output element:
* {{{broadcast}}}
* output.rank = max(input.rank)
* the inputs of lesser rank are broadcasted over missing dimensions
* if {{{order == f}}} ([3, 5], [5]) => [3, 5] or ([7, 8, 9], [8, 9]) => [7, 8, 9]
* if {{{order == c}}} ([3, 5], [3]) => [3, 5] or ([7, 8, 9], [7, 8]) => [7, 8, 9]
* {{{(accumulate, Accumulator)}}}
* output.rank = min(input.rank)
* for the inputs of greater rank, we use Accumulator (sum, product, etc.) to accumulate over the first dimensions
* e.g. {{{if Accumulator == sum, order == c, x.rank == 2, y.rank == 1 and z = f(x, y) then z[i] = f(sum_j(x[i, j]), y[i])}}}
* if {{{order == f}}} ([3, 5], [5]) => [5] or ([7, 8, 9], [8, 9]) => [8, 9]
* if {{{order == c}}} ([3, 5], [3]) => [3] or ([7, 8, 9], [7, 8]) => [7, 8]
{{{order == c}}} is equivalent to transposing the outputs of an {{{order == f}}} operation on transposed inputs.
This does not cover all cases of broadcasting, but I believe they cover enough. Other cases of broadcasting can be emulated with proper transposition and/or slicing.
* Could you give some examples of what kinds of broadcasting are and are not covered by your proposed implementation?
* For rank <= 2, I think only operations of the form {{{add(ones(3,1), ones(1,3))}}} are missing. I actually didn't think of that one before now.
* In general, it only handles f(shape(head, ...), shape(head, ...), ...) and f(shape(..., tail), shape(..., tail), ...)
* Maybe I could add a general case later... the thing is that I think the ones I am considering here are easier to streamline.
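The shape rules above can be sketched in plain Python. The helper below is hypothetical: it computes the output shape for the two modes, checking that lesser-rank inputs align on the head ('c' order) or tail ('f' order) dimensions, as in the examples.

```python
# Sketch (hypothetical helper): output shape under the two modes.
# order 'f' aligns shapes on the trailing dimensions,
# order 'c' aligns them on the leading dimensions.
def out_shape(shapes, mode, order):
    big = max(shapes, key=len)
    for s in shapes:
        # each lesser-rank shape must match big on the aligned dims
        aligned = big[:len(s)] if order == "c" else big[len(big) - len(s):]
        assert tuple(aligned) == tuple(s), "incompatible shapes"
    # broadcast: rank of the largest input; accumulate: of the smallest
    return big if mode == "broadcast" else min(shapes, key=len)

b_f = out_shape([(3, 5), (5,)], "broadcast", "f")        # -> (3, 5)
a_f = out_shape([(7, 8, 9), (8, 9)], "accumulate", "f")  # -> (8, 9)
b_c = out_shape([(3, 5), (3,)], "broadcast", "c")        # -> (3, 5)
```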
Point of clarification: the order discussed here corresponds to a set of broadcasting rules, and is independent from the storage order. The 'f' order corresponds to numpy's broadcasting rules, while the 'c' order is something new and different (TODO VERIFY!)
Question: does it make sense to apply the order to the loop, or is this broadcast order something which will be local to each input argument. What happens when the elemwise compiler deals with more complex subgraphs with multiple inputs and outputs?
== The loop ==
Here is the loop for {{{order == c}}}. Check for errors!
{{{
<initialize iterators>
i1 = -1
while (++i1 < dim1) {
i2 = -1
rank_N-1_accumulator = init
while (++i2 < dim2) {
...
iN = -1
while (++iN < dimN) {
<accumulate rank N input>
<SET rank N output using broadcasted inputs>
<NEXT rank N iterator>
}
...
}
<SET rank 1 output using accumulated inputs>
<NEXT rank 1 iterator>
}
}}}
When {{{order == f}}}, the iterators ''ideally'' (but not necessarily) iterate in FORTRAN order, i.e. the while loops are on {{{dimN..dim1}}} instead of {{{dim1..dimN}}}.
{{{order}}} does __not__ represent the {{{C/F_CONTIGUOUS}}} flags of the inputs or outputs. Depending on combinations of those parameters, different loops will be used. If {{{order == f and C_CONTIGUOUS(array)}}}, for example, the loop will be on {{{dim1..dimN}}} and the matrices of lesser rank will need to be looped over several times.
An Optimizer should look at the operations in the graph and figure out whether to allocate C_CONTIGUOUS (ideal for {{{order == c}}}) or F_CONTIGUOUS (ideal for {{{order == f}}}) arrays.
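The {{{order == c}}} accumulate loop above can be written out in plain Python for the concrete case from the definitions section (Accumulator == sum, x.rank == 2, y.rank == 1, so z[i] = f(sum_j(x[i, j]), y[i])). This is a readability sketch of the loop structure, not generated code.

```python
# Sketch of the order == 'c' loop with a sum accumulator:
# z[i] = f(sum_j(x[i, j]), y[i])
def elemwise_accumulate_c(f, x, y):
    z = []
    for i in range(len(x)):            # outer loop over dim1
        acc = 0                        # rank N-1 accumulator init
        for j in range(len(x[i])):     # inner loop over dim2
            acc += x[i][j]             # accumulate rank-N input
        z.append(f(acc, y[i]))         # SET rank-1 output
    return z

x = [[1, 2, 3], [4, 5, 6]]
y = [10, 20]
z = elemwise_accumulate_c(lambda a, b: a + b, x, y)  # -> [16, 35]
```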
== Gradient ==
The input ranks become the output ranks and gradients of the same rank as the outputs are added to the input list. If an output was given mode {{{broadcast}}}, then all inputs used to calculate it had to be broadcasted to that shape, so we must sum over the broadcasted dimensions on the gradient. The mode that we give to those inputs is therefore {{{(accumulate, sum)}}}. Inversely, if an output was given mode {{{(accumulate, sum)}}}, then all inputs used to calculate it had to be summed over those dimensions. Therefore, we give them mode {{{broadcast}}} in grad. Other accumulators than sum might prove more difficult. For example, the ith gradient for product is grad*product/x_i. Not sure how to handle that automatically.
* I don't exactly follow this paragraph, but I think I catch the general idea and it seems to me like it will work very well.
* In a nutshell for {{{broadcast}}} I calculate the gradient as normal assuming the shape is broadcasted and then I sum over what I had to broadcast.
* Could you explain why the accumulator gradient (e.g. product) can be trickier?
* I thought about it and I figured that the general case is {{{g_accum[N-i+1], g_m[i] = grad_fn(accum[i-1], m[i], g_accum[N-i])}}} where {{{g_accum}}} is the accumulated gradient wrt the accumulator {{{accum}}}. It can be short-circuited in sum and product's case: for sum, grad_fn is the identity on its last argument so {{{g_m[i] == g_accum[i] == g_accum[0] == g_z for all i}}}. In product's case, {{{accum[i-1] == product(m[1:i-1]) and g_accum[N-i] == g_z * product(m[i+1:N])}}}, multiply them together and you obtain {{{g_z * product(m)/m[i]}}} where obviously we only need to compute {{{product(m)}}} once. It's worth handling those two special cases, for the general case I don't know.
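The product short-circuit claimed above can be checked numerically: running the general backward recurrence over the accumulator chain reproduces the closed form g_z * product(m) / m[i]. The helper below is an illustrative sketch with 1-element "tensors" as plain floats.

```python
# Sketch: gradient of z = product(m) via the accumulator recurrence.
# Forward: accum[i] = product of m[0..i-1], so accum[0] == 1.
# Backward: since accum[i] = accum[i-1] * m[i-1],
#   g_m[i]    = g_accum * accum[i]   (grad wrt the new factor)
#   g_accum  *= m[i]                 (grad flowing to the prefix)
def product_grads(m, g_z):
    accum = [1.0]
    for v in m:
        accum.append(accum[-1] * v)
    g_accum = g_z
    g_m = [0.0] * len(m)
    for i in range(len(m) - 1, -1, -1):
        g_m[i] = g_accum * accum[i]
        g_accum = g_accum * m[i]
    return g_m

m = [2.0, 3.0, 4.0]
g = product_grads(m, 1.0)
# closed form: g_z * product(m) / m[i] = [12.0, 8.0, 6.0]
```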
'''Seed of discussion for what an interactive debugging tool might look like. 2009.03.27.'''
== Interactive debugger ( #352 ) ==
The interactive debugger should allow the user to go step by step in a graph to debug it. It should allow setting breakpoints on arbitrary Ops or subgraphs. If we can group ops by the user's function that defined them, we could have a logical grouping of the graph into subgraphs.

The debugger should save the inputs at each step so the user loses no info through inplace operations. Ideally, the debugger should be a normal python shell enriched with commands to control the flow, and all the inputs should be made available so the user can use numpy interactively on them.
Command wishlist
* py_perform (perform the current operation using the python implementation)
* c_perform (perform the current operation using the C implementation)
* perform (use the Linker's preference)
* get_inputs (get the inputs of the current op)
* set_inputs (set the inputs of the current op)
* get_outputs (get the outputs of the current op)
* set_outputs (set the outputs of the current op (bypasses its perform))
* next (perform and go to the next breakpoint)
* breakpoint (set a breakpoint on the current Op or subgraph)
* step (perform and go to the next Op or subgraph)
* step_in (go to the first Op inside the current subgraph)
* step_out (exit the subgraph containing this Op)
* Of course, normal python shell functionality!
* The global context where the debugger was called (so the user can define his own helper functions, etc.)
A good, simple way to do it would be to have those commands as methods of a structure that would be returned by a DebugLinker. This would allow an interactive session like the following:
{{{
>>> a, b, c = Tensor(), Tensor(), Tensor()
>>> d = b * c
>>> e = a + d
>>> debug = DebugLinker(Env([a, b, c], [e])).make_function()
>>> debug.set_breakpoint(d)
>>> debug.debug(10, 20, 30) # a, b, c = 10, 20, 30
Now at: Mul(b, c)
Context: d = b * c
>>> debug.get_inputs() # we are at the node d = b * c
[20, 30]
>>> debug.get_outputs()
[None]
>>> debug.py_perform()
>>> debug.get_outputs()
[600]
>>> debug.step()
Now at: Add(a, Mul)
Context: e = a + d
>>> debug.get_inputs()
[30, 600]
>>> debug.step()
Finished.
[630]
>>>
}}}
''' This has been implemented (#182). 20090327.'''
= Random Numbers =
== Requirements ==
Theano functions sometimes need random numbers.
Random operations are not as simple as other operations such as ones_like, or pow(), because the output must be different when we call the same function repeatedly. CompileFunction's new default-valued, updatable input variables make this possible. At the same time we need random streams to be repeatable, and easy to work with. So the basic requirements of our random number mechanism are:
1. Internal random number generators must be used in a clear manner, and be accessible to the caller after a function has been compiled.
1. A random-number-producing Op (from now on: {{{RandomOp}}}) should generally produce exactly the same stream of random numbers regardless of any other {{{RandomOp}}} instances in its own graph, and any other times the graph was compiled.
1. A {{{RandomOp}}}'s stream should be isolated from other {{{RandomOp}}} instances in a compiled graph, so that it is possible to adjust any one {{{RandomOp}}} independently from the others.
1. It should be easy to put the {{{RandomOp}}}s in a graph into a state from which their outputs are all independent.
1. It should be easy to save the current state of the {{{RandomOp}}}s in a graph.
1. It should be easy to re-instate a previous state of the {{{RandomOp}}}s in a graph.
== Basic Technical Spec ==
One option would be to skirt the issue by requiring users to pass all the random numbers we might need as input.
However, it is not always simple to know how many random numbers will be required because the shape of a random matrix might be computed within the graph.
The solution proposed here is to pass one or more random number generators as input to {{{theano.function}}}.
Sharing a random number generator between different {{{RandomOp}}} instances makes it difficult to produce the same stream regardless of other ops in the graph, and to keep {{{RandomOp}}}s isolated.
Therefore, each {{{RandomOp}}} instance in a graph will have its very own random number generator.
That random number generator is an input to the function.
In typical usage, we will use the new features of function inputs ({{{value}}}, {{{update}}}) to pass and update the rng for each {{{RandomOp}}}.
By passing RNGs as inputs, it is possible to use the normal methods of accessing function inputs to access each {{{RandomOp}}}'s rng.
In this approach there is no pre-existing mechanism to work with the combined random number state of an entire graph.
So the proposal is to provide the missing functionality (the last three requirements) via auxiliary functions: {{{seed, getstate, setstate}}}.
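The per-op-generator idea can be sketched with the stdlib {{{random}}} module standing in for a {{{RandomOp}}}'s rng input (the class below is a hypothetical stand-in, not Theano code): each op owns its generator, so drawing from one op cannot disturb another, and reseeding one op reproduces exactly its stream.

```python
import random

# Sketch (hypothetical stand-in): each RandomOp instance owns its own
# generator, so its stream is isolated and independently reseedable.
class RandomOpState:
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def draw(self, n):
        return [self.rng.random() for _ in range(n)]

u_rng = RandomOpState(872364)    # own generator for op u
v_rng = RandomOpState(872365)    # own generator for op v

a = u_rng.draw(3)
v_rng.draw(100)                  # drawing from v does not disturb u

u_rng2 = RandomOpState(872364)   # reseeding reproduces u's stream
b = u_rng2.draw(3)               # identical to a
```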
== Syntax ==
{{{
#!python
# create a random generator, providing a default seed to condition how RandomOp instances are produced.
r = MetaRandom(metaseed=872364)
# create a different random generator
rr = MetaRandom(metaseed=99)
# create an Op to produce a stream of random numbers.
# This generates random numbers uniformly between 0.0 and 1.0 excluded
# u will remember that it was made from r.
u = r.uniform(shape=(3,4,5), low=0.0, high=1.0)
# create a second Op for more random numbers
# v will remember that it was made from r.
v = r.uniform(shape=(8,), low=-1.0, high=0.0)
# create a third Op with a different underlying random state
# w will remember that it was made from rr.
w = rr.uniform(shape=(), low=-10., high=10.)
# compile a function to draw random numbers
# note: un-named state inputs will be added automatically.
# note: it is not necessary to draw samples for u, even though
# u was created by r before v.
fn_v = compile.function([], [v])
# this prints some representation of v's rng in fn_v.
# The .rng property works for Result instances produced by MetaRandom.
print fn_v.state[v.rng]
# compile a function to draw each of u, v, w
# note: un-named state inputs will be added automatically
# note: This function (especially its internal state) is independent from fn_v.
fn_uvw = compile.function([], [u,v,w])
# N.B. The random number streams of fn_v and fn_uvw are independent.
assert fn_v.state[v.rng] != fn_uvw.state[v.rng]
fn_v() # returns random numbers A (according to metaseed 872364)
fn_v() # returns different random numbers B
# note that v's stream here is identical to the one in fn_v()
fn_uvw() # returns random numbers C, A, E
#explicitly re-seed v's random stream in fn_v
r.seed(fn_v, 872364)
fn_v() # returns random numbers A (as above)
fn_v() # returns random numbers B (as above)
#re-seed w's random stream in fn_uvw, but not u's or v's
rr.seed(fn_uvw, 99)
fn_uvw() # returns random numbers D, B, E
}}}
== {{{MetaRandom}}} ==
The {{{MetaRandom}}} class is the proposed interface for getting {{{RandomOp}}} instances.
There are some syntactic similarities in the way {{{MetaRandom}}} is used to construct graphs, and the way {{{numpy.RandomState}}} appears in a corresponding procedural implementation. But since theano is symbolic the meaning of {{{MetaRandom}}} is quite different.
As with {{{numpy.RandomState}}} though, a global instance of {{{MetaRandom}}} will be instantiated at import time for the scripter's convenience.
A {{{MetaRandom}}} instance will remember every {{{Result}}} that it returns during its lifetime.
When calling functions like {{{seed, setstate}}}, this list is consulted so that only the streams associated with Results returned by {{{self}}} are modified.
The use of multiple {{{MetaRandom}}} objects in a single function is mostly for debugging (e.g., when you want to synchronize two sets of random number streams).
The typical case is that only one (global) {{{MetaRandom}}} object is used to produce all the random streams in a function, so seeding (once) will reset the entire function.
{{{
class MetaRandom(obj):
def __init__(self, metaseed=<N>): ... # new functions will be initialized so that seed(fn, <N>) has no effect on output.
def __contains__(self, Result): ... # True if Result was returned by a call to self.<distribution>
def results(self): ... # Iterate over returned Result instances in creation order.
def seed(self, fn, bits): ... # See below.
def getstate(self, fn): ... # See below.
def setstate(self, fn, state): ... # See below.
def uniform(...): ... # return a Result of an Apply of a RandomOp.
# The return value is also stored internally for __contains__ and results().
def normal(...): ...
def bernoulli(...): ...
...
}}}
=== {{{MetaRandom.getstate}}} ===
{{{
def getstate(self, fn): ...
}}}
''return''::
list, set, dict, instance... something to store the random number generators associated with every one of {{{self}}}'s members in {{{fn}}}
=== {{{MetaRandom.setstate}}} ===
Re-install the random number generators in {{{rstates}}} to the {{{randomobj}}} members in {{{fn}}}.
{{{
def setstate(self, fn, rstates): ....
}}}
''fn''::
a CompileFunction instance, generally with some Apply instances inside that are members of {{{self}}}.
''rstates''::
a structure returned by a previous call to {{{getstate}}}
''return''::
nothing
=== {{{MetaRandom.seed}}} ===
{{{
def seed(self, fn, bits): ....
}}}
''fn''::
a CompileFunction instance, generally with some Apply instances inside that are members of {{{self}}}.
''bits''::
Something to use as a seed. Typically an integer or list of integers.
''return''::
None
Set the states of self's members in fn in a deterministic way based on bits.
Each member of self should generate independent samples after this call.
Seed is like a dynamically-computed setstate. If the user runs
{{{
r.seed(fn, 99)
state_99 = r.getstate(fn)
}}}
then any time afterward both {{{r.setstate(fn, state_99)}}} and {{{r.seed(fn, 99)}}} will put {{{fn}}} into the same state.
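The seed/getstate/setstate equivalence stated above can be demonstrated with the stdlib {{{random}}} module as a stand-in for a compiled function's rng state (the three functions below are hypothetical stand-ins for {{{r.seed}}}, {{{r.getstate}}} and {{{r.setstate}}}):

```python
import random

# Sketch: seed(99) followed by getstate() yields a state such that
# setstate(state_99) and seed(99) are interchangeable afterwards.
rng = random.Random()

def seed(bits):
    rng.seed(bits)

def getstate():
    return rng.getstate()

def setstate(state):
    rng.setstate(state)

seed(99)
state_99 = getstate()
via_seed = [rng.random() for _ in range(3)]

setstate(state_99)                # restore: same point in the stream
via_setstate = [rng.random() for _ in range(3)]

seed(99)                          # reseed: also the same point
via_reseed = [rng.random() for _ in range(3)]
```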
= Potential Other syntax =
{{{
#!python
# create a random state
r = RandomState(name = 'r')
# create a different random state
rr = RandomState(name = 'rr')
# create an Op to produce a stream of random numbers.
# That stream is a function of r's seed.
# This generates random numbers uniformly between 0.0 and 1.0 excluded
u = r.uniform(shape=(3,4,5), low=0.0, high=1.0)
# create a second Op for more random numbers
# This stream is seeded using a different function of r's seed.
# u and v should be independent
v = r.uniform(shape=(8,), low=-1.0, high=0.0)
# create a third Op with a different underlying random state
w = rr.uniform(shape=(), low=-10., high=10.)
# compile a function to draw random numbers
# note: it is not necessary to draw samples for u.
# we provide the seed for the RandomState r in the inputs list as a "Type 4" input
fn_v = compile.function([(r, 872364)], [v])
# compile a function to draw each of u, v, w
# we provide the seeds for the RandomStates r and rr in the inputs list as "Type 4" inputs
# note: the random state for r here is seeded independently from the one in fn_v, which means
# random number generation of fn_v and fn_uvw will not interfere. Since the seed is the
# same, it means they will produce the same sequence of tensors for the output v.
fn_uvw = compile.function([(r, 872364), (rr, 99)], [u,v,w])
fn_v() # returns random numbers A
fn_v() # returns different random numbers B
# note that v's stream here is identical to the one in fn_v()
fn_uvw() # returns random numbers C, A, E
#re-seed v's random stream in fn
fn_v.r = 872364
### Is this state readable? What should we do here:
print fn_v.r
fn_v() # returns random numbers A
### Is this state well-defined?
### Does there even exist a number such that fn_v.r = N would have no effect on the rng states?
print fn_v.r
fn_v() # returns random numbers B
#re-seed w's random stream, but not u's or v's
fn_uvw.rr = 99
fn_uvw() # returns random numbers D, B, E
}}}
'''An open proposal. This is still relevant. 20080904'''
== Issues ==
There are several issues with the current way C code is generated:
* Ops cannot declare their own persistent variables.
* Reliance on weave, but most of weave's features go unused.
* There could easily be conflicts between support code from different Ops/Results.
* It is currently impossible to specialize support code based on self (the particular Op or Result instance).
* Caching of the generated code for graphs is greatly suboptimal.
== Structure ==
Currently, the general structure of the generated C code is approximately as follows:
{{{
<imports>
<weave type converters>
<op/result support code>
struct my_computation {
<input/output storage>
<persistent fields>
init(<input/output storage>) { <initialize persistent fields> }
cleanup { <clean up persistent fields> }
run { <run the computation> }
};
<runner for the struct>
PyObject* instantiate(PyObject* args) {
<weave stuff>
<make up a CObject out of the runner and a my_computation instance>
<weave stuff>
}
<python exports for instantiate>
}}}
The module produced via that method then has to be used as such:
{{{
obj = module.instantiate(error_storage, input_storage, output_storage, orphan_storage)
cutils.run_cthunk(obj)
}}}
We would like to get rid of weave dependencies, avoid name conflicts with the support code and have a nicer user interface for the produced module. The proposed new structure is as follows:
{{{
<imports>
struct op1 {
<persistent variables>
<support code>
init() { <initialize persistent fields> }
cleanup { <clean up persistent fields> }
run(<inputs>) { <run the computation for op1> }
};
struct op2 { <same> };
...
struct opN { <ditto> };
struct driver {
op1 o1; op2 o2; ... opN oN;
<input storage>
<output storage>
init(<storage>) { <initialize ops, storage> }
cleanup() { <free storage?> }
run() {
<extract inputs>
o1.run(input1, input2);
o2.run(o1.output1);
...
oN.run(...);
<sync outputs>
}
}
PyObject* <name>(PyObject* inputs) {
<init driver, input/output storage>
<put inputs in input storage>
driver.run()
<free input storage>
<return output storage>
}
PyObject* <name>_driver(PyObject* storage) {
<init driver with storage>
<return driver>
}
<export <name> and <name>_driver>
}}}
Gains:
* support code can be put inside a struct and become private to the Op
* we can export several functions that can be used directly, eg {{{z = module.add(1, 2)}}}
* this won't do filtering like {{{Result.filter}}} so the usefulness is limited by that
* the sequence of operations might be clearer to read
* we can use more descriptive names in each Op struct representing its input names (if we can find them using the inspect module) without worrying about name conflicts
Losses:
* maybe gcc can't optimize it as well?
* make functions static and inline as much as possible
== Caching ==
The current caching scheme hashes the generated code. That is inefficient because the code has to be generated each time, which might be a costly process. Furthermore, the use of hashing in sets makes it difficult to ensure a consistent ordering of Ops in graphs where several orderings are valid, so the generated C code is potentially different each time. Here is a proposal for a better way to compute the hash:
* Result_hash = Result version + Result desc
* Op_hash = Op version + Op desc + input/output hashes
* Env_hash = Env version + combination of the Op hashes and their traversal order wrt a consistent traversal method
The version could be set explicitly via a {{{__version__}}} field or it could simply be equal to the file's last modification date. We could also have a {{{__nocache__}}} field indicating that code produced by the Op or Result cannot be cached.
It should also be easier to bypass the cache (eg an option to CLinker to regenerate the code).
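The hierarchical hash can be sketched as follows. The helper names and the use of SHA-1 are illustrative assumptions; the point is that hashing versions and descriptions (rather than generated code) is cheap, and that a consistent traversal order makes the Env hash deterministic.

```python
import hashlib

# Sketch (hypothetical helpers) of the proposed hierarchical hash:
# hash versions and descriptions instead of generated C code, and
# combine child hashes in a consistent traversal order.
def h(*parts):
    m = hashlib.sha1()
    for p in parts:
        m.update(repr(p).encode())
    return m.hexdigest()

def result_hash(version, desc):
    return h("Result", version, desc)

def op_hash(version, desc, input_hashes, output_hashes):
    return h("Op", version, desc, input_hashes, output_hashes)

def env_hash(version, op_hashes_in_traversal_order):
    return h("Env", version, op_hashes_in_traversal_order)

x = result_hash("1.0", "float64 scalar")
y = result_hash("1.0", "float64 scalar")
z = result_hash("1.0", "float64 scalar")
add = op_hash("1.0", "add", [x, y], [z])

# same graph, same traversal order -> same Env hash, no codegen needed
e1 = env_hash("1.0", [add])
e2 = env_hash("1.0", [op_hash("1.0", "add", [x, y], [z])])
```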