machinery builds a DAG (Directed Acyclic Graph) representing the
computation, a graph that theano can compile and optimize.

Automatic wrapping
------------------

All nodes in the graph must be instances of ``Apply`` or ``Result``, but
``<Op subclass>.make_node()`` typically wraps constants to satisfy those
constraints. For example, the :ref:`tensor.add` op instance is written
so that:

.. code-block:: python

    e = scalar('x') + 1

builds the following graph:

.. code-block:: python

    node = Apply(op=add,
                 inputs=[Result(type=float64_scalar, name='x'),
                         Constant(type=int64_scalar, data=1)],
                 outputs=[Result(type=float64_scalar)])
    e = node.outputs[0]
'''Stale specification page. Upgrade this to provide useful developer doc. 2008.09.04'''
== Definitions ==
The elementwise compiler takes inputs {{{(in0, in1, in2, ...)}}}, outputs {{{(out0, out1, out2, ...)}}}, and broadcast modes {{{(mod0, mod1, mod2, ...)}}}, where each mode corresponds to an output. It also takes {{{order}}}, which determines whether we broadcast/accumulate over the first or the last dimensions (the looping order, basically, but some operations are only valid for one particular order!).
The broadcast mode serves to calculate the rank of the corresponding output and how to map each input element to an output element:
* {{{broadcast}}}
* output.rank = max(input.rank)
* the inputs of lesser rank are broadcasted over missing dimensions
* if {{{order == f}}} ([3, 5], [5]) => [3, 5] or ([7, 8, 9], [8, 9]) => [7, 8, 9]
* if {{{order == c}}} ([3, 5], [3]) => [3, 5] or ([7, 8, 9], [7, 8]) => [7, 8, 9]
* {{{(accumulate, Accumulator)}}}
* output.rank = min(input.rank)
* for the inputs of greater rank, we use Accumulator (sum, product, etc.) to accumulate over the first dimensions
* e.g. {{{if Accumulator == sum, order == c, x.rank == 2, y.rank == 1 and z = f(x, y) then z[i] = f(sum_j(x[i, j]), y[i])}}}
* if {{{order == f}}} ([3, 5], [5]) => [5] or ([7, 8, 9], [8, 9]) => [8, 9]
* if {{{order == c}}} ([3, 5], [3]) => [3] or ([7, 8, 9], [7, 8]) => [7, 8]
{{{order == c}}} is equivalent to transposing the outputs of an {{{order == f}}} operation on transposed inputs.
This does not cover all cases of broadcasting, but I believe it covers enough. Other cases of broadcasting can be emulated with proper transposition and/or slicing.
* Could you give some examples of what kinds of broadcasting are and are not covered by your proposed implementation?
* For rank <= 2, I think only operations of the form {{{add(ones(3,1), ones(1,3))}}} are missing. I actually didn't think of that one before now.
* In general, it only handles f(shape(head, ...), shape(head, ...), ...) and f(shape(..., tail), shape(..., tail), ...)
* Maybe I could add a general case later... the thing is that I think the ones I am considering here are easier to streamline.
Point of clarification: the order discussed here corresponds to a set of broadcasting rules, and is independent of the storage order. The 'f' order corresponds to numpy's broadcasting rules, while the 'c' order is something new and different (TODO VERIFY!)
Question: does it make sense to apply the order to the loop, or is this broadcast order something which will be local to each input argument. What happens when the elemwise compiler deals with more complex subgraphs with multiple inputs and outputs?
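As a concrete check of these rules, here is a small numpy sketch. Numpy itself follows the 'f'-style trailing-dimension alignment; the 'c' order is emulated via the transposition trick described above:

```python
import numpy as np

# order == f: align trailing dimensions (numpy's own broadcasting rules)
x = np.ones((3, 5))
y = np.ones(5)
assert (x + y).shape == (3, 5)          # ([3, 5], [5]) => [3, 5]

# order == c: align leading dimensions; emulate by transposing the
# inputs and transposing the result back
x2 = np.ones((3, 5))
y2 = np.ones(3)
z = (x2.T + y2.T).T                     # ([3, 5], [3]) => [3, 5]
assert z.shape == (3, 5)
```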
== The loop ==
Here is the loop for {{{order == c}}}. Check for errors!
{{{
<initialize iterators>
i1 = -1
while (++i1 < dim1) {
i2 = -1
rank_N-1_accumulator = init
while (++i2 < dim2) {
...
iN = -1
while (++iN < dimN) {
<accumulate rank N input>
<SET rank N output using broadcasted inputs>
<NEXT rank N iterator>
}
...
}
<SET rank 1 output using accumulated inputs>
<NEXT rank 1 iterator>
}
}}}
When {{{order == f}}}, the iterators ''ideally'' (but not necessarily) iterate in FORTRAN order, i.e. the while loops are on {{{dimN..dim1}}} instead of {{{dim1..dimN}}}.
{{{order}}} does __not__ represent the {{{C/F_CONTIGUOUS}}} flags of the inputs or outputs. Depending on combinations of those parameters, different loops will be used. If {{{order == f and C_CONTIGUOUS(array)}}}, for example, the loop will be on {{{dim1..dimN}}} and the matrices of lesser rank will need to be looped over several times.
An Optimizer should look at the operations in the graph and figure out whether to allocate C_CONTIGUOUS (ideal for {{{order == c}}}) or F_CONTIGUOUS (ideal for {{{order == f}}}) arrays.
== Gradient ==
The input ranks become the output ranks and gradients of the same rank as the outputs are added to the input list. If an output was given mode {{{broadcast}}}, then all inputs used to calculate it had to be broadcasted to that shape, so we must sum over the broadcasted dimensions on the gradient. The mode that we give to those inputs is therefore {{{(accumulate, sum)}}}. Conversely, if an output was given mode {{{(accumulate, sum)}}}, then all inputs used to calculate it had to be summed over those dimensions. Therefore, we give them mode {{{broadcast}}} in grad. Accumulators other than sum might prove more difficult. For example, the ith gradient for product is grad*product/x_i. It is not clear how to handle that automatically.
* I don't exactly follow this paragraph, but I think I catch the general idea and it seems to me like it will work very well.
* In a nutshell for {{{broadcast}}} I calculate the gradient as normal assuming the shape is broadcasted and then I sum over what I had to broadcast.
* Could you explain why the accumulator gradient (e.g. product) can be trickier?
* I thought about it and I figured that the general case is {{{g_accum[N-i+1], g_m[i] = grad_fn(accum[i-1], m[i], g_accum[N-i])}}} where {{{g_accum}}} is the accumulated gradient wrt the accumulator {{{accum}}}. It can be short-circuited in sum and product's case: for sum, grad_fn is the identity on its last argument so {{{g_m[i] == g_accum[i] == g_accum[0] == g_z for all i}}}. In product's case, {{{accum[i-1] == product(m[1:i-1]) and g_accum[N-i] == g_z * product(m[i+1:N])}}}, multiply them together and you obtain {{{g_z * product(m)/m[i]}}} where obviously we only need to compute {{{product(m)}}} once. It's worth handling those two special cases, for the general case I don't know.
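The broadcast/accumulate duality and the product special case can both be checked numerically in numpy (a sketch; the variable names are illustrative):

```python
import numpy as np

# forward: x (rank 1) is broadcast against w (rank 2)
x = np.array([1.0, 2.0, 3.0])
w = np.ones((4, 3))
z = w * x                        # mode broadcast: x stretched over dim 0

g_z = np.ones_like(z)            # incoming gradient, same shape as z
g_x = (g_z * w).sum(axis=0)      # broadcast in forward => sum in grad
assert g_x.shape == x.shape      # g_x == [4., 4., 4.]

# product accumulator short-circuit: g_m[i] = g_z * prod(m) / m[i]
m = np.array([2.0, 3.0, 4.0])
g_out = 1.0
g_m = g_out * m.prod() / m       # [12., 8., 6.]
```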
'''Seed of discussion for what an interactive debugging tool might look like. 2009.03.27.'''
== Interactive debugger ( #352 ) ==
The interactive debugger should allow the user to go step by step in a graph to debug it. It should allow setting breakpoints on arbitrary Ops or subgraphs. If we can group ops by the user's function that defined them, we could have a logical grouping of the graph into subgraphs.
The debugger should save the inputs at each step so the user loses no info through inplace operations. Ideally, the debugger should be a normal python shell enriched with commands to control the flow, and all the inputs should be made available so the user can use numpy interactively on them.
Command wishlist
* py_perform (perform the current operation using the python implementation)
* c_perform (perform the current operation using the C implementation)
* perform (use the Linker's preference)
* get_inputs (get the inputs of the current op)
* set_inputs (set the inputs of the current op)
* get_outputs (get the outputs of the current op)
* set_outputs (set the outputs of the current op (bypasses its perform))
* next (perform and go to the next breakpoint)
* breakpoint (set a breakpoint on the current Op or subgraph)
* step (perform and go to the next Op or subgraph)
* step_in (go to the first Op inside the current subgraph)
* step_out (exit the subgraph containing this Op)
* Of course, normal python shell functionality!
* The global context where the debugger was called (so the user can define his own helper functions, etc.)
A good, simple way to do it would be to have those commands as methods of a structure that would be returned by a DebugLinker. This would allow an interactive session like the following:
{{{
>>> a, b, c = Tensor(), Tensor(), Tensor()
>>> d = b * c
>>> e = a + d
>>> debug = DebugLinker(Env([a, b, c], [e])).make_function()
>>> debug.set_breakpoint(d)
>>> debug.debug(10, 20, 30) # a, b, c = 10, 20, 30
Now at: Mul(b, c)
Context: d = b * c
>>> debug.get_inputs() # we are at the node d = b * c
[20, 30]
>>> debug.get_outputs()
[None]
>>> debug.py_perform()
>>> debug.get_outputs()
[600]
>>> debug.step()
Now at: Add(a, Mul)
Context: e = a + d
>>> debug.get_inputs()
[30, 600]
>>> debug.step()
Finished.
[630]
>>>
}}}
''' This has been implemented (#182). 20090327.'''
= Random Numbers =
== Requirements ==
Theano functions sometimes need random numbers.
Random operations are not as simple as other operations such as ones_like, or pow(), because the output must be different when we call the same function repeatedly. CompileFunction's new default-valued, updatable input variables make this possible. At the same time we need random streams to be repeatable, and easy to work with. So the basic requirements of our random number mechanism are:
1. Internal random number generators must be used in a clear manner, and be accessible to the caller after a function has been compiled.
1. A random-number-producing Op (from now on: {{{RandomOp}}}) should generally produce exactly the same stream of random numbers regardless of any other {{{RandomOp}}} instances in its own graph, and any other times the graph was compiled.
1. A {{{RandomOp}}}'s stream should be isolated from other {{{RandomOp}}} instances in a compiled graph, so that it is possible to adjust any one {{{RandomOp}}} independently from the others.
1. It should be easy to put the {{{RandomOp}}}s in a graph into a state from which their outputs are all independent.
1. It should be easy to save the current state of the {{{RandomOp}}}s in a graph.
1. It should be easy to re-instate a previous state of the {{{RandomOp}}}s in a graph.
== Basic Technical Spec ==
One option would be to skirt the issue by requiring users to pass all the random numbers we might need as input.
However, it is not always simple to know how many random numbers will be required because the shape of a random matrix might be computed within the graph.
The solution proposed here is to pass one or more random number generators as input to {{{theano.function}}}.
Sharing a random number generator between different {{{RandomOp}}} instances makes it difficult to produce the same stream regardless of other ops in the graph, and to keep {{{RandomOp}}}s isolated.
Therefore, each {{{RandomOp}}} instance in a graph will have its very own random number generator.
That random number generator is an input to the function.
In typical usage, we will use the new features of function inputs ({{{value}}}, {{{update}}}) to pass and update the rng for each {{{RandomOp}}}.
By passing RNGs as inputs, it is possible to use the normal methods of accessing function inputs to access each {{{RandomOp}}}'s rng.
In this approach there is no pre-existing mechanism to work with the combined random number state of an entire graph.
So the proposal is to provide the missing functionality (the last three requirements) via auxiliary functions: {{{seed, getstate, setstate}}}.
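The "rng as default-valued, updatable input" idea can be sketched in plain numpy: each draw consumes an rng state and returns the samples along with the updated state (the function name here is illustrative, not the proposed API):

```python
import numpy as np

def draw_uniform(rng_state, shape):
    # consume an rng state, return (samples, updated state)
    rng = np.random.RandomState()
    rng.set_state(rng_state)
    sample = rng.uniform(size=shape)
    return sample, rng.get_state()

state = np.random.RandomState(872364).get_state()
s1, state = draw_uniform(state, (3,))   # the state threads through
s2, state = draw_uniform(state, (3,))   # so successive draws differ

# repeatable: restarting from the same seed reproduces the stream
s1b, _ = draw_uniform(np.random.RandomState(872364).get_state(), (3,))
assert np.allclose(s1, s1b)
```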
== Syntax ==
{{{
#!python
# create a random generator, providing a default seed to condition how RandomOp instances are produced.
r = MetaRandom(metaseed=872364)
# create a different random generator
rr = MetaRandom(metaseed=99)
# create an Op to produce a stream of random numbers.
# This generates random numbers uniformly between 0.0 and 1.0 excluded
# u will remember that it was made from r.
u = r.uniform(shape=(3,4,5), low=0.0, high=1.0)
# create a second Op for more random numbers
# v will remember that it was made from r.
v = r.uniform(shape=(8,), low=-1.0, high=0.0)
# create a third Op with a different underlying random state
# w will remember that it was made from rr.
w = rr.uniform(shape=(), low=-10., high=10.)
# compile a function to draw random numbers
# note: un-named state inputs will be added automatically.
# note: it is not necessary to draw samples for u, even though
# u was created by r before v.
fn_v = compile.function([], [v])
# this prints some representation of v's rng in fn_v.
# The .rng property works for Result instances produced by MetaRandom.
print fn_v.state[v.rng]
# compile a function to draw each of u, v, w
# note: un-named state inputs will be added automatically
# note: This function (especially its internal state) is independent from fn_v.
fn_uvw = compile.function([], [u,v,w])
# N.B. The random number streams of fn_v and fn_uvw are independent.
assert fn_v.state[v.rng] != fn_uvw.state[v.rng]
fn_v() # returns random numbers A (according to metaseed 872364)
fn_v() # returns different random numbers B
# note that v's stream here is identical to the one in fn_v()
fn_uvw() # returns random numbers C, A, E
#explicitly re-seed v's random stream in fn_v
r.seed(fn_v, 872364)
fn_v() # returns random numbers A (as above)
fn_v() # returns random numbers B (as above)
#re-seed w's random stream in fn_uvw, but not u's or v's
rr.seed(fn_uvw, 99)
fn_uvw() # returns random numbers D, B, E
}}}
== {{{MetaRandom}}} ==
The {{{MetaRandom}}} class is the proposed interface for getting {{{RandomOp}}} instances.
There are some syntactic similarities in the way {{{MetaRandom}}} is used to construct graphs, and the way {{{numpy.RandomState}}} appears in a corresponding procedural implementation. But since theano is symbolic the meaning of {{{MetaRandom}}} is quite different.
As with {{{numpy.RandomState}}} though, a global instance of {{{MetaRandom}}} will be instantiated at import time for the scripter's convenience.
A {{{MetaRandom}}} instance will remember every {{{Result}}} that it returns during its lifetime.
When calling functions like {{{seed, setstate}}}, this list is consulted so that only the streams associated with Results returned by {{{self}}} are modified.
The use of multiple {{{MetaRandom}}} objects in a single function is mostly for debugging (e.g., when you want to synchronize two sets of random number streams).
The typical case is that only one (global) {{{MetaRandom}}} object is used to produce all the random streams in a function, so seeding (once) will reset the entire function.
{{{
class MetaRandom(obj):
def __init__(self, metaseed=<N>): ... # new functions will be initialized so that seed(fn, <N>) has no effect on output.
def __contains__(self, Result): ... # True if Result was returned by a call to self.<distribution>
def results(self): ... # Iterate over returned Result instances in creation order.
def seed(self, fn, bits): ... # See below.
def getstate(self, fn): ... # See below.
def setstate(self, fn, state): ... # See below.
def uniform(...): ... # return a Result of an Apply of a RandomOp.
# The return value is also stored internally for __contains__ and results().
def normal(...): ...
def bernoulli(...): ...
...
}}}
=== {{{MetaRandom.getstate}}} ===
{{{
def getstate(self, fn): ...
}}}
''return''::
list, set, dict, instance... something to store the random number generators associated with every one of {{{self}}}'s members in {{{fn}}}
=== {{{MetaRandom.setstate}}} ===
Re-install the random number generators in {{{rstates}}} to the {{{randomobj}}} members in {{{fn}}}
{{{
def setstate(self, fn, rstates): ....
}}}
''fn''::
a CompileFunction instance, generally with some Apply instances inside that are members of {{{self}}}.
''rstates''::
a structure returned by a previous call to {{{getstate}}}
''return''::
nothing
=== {{{MetaRandom.seed}}} ===
{{{
def seed(self, fn, bits): ....
}}}
''fn''::
a CompileFunction instance, generally with some Apply instances inside that are members of {{{self}}}.
''bits''::
Something to use as a seed. Typically an integer or list of integers.
''return''::
None
Set the states of self's members in fn in a deterministic way based on bits.
Each member of self should generate independent samples after this call.
Seed is like a dynamically-computed setstate. If the user runs
{{{
r.seed(fn, 99)
state_99 = r.getstate(fn)
}}}
then any time afterward both {{{r.setstate(fn, state_99)}}} and {{{r.seed(fn, 99)}}} will put {{{fn}}} into the same state.
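This contract can be illustrated with {{{numpy.random.RandomState}}}, which already exhibits the same seed/get_state/set_state relationship:

```python
import numpy as np

rng = np.random.RandomState(99)     # like r.seed(fn, 99)
state_99 = rng.get_state()          # like state_99 = r.getstate(fn)
a = rng.uniform(size=3)

rng.set_state(state_99)             # setstate(fn, state_99) ...
b = rng.uniform(size=3)
rng.seed(99)                        # ... and seed(fn, 99) ...
c = rng.uniform(size=3)
assert np.allclose(a, b) and np.allclose(a, c)   # ... give the same stream
```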
= Potential Other syntax =
{{{
#!python
# create a random state
r = RandomState(name = 'r')
# create a different random state
rr = RandomState(name = 'rr')
# create an Op to produce a stream of random numbers.
# That stream is a function of r's seed.
# This generates random numbers uniformly between 0.0 and 1.0 excluded
u = r.uniform(shape=(3,4,5), low=0.0, high=1.0)
# create a second Op for more random numbers
# This stream is seeded using a different function of r's seed.
# u and v should be independent
v = r.uniform(shape=(8,), low=-1.0, high=0.0)
# create a third Op with a different underlying random state
w = rr.uniform(shape=(), low=-10., high=10.)
# compile a function to draw random numbers
# note: it is not necessary to draw samples for u.
# we provide the seed for the RandomState r in the inputs list as a "Type 4" input
fn_v = compile.function([(r, 872364)], [v])
# compile a function to draw each of u, v, w
# we provide the seeds for the RandomStates r and rr in the inputs list as "Type 4" inputs
# note: the random state for r here is seeded independently from the one in fn_v, which means
# random number generation of fn_v and fn_uvw will not interfere. Since the seed is the
# same, it means they will produce the same sequence of tensors for the output v.
fn_uvw = compile.function([(r, 872364), (rr, 99)], [u,v,w])
fn_v() # returns random numbers A
fn_v() # returns different random numbers B
# note that v's stream here is identical to the one in fn_v()
fn_uvw() # returns random numbers C, A, E
#re-seed v's random stream in fn_v
fn_v.r = 872364
### Is this state readable? What should we do here:
print fn_v.r
fn_v() # returns random numbers A
### Is this state well-defined?
### Does there even exist a number such that fn_v.r = N would have no effect on the rng states?
print fn_v.r
fn_v() # returns random numbers B
#re-seed w's random stream, but not u's or v's
fn_uvw.rr = 99
fn_uvw() # returns random numbers D, B, E
}}}
'''An open proposal. This is still relevant. 20080904'''
== Issues ==
There are several issues with the current way C code is generated:
* Ops cannot declare their own persistent variables.
* Reliance on weave, but most of weave's features go unused.
* There could easily be conflicts between support code from different Ops/Results.
* It is currently impossible to specialize support code based on the Op instance ({{{self}}}).
* Caching of the generated code for graphs is greatly suboptimal.
== Structure ==
Currently, the general structure of the generated C code is approximately as follows:
{{{
<imports>
<weave type converters>
<op/result support code>
struct my_computation {
<input/output storage>
<persistent fields>
init(<input/output storage>) { <initialize persistent fields> }
cleanup { <clean up persistent fields> }
run { <run the computation> }
};
<runner for the struct>
PyObject* instantiate(PyObject* args) {
<weave stuff>
<make up a CObject out of the runner and a my_computation instance>
<weave stuff>
}
<python exports for instantiate>
}}}
The module produced via that method then has to be used as such:
{{{
obj = module.instantiate(error_storage, input_storage, output_storage, orphan_storage)
cutils.run_cthunk(obj)
}}}
We would like to get rid of weave dependencies, avoid name conflicts with the support code and have a nicer user interface for the produced module. The proposed new structure is as follows:
{{{
<imports>
struct op1 {
<persistent variables>
<support code>
init() { <initialize persistent fields> }
cleanup { <clean up persistent fields> }
run(<inputs>) { <run the computation for op1> }
};
struct op2 { <same> };
...
struct opN { <ditto> };
struct driver {
op1 o1; op2 o2; ... opN oN;
<input storage>
<output storage>
init(<storage>) { <initialize ops, storage> }
cleanup() { <free storage?> }
run() {
<extract inputs>
o1.run(input1, input2);
o2.run(o1.output1);
...
oN.run(...);
<sync outputs>
}
}
PyObject* <name>(PyObject* inputs) {
<init driver, input/output storage>
<put inputs in input storage>
driver.run()
<free input storage>
<return output storage>
}
PyObject* <name>_driver(PyObject* storage) {
<init driver with storage>
<return driver>
}
<export <name> and <name>_driver>
}}}
Gains:
* support code can be put inside a struct and become private to the Op
* we can export several functions that can be used directly, eg {{{z = module.add(1, 2)}}}
* this won't do filtering like {{{Result.filter}}} so the usefulness is limited by that
* the sequence of operations might be clearer to read
* we can use more descriptive names in each Op struct representing its input names (if we can find them using the inspect module) without worrying about name conflicts
Losses:
* maybe gcc can't optimize it as well?
* make functions static and inline as much as possible
== Caching ==
The current way of caching is based on a hash of the generated code. That is inefficient because the code has to be generated each time, which might be a costly process. Furthermore, the use of hashing in sets makes it difficult to ensure a consistent ordering of Ops in graphs where several orderings are valid, so the generated C code is potentially different each time. Here is a proposal for a better way to compute the hash:
* Result_hash = Result version + Result desc
* Op_hash = Op version + Op desc + input/output hashes
* Env_hash = Env version + combination of the Op hashes and their traversal order wrt a consistent traversal method
The version could be set explicitly via a {{{__version__}}} field or it could simply be equal to the file's last modification date. We could also have a {{{__nocache__}}} field indicating that code produced by the Op or Result cannot be cached.
It should also be easier to bypass the cache (eg an option to CLinker to regenerate the code).
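The proposed hierarchical hash can be sketched in Python; the class and field names here are illustrative stand-ins, not the real Theano classes:

```python
class Result:
    __version__ = 1
    def __init__(self, desc):
        self.desc = desc
    def cache_hash(self):
        # Result_hash = Result version + Result desc
        return hash((self.__version__, self.desc))

class Op:
    __version__ = 1
    def __init__(self, desc, inputs, outputs):
        self.desc, self.inputs, self.outputs = desc, inputs, outputs
    def cache_hash(self):
        # Op_hash = Op version + Op desc + input/output hashes
        return hash((self.__version__, self.desc,
                     tuple(r.cache_hash() for r in self.inputs),
                     tuple(r.cache_hash() for r in self.outputs)))

def env_hash(ops_in_traversal_order, env_version=1):
    # Env_hash = Env version + Op hashes in a consistent traversal order
    return hash((env_version,)
                + tuple(op.cache_hash() for op in ops_in_traversal_order))

# two structurally identical graphs get the same key
# without any code generation taking place
g1 = [Op('add', [Result('float64')], [Result('float64')])]
g2 = [Op('add', [Result('float64')], [Result('float64')])]
assert env_hash(g1) == env_hash(g2)
```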