Randaly comments on [video] Paul Christiano’s impromptu tutorial on AIXI and TDT

Randaly 11 Apr 2012 1:28 UTC
16 points
Here’s a transcript of the first half, with my changes to the original in brackets:

OK, so AIXI is a really simple algorithm. The setting is: we have some sequence of input bits to an algorithm A. It’s a stream: i1, i2, i3, etc. And you have an output, a1, a2, a3, etc. And this works in the obvious way: if the input’s i1, the output’s a1, etc. So the algorithm’s going to be, ‘A maintains a distribution over possible worlds W.’ So it’s going to model the world W as another step, that takes its actions and then returns the next input.
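
A minimal Python sketch of this setup may help. Everything named here (`run_interaction`, `echo_world`, `flip_agent`) is a hypothetical toy for illustration, not something from the talk. The world produces an input, the agent answers with an output, and the world sees all outputs so far when it produces the next input.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: a world maps the agent's past
# actions to the next input bit; an agent maps past inputs to an action bit.
World = Callable[[List[int]], int]
Agent = Callable[[List[int]], int]

def run_interaction(world: World, agent: Agent, steps: int) -> Tuple[List[int], List[int]]:
    """Alternate world inputs i1, i2, ... with agent outputs a1, a2, ..."""
    inputs: List[int] = []
    actions: List[int] = []
    for _ in range(steps):
        inputs.append(world(actions))   # the world produces the next input
        actions.append(agent(inputs))   # the agent responds with an output
    return inputs, actions

def echo_world(actions: List[int]) -> int:
    """Toy world: the first input is 0, afterwards it echoes the last action."""
    return actions[-1] if actions else 0

def flip_agent(inputs: List[int]) -> int:
    """Toy agent: outputs the opposite of the latest input."""
    return 1 - inputs[-1]

inputs, actions = run_interaction(echo_world, flip_agent, 4)
```

Here the agent is a fixed rule; AIXI replaces `flip_agent` with a procedure that models the world and picks actions to maximize reward.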

[1:05]

So the world produces this first input [i1], then A produces this first output [a1], and then the world produces the next input [i2]. And so we can imagine this as being, like A is a computer, and W is everything else in the world. The input stream i1, i2, i3… is the input wire from the world, and the output stream a1, a2, a3… is the output back to the world. Physics [takes the output and the prior world-state and generates the next world-state, which provides the next input]. So, ‘A maintains a distribution over possible worlds W.’ Do you know much about algorithmic information theory? Nothing too deep there, just the definitions. Listener: Such as Solomonoff Induction, Kolmogorov Complexity? So, the universal prior is just this [prior probability that’s equal to the sum of 2^-l(p) over all programs p that output the given bits, where l(p) is the length of p].
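
Written out, the bracketed definition above is roughly the following. The symbols $M$, $U$, and $\ell$ are notation assumed here, not taken from the talk: $U$ is a fixed universal prefix machine, and $U(p) = x*$ means that program $p$ outputs a string beginning with the bits $x$.

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}
```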

[2:02]

It starts with a description [indecipherable] a “uniform” distribution over programs. Obviously, in a uniform distribution, like in some prefix-free encoding, you want the shorter program to be more likely. So we start with a uniform distribution over programs and condition on agreement with observations, amongst all possible programs for the world that would have produced this sequence of inputs [i1, i2, i3...] in response to the outputs from AIXI. So it’s conditioned based on this uniform distribution [the universal prior]. Let’s call this step k...I’ve completely diverged from my original notation, but it’s OK.
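
As a rough illustration of "condition on agreement with observations", here is a toy Python sketch. It assumes a tiny hand-written set of three "programs" in place of all programs, and the names `PROGRAMS`, `consistent`, and `posterior` are hypothetical. Each program is weighted by 2^-length, programs that disagree with the observed inputs are dropped, and the rest are renormalized.

```python
from fractions import Fraction
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "programs": each has a bit-string code (its length stands
# in for l(p)) and a function from the agent's past actions to the next input.
# The codes form a prefix-free set, so the weights 2^-l(p) sum to at most 1.
Program = Tuple[str, Callable[[List[int]], int]]

PROGRAMS: List[Program] = [
    ("0",   lambda actions: 0),                              # always outputs 0
    ("10",  lambda actions: 1),                              # always outputs 1
    ("110", lambda actions: actions[-1] if actions else 0),  # echoes last action
]

def consistent(world, actions: List[int], inputs: List[int]) -> bool:
    """True if `world` would have produced `inputs` in response to `actions`."""
    return all(world(actions[:t]) == inputs[t] for t in range(len(inputs)))

def posterior(actions: List[int], inputs: List[int]) -> Dict[str, Fraction]:
    """Weight each program by 2^-length, drop inconsistent ones, renormalize."""
    weights = {
        code: Fraction(1, 2 ** len(code))
        for code, world in PROGRAMS
        if consistent(world, actions, inputs)
    }
    total = sum(weights.values())
    return {code: w / total for code, w in weights.items()}

# History: inputs 0, 1, with the agent having played 1 after the first input.
post = posterior(actions=[1], inputs=[0, 1])
```

After seeing only the first input 0, the two surviving programs keep their relative prior weights; after the second input, only the echoing program still agrees with the history.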

[3:01]

So at step k, the machine has a distribution over possible worlds Wk, and it’s deciding what action to take. Along with the input, we’re gonna receive a reward signal r1, r2, r3… The goal is to maximize the reward signal. Now, once we have this reward signal and the model of the world, it’s just going to choose the action which maximizes the reward.

[4:11]

So, “Select ak = argmax(Expected Value over possible worlds Wk of r(k+1)|ak=a)” So, this notation is really lying. A lot. This [gesturing to Wk] is choosing a program to take the place of the world; it takes the output and gives out the input and reward. So when I’m conditioning here [|ak=a], I’m not really conditioning.

[5:04]

I actually mean, for each program, if you set the kth action to be a, and you run that program to see what the world would do. So |ak=a means set the kth action to be a, temporarily, then take the average over all possible worlds and see what your reward would be. And then you get this quantity for each different action the machine could choose. This is kind of the simplest algorithm- this is kind of silly, this algorithm, because it’s just very greedily maximizing reward, only doing it for the next step. And so basically, all AIXI does is it looks ahead for some horizon h, like it looks ahead h steps and tries to maximize its reward over those h steps.
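
The greedy, one-step version can be sketched in Python as follows. This is a toy under stated assumptions: two hand-written worlds with made-up weights stand in for the distribution Wk, and the names `likes_ones`, `likes_zeros`, `expected_next_reward`, and `greedy_action` are hypothetical. Note that "|ak=a" is implemented by setting the action and running each world program, not by conditioning.

```python
from fractions import Fraction
from typing import Callable, List, Tuple

# Hypothetical toy setup: a world maps the full action history to
# (next input, next reward). The weights stand in for the distribution Wk.
World = Callable[[List[int]], Tuple[int, int]]

def likes_ones(actions: List[int]) -> Tuple[int, int]:
    """Pays reward 1 whenever the latest action is 1."""
    reward = 1 if actions and actions[-1] == 1 else 0
    return 0, reward

def likes_zeros(actions: List[int]) -> Tuple[int, int]:
    """Pays reward 1 whenever the latest action is 0."""
    reward = 1 if actions and actions[-1] == 0 else 0
    return 0, reward

WORLDS: List[Tuple[World, Fraction]] = [
    (likes_ones, Fraction(3, 4)),
    (likes_zeros, Fraction(1, 4)),
]

def expected_next_reward(worlds, past_actions: List[int], a: int) -> Fraction:
    """Set the next action to a, run every world, and average the rewards."""
    return sum(w * world(past_actions + [a])[1] for world, w in worlds)

def greedy_action(worlds, past_actions: List[int], candidates=(0, 1)) -> int:
    """Pick the action with the highest expected reward on the next step only."""
    return max(candidates, key=lambda a: expected_next_reward(worlds, past_actions, a))

best = greedy_action(WORLDS, [])
```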

[6:01]

So, just, instead it uses the algorithm: “Select ak = argmax(Expected Value over possible worlds Wk of r(k+1)+r(k+2)+r(k+3)+...+r(k+h)|ak=a, a(k+1)=argmax(...), a(k+2)=argmax(...), …, a(k+h-1)=argmax(...))” This starts off the same, up to |ak=a), but to look ahead h steps we also need to model our own behavior over this period. So what I mean is that we can use this same algorithm and [use it to find a(k+1)], our own action in subsequent rounds.

[7:14]

We’re taking into account the fact that in future rounds our model will be different, because we will have observed one more piece of information, which will allow our future selves to update their model. That’s sort of like the most naive possible thing, in some sense. I’ve glossed over a small number of technical details, but I think that was pretty good. And then you can prove optimality results, and that sort of thing.
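
A toy Python sketch of the horizon-h lookahead, assuming deterministic hand-written worlds with made-up weights (the names `plan`, `world_a`, and `world_b` are hypothetical). Grouping the worlds by the percept they would produce plays the role of the future self updating its model on one more observation before it plans again.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Callable, List, Tuple

World = Callable[[List[int]], Tuple[int, int]]   # actions -> (input, reward)
Weighted = List[Tuple[World, Fraction]]

def plan(worlds: Weighted, actions: List[int], h: int, candidates=(0, 1)) -> Tuple[Fraction, int]:
    """Return (expected total reward over the next h steps, best first action)."""
    if h == 0:
        return Fraction(0), candidates[0]
    total = sum(w for _, w in worlds)
    best_value, best_action = None, candidates[0]
    for a in candidates:
        # Run every candidate world with the action set to a, and group the
        # worlds by the (input, reward) percept they would produce.
        groups = defaultdict(list)
        for world, w in worlds:
            groups[world(actions + [a])].append((world, w))
        value = Fraction(0)
        for (_, reward), subset in groups.items():
            p = sum(w for _, w in subset) / total
            # The future self sees this percept, keeps only the worlds that
            # agree with it, and plans again with one step less.
            future_value, _ = plan(subset, actions + [a], h - 1, candidates)
            value += p * (reward + future_value)
        if best_value is None or value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

def world_a(actions: List[int]) -> Tuple[int, int]:
    """Always shows input 0; from the second step on, pays 1 for action 0."""
    return 0, 1 if len(actions) >= 2 and actions[-1] == 0 else 0

def world_b(actions: List[int]) -> Tuple[int, int]:
    """Always shows input 1; from the second step on, pays 1 for action 1."""
    return 1, 1 if len(actions) >= 2 and actions[-1] == 1 else 0

WORLDS: Weighted = [(world_a, Fraction(1, 2)), (world_b, Fraction(1, 2))]
value_h2, first_action = plan(WORLDS, [], 2)
```

In this toy case the first input reveals which world the agent is in, so the planner counts on its future self choosing the right second action in each case. A plan fixed in advance, without the update, would only earn the reward in one of the two worlds.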

[8:01]

If you actually run this program against the universe, [indecipherable], it will choose optimal results, aside from the fact that there’s a horizon. But over just the first h steps, the result will be optimal. OK, that’s all! Now, I guess, one issue with this model is that it artificially splits the world into the agent and the non-agent. This sounds reasonable, because it’s how we reason about the world intuitively: there’s stuff going on inside your brain and outside your brain. But this is not really cutting reality at a joint. [indecipherable] Your brain is just another thing that exists in the world.

[9:03]

This is what I mean when I say that decision theories are Cartesian dualists, which is a poetic way to put it. The goal is now to have an agent that’s aware of the fact that it’s just more stuff happening. To motivate this, we can see what happens to AIXI if we build an approximate version and put it in the world. I’m going to gloss over the details of the approximation of Solomonoff Induction that we use. (In real life, Solomonoff Induction takes far too long to ever be usable, but again, we’re going to gloss over this.)

[10:01]

Drawing: A box labeled ‘A’, with an input wire with attached sensors, and an output wire with attached actuator. Note also that in the AIXI paper they define this thing, I don’t even know what they call it, something like AIXI_ti [AIXItl in Hutter’s paper], which is a time and space bounded version of AIXI. We’re not going to deal with that, because this should still make clear what’s wrong with dualism. So this, the box A, is an approximate version of AIXI, approximate because you can’t actually compute these maxes. We imagine, when we design this algorithm, that we’re putting A in a box and calling everything else W, for World.

[11:00]

What we want is for A to learn to model W. Physics supplies values on in, and takes values on out. Normal physics operates outside of A; it tells you what voltages come on the “in” wire, given what happens in the sensors, and it also tells you what the actuators do, once you’ve supplied what happens on this output wire. (We’re applying this model where A actually models itself, given these processes- [as noted before, AIXI models future versions of itself to determine its likely actions, and therefore to determine the optimum action at current time].) So the question is, do you actually learn this model? If you have a bunch of observations, what sort of model of the world do you acquire?

[12:00]

This is obviously a very complicated model, because physics is pretty complicated, you know? But you still have to learn it [physics]. So let’s call this Model 1. Here’s Model 2; Model 2 is just physics. In this model, physics supplies values on in, and that’s all. We were imagining that physics does something [in the world to supply an input], and the magic of A supplies the output. But we can now imagine that physics will apply on A, too, which is true, given a reasonable model of physics. So, this model produces the exact same output as Model 1, because both are in agreement with reality, as long as the characterization of what A’s doing remains valid.

[13:14]

So, Models 1 and 2 make the same predictions. One observation is that Model 1 is a really odd model. The “takes values on out” is kind of a really arbitrary addendum. If you were a human trying to model the world, then you could be like “Well, my mind is behind a curtain, and there’s this magical, non-physical law where my mind directly controls what happens in my muscles.”
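
One consequence of two models making the same predictions can be sketched in Python: conditioning on observations never changes their relative weight, so only the prior (description length) separates them. The lengths below are made-up numbers, and the assumption that Model 1 is the longer description is an illustration, not a claim from the talk. The names `LENGTHS`, `predict`, and `posterior` are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List

# Hypothetical description lengths in bits. The only assumption is that
# Model 1 (physics plus a separate rule for A's output) is longer than
# Model 2 (physics alone); the actual numbers are made up.
LENGTHS: Dict[str, int] = {"model_1": 12, "model_2": 10}

def predict(model: str, t: int) -> int:
    """Both models agree with reality, so both predict the same input."""
    return t % 2  # stand-in for whatever reality does at step t

def posterior(observations: List[int]) -> Dict[str, Fraction]:
    """Weight each model by 2^-length, drop inconsistent ones, renormalize."""
    weights = {
        m: Fraction(1, 2 ** n)
        for m, n in LENGTHS.items()
        if all(predict(m, t) == o for t, o in enumerate(observations))
    }
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

before = posterior([])
after = posterior([0, 1, 0, 1, 0])
```

Under these assumed lengths, `before` and `after` are identical: no amount of data that both models predict can shift weight from one to the other.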

[13:59]

Or you could just get rid of that. And once you have a good enough understanding of physics, this [magical description of A] doesn’t have any explanatory value.