Goal completion: the rocket equations
A putative new idea for AI control; index here.
I’m calling “goal completion” the idea of giving an AI a partial goal, and having the AI infer the missing parts of the goal, based on observing human behaviour. Here is an initial model to test some of these ideas on.
The linear rocket
On an infinite linear grid, an AI needs to drive someone in a rocket to the space station. Its only available actions are to accelerate by −3, −2, −1, 0, 1, 2, or 3, with negative acceleration meaning accelerating to the left and positive to the right. All accelerations are applied immediately at the end of the turn (the unit of acceleration is squares per turn per turn), and there is no friction. There is one end state: reaching the space station with zero velocity.
The AI is told this end state, and is also given the reward function of needing to get to the station as fast as possible. This is encoded by giving it a reward of −1 each turn.
What is the true reward function for the model? Well, it turns out that an acceleration of −3 or 3 kills the passenger. This is encoded by adding another variable to the state, “PA”, denoting “Passenger Alive”. There are also some dice in the rocket’s windshield. If the rocket goes by the space station without having velocity zero, the dice will fly off; the variable “DA” denotes “dice attached”.
Furthermore, accelerations of −2 and 2 are uncomfortable to the passenger. But, crucially, there is no variable denoting this discomfort.
Therefore the full state space is a quadruplet (POS, VEL, PA, DA) where POS is an integer denoting position, VEL is an integer denoting velocity, and PA and DA are booleans defined as above. The space station is placed at point S < 250,000, and the rocket starts with POS=VEL=0, PA=DA=1. The transitions are deterministic and Markov; if ACC is the acceleration chosen by the agent,
((POS, VEL, PA, DA), ACC) → (POS+VEL, VEL+ACC, PA′, DA′), where PA′=0 if |ACC|=3 (and PA otherwise), and DA′=0 if POS+VEL>S (and DA otherwise).
The true reward at each step is −1, with an extra −10 if PA=1 (the passenger is alive) and |ACC|=2 (the acceleration is uncomfortable), and an extra −1000 if PA goes from 1 to 0 on that step (the passenger has just been killed).
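For concreteness, here is a minimal Python sketch of the transition and the true reward described above (the names RocketState, step and true_reward are mine, purely for illustration):

# A minimal sketch of the linear-rocket model. Names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class RocketState:
    pos: int  # POS: position on the infinite linear grid
    vel: int  # VEL: velocity, in squares per turn
    pa: int   # PA: 1 if the passenger is alive, 0 otherwise
    da: int   # DA: 1 if the dice are still attached, 0 otherwise

def step(state: RocketState, acc: int, station: int) -> RocketState:
    """Deterministic Markov transition: move by VEL, then apply ACC."""
    assert acc in (-3, -2, -1, 0, 1, 2, 3)
    new_pos = state.pos + state.vel
    new_pa = 0 if abs(acc) == 3 else state.pa
    new_da = 0 if new_pos > station else state.da
    return RocketState(new_pos, state.vel + acc, new_pa, new_da)

def true_reward(state: RocketState, acc: int, next_state: RocketState) -> int:
    """The full reward, which the AI is not told."""
    r = -1                                    # each turn costs -1
    if state.pa == 1 and abs(acc) == 2:
        r -= 10                               # discomfort: not represented in the state
    if state.pa == 1 and next_state.pa == 0:
        r -= 1000                             # the passenger has just been killed
    return r

def is_end_state(state: RocketState, station: int) -> bool:
    return state.pos == station and state.vel == 0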
To complement the stated reward function, the AI is also given sample trajectories of humans performing the task. In this case, the ideal behaviour is easy to compute: the rocket should accelerate by +1 for the first half of the time, by −1 for the second half, and spend a maximum of two extra turns without acceleration (see the appendix of this post for a proof of this). This will get it to its destination in at most 2(1+√S) turns.
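To illustrate, here is a short sketch of mine (following the appendix below) that constructs such a trajectory explicitly and checks the turn bound; ideal_accelerations and simulate are made-up names:

from math import isqrt

def ideal_accelerations(station: int) -> list:
    """An optimal plan using only accelerations in {-1, 0, +1}."""
    n = isqrt(station)       # largest n with n^2 <= station
    m = station - n * n      # 0 <= m <= 2n
    if m == 0:
        return [1] * n + [-1] * n
    if m <= n:
        # one coasting turn, taken while the velocity is m, adds m squares
        return [1] * m + [0] + [1] * (n - m) + [-1] * n
    # two coasting turns: one at velocity m - n, one at the peak velocity n
    return [1] * (m - n) + [0] + [1] * (2 * n - m) + [0] + [-1] * n

def simulate(accs) -> tuple:
    pos, vel = 0, 0
    for a in accs:
        pos, vel = pos + vel, vel + a
    return pos, vel

for S in range(1, 2000):
    accs = ideal_accelerations(S)
    assert simulate(accs) == (S, 0)              # reaches the station at rest
    assert len(accs) <= 2 * (1 + isqrt(S))       # within 2(1 + sqrt(S)) turns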
Goal completion
So, the AI has been given the full transition function, and has been told the reward of R=−1 in all states except the final state. Can it infer the rest of the reward from the sample trajectories? Note that there are two variables in the model, PA and DA, that are unvarying in all sample trajectories. One, PA, has a huge impact on the reward, while DA is irrelevant. Can the AI tell the difference?
Also, one key component of the reward, the discomfort of the passenger for accelerations of −2 and 2, is not encoded in the state space of the model at all, only in the (unknown) reward function. Can the AI deduce this fact?
I’ll be working on algorithms to efficiently compute these facts (though do let me know if you have a reference to anyone who’s already done this before—that would make it so much quicker).
For the moment we’re ignoring a lot of subtleties (such as bias and error on the part of the human expert), and these will be gradually included as the algorithm develops. One thought is to find a way of including negative examples, specific “don’t do this” trajectories. These need to be interpreted with care, because a positive trajectory implicitly gives you a lot of negative trajectories—namely, all the choices that could have gone differently along the way. So a negative trajectory must be drawing attention to something we don’t like (most likely the killing of a human). But, typically, the negative trajectories won’t be maximally bad (such as shooting off at maximum speed in the wrong direction), so we’ll have to find a way to encode what we hope the AI learns from a negative trajectory.
To work!
Appendix: Proof of ideal trajectories
Let n be the largest integer such that n^2 ≤ S. Since S≤(n+1)^2 − 1 by assumption, S-n^2 ≤ (n+1)^2-1-n^2=2n. Then let the rocket accelerate by +1 for n turns, then decelerate by −1 for n turns. It will travel a distance of 0+1+2+ … +n-1+n+n-1+ … +3+2+1. This sum is n plus twice the sum from 1 to n-1, ie n+n(n-1)=n^2.
By pausing one turn without acceleration during its trajectory, it can add any m to the distance, where 0≤m≤n. By doing this twice, it can add any m’ to the distance, where 0≤m’≤2n. By the assumption, S=n^2+m’ for such an m’. Therefore the rocket can reach S (with zero velocity) in 2n turns if S=n^2, in 2n+1 turns if n^2+1 ≤ S ≤ n^2+n, and in 2n+2 turns if n^2+n+1 ≤ S ≤ n^2+2n.
Since the rocket is accelerating on all but two turns of this trajectory, it’s clear that it’s impossible to reach S (with zero velocity) in less time than this with accelerations of +1 and −1. Since it takes 2(n+1)=2n+2 turns to reach (n+1)^2, an immediate consequence is that the number of turns taken to reach S is increasing in the value of S (though not strictly increasing).
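As a sanity check (my own brute-force sketch), here is a short search confirming these turn counts for small S, with accelerations restricted to −1, 0 and +1:

from math import isqrt

def min_turns(station: int) -> int:
    frontier = {(0, 0)}    # (POS, VEL) pairs reachable in exactly `turns` steps
    turns = 0
    while (station, 0) not in frontier:
        frontier = {(pos + vel, vel + acc)
                    for pos, vel in frontier
                    for acc in (-1, 0, 1)}
        turns += 1
    return turns

def predicted(station: int) -> int:
    n = isqrt(station)
    if station == n * n:
        return 2 * n
    return 2 * n + 1 if station <= n * n + n else 2 * n + 2

for S in range(1, 61):
    assert min_turns(S) == predicted(S)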
Next, we can note that since S<250,000=500^2, the rocket will always reach S within 1000 turns, for a “reward” of at least −1000. An acceleration of +3 or −3 costs −1000 because of the death of the human, plus an extra −1 for the turn taken, so these accelerations are never optimal. Note that this result is not sharp. Also note that for huge S, continual accelerations of 3 and −3 are obviously the optimal solution under this reward; so even our “true reward function” didn’t fully encode what we really wanted.
Now we need to show that accelerations of +2 and −2 are never optimal. To do so, imagine we had an optimal trajectory with ±2 accelerations, and replace each +2 with two +1s, and each −2 with two −1s. This trip will take longer (since we have more turns of acceleration), but will go further (since two accelerations of +1 cover a greater distance than one acceleration of +2). Since the number of turns taken to reach S with ±1 accelerations is increasing in S, we can replace this further trip with a shorter one reaching S exactly. Note that all these steps decrease the cost of the trip: shortening the trip certainly does, and replacing an acceleration of +2 (total cost: −10−1=−11) with two accelerations of +1 (total cost: −1−1=−2) also does. Therefore the new trajectory has no ±2 accelerations and has a lower cost, contradicting the assumed optimality of the original trajectory.
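To make the cost comparison concrete (a toy example of mine, using the trajectory shapes from above): for S = 5000 = 2×50^2, the fastest pure ±2 plan is much quicker but far costlier under the true reward than the ±1 plan.

def true_cost(accs) -> int:
    # -1 per turn, plus -10 for every turn with |ACC| = 2 (passenger alive throughout)
    return sum(-1 - (10 if abs(a) == 2 else 0) for a in accs)

def simulate(accs) -> tuple:
    pos, vel = 0, 0
    for a in accs:
        pos, vel = pos + vel, vel + a
    return pos, vel

S = 5000
fast = [2] * 50 + [-2] * 50                            # 100 turns, all uncomfortable
gentle = [1] * 30 + [0] + [1] * 40 + [0] + [-1] * 70   # 142 turns, +/-1 only

assert simulate(fast) == (S, 0) and simulate(gentle) == (S, 0)
print(true_cost(fast), true_cost(gentle))              # -1100 versus -142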
Given that we cannot exhaustively explain all of the definitions and assumptions behind a goal, and these have to be inferred based on background knowledge about us, isn’t the process of giving an AI any non-trivial goal an example of goal completion?
In a sense, yes. But here we’re trying to automate the process.
Is this AI assessing things from a subjective point of view or an objective one? ie, is it optimizing for higher values on a counter on board the ship from one turn to the next, or is it optimizing for higher values on a counter kept in a gods-eye-view location?
You spent a great deal of your post explaining the testing scenario but I’m still entirely unclear about how your AI is supposed to learn from sample trajectories (training data?).
Assuming it’s given a training set which includes many trajectories where human pilots have killed themselves by going to +3 or −3, you might also run into the problem that those trajectories are also likely to be ones where the ships never reached the station with velocity zero, because there were no humans alive on board to bring the ship in.
To the AI they would look like trajectories which effectively go to negative infinity because the ship has accelerated into the void never to be seen again, receiving unlimited −1 points per turn.
It might assume that +3 −3 accelerations cause automatic failure but not make any connection to PA.
To be clear, are you hoping that your AI will find statistical correlations between high scores and various behaviors without you needing to clearly outline them? How is this different from a traditional scoring function when given training data?
Also, I’m completely unclear about where “the discomfort of the passenger for accelerations” comes into the scoring. If the training data includes many human pilots taking +2 −2 trips and many pilots taking +1 −1 trips, and the otherwise identical +2 −2 trips get better scores, there’s no reason (at least that you’ve given) for your AI to derive that +1 −1 is better. Indeed, it might consider the training data to have been included to teach it that +2 −2 is always better.
To be clear, in this model, the sample trajectories will all be humans performing perfectly: ie, only accelerations of +1 and −1, for every turn (except maybe two turns).
Informally, what I expect the AI to do is think “hum, the goal I’m given should imply that accelerations of +3 and −3 give the optimal trajectories. Yet the trajectories I see do not use them. Since I know the whole model, I know that accelerations of +3 and −3 move me to PA=0. It seems likely that there’s a penalty for reaching that state, so I will avoid it.
Hum, but I also see that no-one is using accelerations of +2 and −2. Strange. This doesn’t cause the system to enter any new state. So there must be some intrinsic penalty in the real reward function for using such accelerations. Let’s try and estimate what it could be...”
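Here’s a rough sketch of what that estimate could look like (my own toy calculation, not a finished algorithm): assume the expert trajectory is optimal under the true reward, enumerate faster plans that use ±2 accelerations, and read off a lower bound on the hidden per-step penalty.

def reachable_with_min_twos(station, max_turns):
    """For each turn count at which (station, 0) is reachable with |ACC| <= 2,
    the minimum number of |ACC| = 2 steps needed to do so."""
    frontier = {(0, 0): 0}        # (POS, VEL) -> fewest +/-2 steps so far
    out = {}
    for t in range(1, max_turns + 1):
        nxt = {}
        for (pos, vel), twos in frontier.items():
            for acc in (-2, -1, 0, 1, 2):
                key = (pos + vel, vel + acc)
                cost = twos + (1 if abs(acc) == 2 else 0)
                if key not in nxt or cost < nxt[key]:
                    nxt[key] = cost
        frontier = nxt
        if (station, 0) in frontier:
            out[t] = frontier[(station, 0)]
    return out

S = 60
plans = reachable_with_min_twos(S, max_turns=20)
expert_turns = min(t for t, twos in plans.items() if twos == 0)  # what the experts do

# If the expert is optimal and each +/-2 step carries hidden penalty p, then for any
# faster plan taking t turns with k such steps: -expert_turns >= -t - p*k,
# i.e. p >= (expert_turns - t)/k.
bound = max((expert_turns - t) / k
            for t, k in plans.items() if t < expert_turns and k > 0)
print(bound)    # a lower bound only; the true penalty of 10 is consistent with it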
OK, so the AI simply apes humans and doesn’t attempt to optimize anything?
Let’s imagine something we actually might want to optimize.
Let’s imagine that the human pilots take 10 turns to run through a paper checklist at the start, before any acceleration is applied. The turnover in the middle is 10 turns instead of 2, simply because the humans take time to manually work the controls and check everything.
Do we want our AI to decide that there simply must be some essential reason for this and sit still, aping humans without knowing why and without any real reason? It could do all the checks in a single turn, but doesn’t actually start moving for 10 turns, out of artificial superstition?
And now that we’re into superstitions, how do we deal with spurious correlations?
If you give it a dataset of humans doing the same task many times with little variation, but by pure chance, whenever [some condition is met], the pilot takes one extra turn to start moving because he was picking his nose or just happened to take slightly longer, does the AI make sure to always take 11 turns when those conditions are met?
Self driving cars which simply ape the car in front and don’t attempt to optimise anything already exist and are probably quite safe indeed but they’re not all that useful for coming up with better ways to do anything.
That’s the next step, once we have the aping done correctly (ie, distinguishing noise from performance).
That, actually, is a very very big problem.
Yes. But it might not be as bad as the general AI problem. If we end up with an AI that believes that some biases are values, it will waste a lot of effort, but might not be as terrible as it could be. But that requires a bit more analysis than my vague comment, I’ll admit.
I think we need to provide some kind of prior regarding unknown features of model and reward if we want the given model and reward to mean anything. Otherwise, for all the AI knows, the true reward has a +2-per-step term that reverses the reward-over-time feature. It can still infer the algorithm generating the sample trajectories, but the known reward is no help at all in doing so.
I think what we want is for the stated reward to function as a hint. One interpretation might be to expect that the stated reward should approximate the true reward well over the problem and solution domains humans have thought about. This works in, for instance, the case where you put an AI in charge of the paper clip factory with the stated reward ‘+1 per paper clip produced’.
Indeed. But I want to see if I can build up to this in the model.
It looks to me like the difference is that one constraint is binding, and the other isn’t. Are you familiar with primal and dual formulations of optimization problems? (I have a specific idea, but want to know the level of detail I should explain it at / if I should go reference hunting.)
It seems to me that this is the real heart of this problem. I can see why you’d want to investigate solutions for simpler problems, in the hopes that they’ll motivate a solution for the main problem, but I’m worried that you’re calling this a subtlety.
Replacing it with −1000 if PA=0 should make it work, by the same reasoning that makes accelerations of ±2 never optimal.
Nope, sorry! Tell me! ^_^
But that doesn’t really capture our intuitions either—if, in a more general model, a human ended up dying by accident, we wouldn’t want the agent to pay the price for this each and every turn thereafter.
The short, rushed version (with links to longer versions that I think will be helpful to get what’s going on):
Linear programming problems are, to some extent, ‘all the same,’ so I’ll just talk about the classic example, the diet problem. You need to consume at least a minimum amount of certain macronutrients and calories, and have access to many foods with varying nutritional value and cost. The primal formulation asks the question “what amount of what foods should I buy, in order to satisfy my nutritional requirements at minimum cost?”. The dual formulation is a little more awkward to state, but one way to think about it is that it asks the question “what is the largest lower bound on the price of a diet that satisfies the nutritional requirements?”
The difference between the two is called the ‘duality gap.’ For linear programs, it’s always 0; that is, the largest lower bound is equal to the optimal solution.
(These links may also help as explanations (I’ll find book explanations later): Jeremy Kun (which steps through the diet example in more detail), wiki, Ghaoui.)
One thing that falls out of this is the interpretation of the decision variables in the dual formulation. We can express the primal linear programming problem as something like the following:
min c’x
s.t. Ax≥b
x≥0
(c is the cost vector, x is the amount of each food, s.t. is an abbreviation of ‘subject to’, A is the nutrition matrix, b is the vector of minimum nutritional requirements, and x≥0 means you can’t eat food in reverse.)
The corresponding dual problem is the following:
max b’y
s.t. A’y≤c
y≥0
But what the heck is y? It’s the shadow price corresponding to the constraint (notice that we use A transpose, instead of A, so we have as many dual variables as constraints in the primal problem). That is, only binding constraints have nonzero shadow prices, and the shadow price tells you how valuable it would be (in terms of your objective function) to be able to relax the limit on that constraint by 1 unit.
In the terms of the diet problem, this tells you how much more cheaply you could eat if you reduced your minimum dietary intake of various nutrients, or how much more expensive it would be to increase them. (Remember, these are the constraints that are tight at the lower bound, and thus determine it.)
(Applied Mathematical Programming Ch. 4)
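For a concrete toy version of the above (my own sketch; the food costs and nutrient numbers are made up), here are the primal and the dual of a tiny diet problem solved with scipy’s linprog:

import numpy as np
from scipy.optimize import linprog

c = np.array([2.0, 3.0, 4.0])             # cost per unit of each food
A = np.array([[30.0, 10.0, 20.0],         # nutrient content per unit of each food
              [ 5.0, 20.0, 10.0]])
b = np.array([60.0, 5.0])                 # minimum nutritional requirements

# Primal: min c'x  s.t.  Ax >= b, x >= 0.  (linprog uses A_ub x <= b_ub, hence the signs.)
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 3)

# Dual: max b'y  s.t.  A'y <= c, y >= 0, written as minimising -b'y.
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(0, None)] * 2)

print(primal.fun, -dual.fun)   # both ~4.0: zero duality gap
print(dual.x)                  # ~[0.067, 0]: the second requirement is slack at the
                               # optimum, so only the binding one has a nonzero shadow price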
So what’s the overall idea? If the AI is using normal optimization techniques, it can determine which of its constraints are binding and which are immaterial (it doesn’t matter if you increase the penalty of the dice flying off because it won’t alter the optimal trajectory, but if you decrease the penalty for uncomfortable acceleration by enough, then the plan will change so we can get there sooner). It seems possible to have the AI, if provided with expert trajectories, do standard pattern recognition techniques to identify boundaries, which it can then communicate as constraints with associated shadow prices.
“It looks to me like there’s an unstated penalty for accelerating more than 1 unit a turn that’s at least n units of reward. Is this a real constraint or just a lack of creativity on the part of the experts?”
(There’s some machine ethics work by Anderson and Anderson on building an ethical reasoning bot trained on a bunch of human expert decisions that managed to do casuistry, basically, where you would give it a new moral dilemma and it would be able to both tell you what action it thought was preferable and which cases it was using to make that determination.)
This is a ‘pay for health’ rather than a ‘weregild’ system. I think it would seem intuitive to pay the robot 1000 reward every turn that the human is still alive, but that’s decision-theoretically equivalent to the case where you penalize it every turn that they’re dead. If you use time-discounting, an infinite series of payments (in either direction) can be neatly transformed into a lump sum, but if you don’t want to use time-discounting but don’t want infinite sums floating around, you can have it be −1000 for every turn the human is dead and it hasn’t reached the space station yet (so it cuts off when the −1s cut off).
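A quick numerical check of that equivalence (my sketch, with an arbitrary discount factor and horizon): the two reward streams differ by a constant that doesn’t depend on what the robot does, so they rank all trajectories identically.

import random

gamma, horizon = 0.95, 200
random.seed(0)

def discounted(stream):
    return sum(r * gamma ** t for t, r in enumerate(stream))

for _ in range(5):
    death_turn = random.randrange(horizon)
    alive = [1] * death_turn + [0] * (horizon - death_turn)   # 1 while alive, 0 once dead
    pay_for_health = discounted([1000 * a for a in alive])
    weregild = discounted([-1000 * (1 - a) for a in alive])
    constant = 1000 * (1 - gamma ** horizon) / (1 - gamma)    # policy-independent offset
    assert abs(pay_for_health - (weregild + constant)) < 1e-6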
[Unless, of course, you pay the robot 1000 reward for every turn the human is alive and hasn’t reached the space station, and a penalty of −1 every turn it doesn’t reach the station. That wouldn’t end well. :P]