To be clear, in this model, the sample trajectories will all be humans performing perfectly, i.e. only accelerations of +1 and −1, for every turn (except maybe two turns).
Informally, what I expect the AI to do is think: “Hmm, the goal I’m given should imply that accelerations of +3 and −3 give the optimal trajectories. Yet the trajectories I see don’t use them. Since I know the whole model, I know that accelerations of +3 and −3 move me to PA=0. It seems likely that there’s a penalty for reaching that state, so I will avoid it.
Hmm, but I also see that no one is using accelerations of +2 and −2. Strange. Those don’t cause the system to enter any new state, so there must be some intrinsic penalty in the real reward function for using such accelerations. Let’s try to estimate what it could be...”
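For concreteness, here is a minimal sketch of that second inference step, my own illustration rather than the actual algorithm: given demonstrations that only ever use accelerations of +1 and −1, fit a penalty on |acceleration| > 1 under a softmax-rational model of the expert. The stand-in task value, the rationality parameter beta, and the weak prior are all assumptions made purely for the example.

```python
import numpy as np

ACTIONS = np.array([-3, -2, -1, 1, 2, 3])

def q_task(actions):
    # Stand-in task value: under the stated goal alone, larger |acceleration| looks better.
    return np.abs(actions).astype(float)

def log_likelihood(penalty, demos, beta=5.0):
    # Softmax-rational expert over q = task value minus the penalty for |a| > 1.
    q = q_task(ACTIONS) - penalty * (np.abs(ACTIONS) > 1)
    log_probs = beta * q - np.log(np.exp(beta * q).sum())
    idx = np.searchsorted(ACTIONS, demos)
    return log_probs[idx].sum()

def fit_penalty(demos, grid=np.linspace(0, 10, 201), prior_weight=0.05):
    # MAP estimate with a weak Gaussian prior, so the penalty doesn't just run off
    # to the edge of the grid: the data only says "the penalty is at least this big".
    scores = [log_likelihood(p, demos) - prior_weight * p ** 2 for p in grid]
    return grid[int(np.argmax(scores))]

# 100 turns of human demonstrations using only +1 and -1:
demos = np.array([1] * 50 + [-1] * 50)
print(fit_penalty(demos))  # infers a substantial penalty on accelerations of +/-2 and +/-3
```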
OK, so the AI simply apes humans and doesn’t attempt to optimize anything?
Let’s imagine something we might actually want to optimize.
Say the human pilots take 10 turns to run through a paper checklist at the start, before any acceleration is applied, and the turnover in the middle takes 10 turns instead of 2 simply because the humans take time to manually work the controls and check everything.
Do we want our AI to decide that there simply must be some essential reason for this and sit still, aping humans without knowing why and without any real reason? It could do all the checks in a single turn, yet it doesn’t actually start moving for 10 turns because of an artificial superstition?
And now that we’re into superstitions, how do we deal with spurious correlations?
If you give it a dataset of humans doing the same task many times with little variation, but by pure chance, whenever [some condition is met], the pilot takes one extra turn to start moving (because he was picking his nose, or just happened to take slightly longer), does the AI make sure to always take 11 turns when those conditions are met?
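To make that concrete, here’s a toy illustration of the worry, with made-up numbers and a deliberately naive imitator that just copies the average observed delay per condition:

```python
from collections import defaultdict

# (condition_flag, turns_before_moving): the flag is irrelevant, but by chance the
# pilot happened to take one extra turn whenever it was set.
demos = [(False, 10)] * 40 + [(True, 11)] * 5

delays = defaultdict(list)
for flag, turns in demos:
    delays[flag].append(turns)

# "Learned" policy: copy the average observed delay for each condition.
policy = {flag: round(sum(turns) / len(turns)) for flag, turns in delays.items()}
print(policy)  # {False: 10, True: 11} -- the coincidence is now a rule
```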
Self-driving cars which simply ape the car in front and don’t attempt to optimise anything already exist, and they’re probably quite safe indeed, but they’re not all that useful for coming up with better ways to do anything.
That’s the next step, once we have the aping done correctly (i.e. distinguishing noise from performance).
That, actually, is a very, very big problem.
Yes. But it might not be as bad as the general AI problem. If we end up with an AI that believes that some biases are values, it will waste a lot of effort, but might not be as terrible as it could be. But that requires a bit more analysis than my vague comment, I’ll admit.