The long version has a clearer articulation of this point:
For an agent to outperform the process generating its data, it must understand the ways in which that process makes mistakes. So, to outperform humans at a task given only human demonstrations of that task, you need to detect human mistakes in the demonstrations.
So yes, you can achieve performance comparable to that of a human (with both IRL and supervised learning); the hard part is outperforming the human.
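As a toy illustration of what detecting human mistakes buys you (my own sketch with made-up numbers, not something from the original discussion): if the learner assumes an explicit error model for the demonstrator, here Boltzmann rationality, it can recover the underlying reward from noisy demonstrations and then act optimally under it, rather than reproducing the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 5 actions, each described by a 3-dim feature
# vector. The true reward is linear in the features; the "human"
# demonstrator chooses actions Boltzmann-rationally (softmax over true
# rewards), so the demonstrations contain systematic mistakes.
features = rng.normal(size=(5, 3))
true_w = np.array([1.0, -0.5, 2.0])
true_rewards = features @ true_w

def boltzmann_probs(rewards, beta=1.0):
    """Action distribution of a Boltzmann-rational (noisily optimal) chooser."""
    z = beta * rewards
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

# Generate noisy human demonstrations.
demo_probs = boltzmann_probs(true_rewards)
demos = rng.choice(len(true_rewards), size=500, p=demo_probs)

# IRL with an explicit error model: fit reward weights by maximizing the
# Boltzmann likelihood of the demonstrations (plain gradient ascent).
w = np.zeros(3)
for _ in range(2000):
    p = boltzmann_probs(features @ w)
    # Gradient of the average log-likelihood: observed feature expectation
    # minus the model's expected feature expectation.
    grad = features[demos].mean(axis=0) - p @ features
    w += 0.1 * grad

# The agent acts optimally under the inferred reward instead of matching
# the human's action distribution, so it can beat the noisy demonstrator,
# but only because we assumed a model of how the human errs.
agent_action = int(np.argmax(features @ w))
imitation_value = demo_probs @ true_rewards  # expected reward of pure imitation
agent_value = true_rewards[agent_action]
print(f"imitation: {imitation_value:.3f}   IRL agent: {agent_value:.3f}")
```

If you drop the error model and simply match the demonstrator's action frequencies, you are back to imitation and the human-level ceiling.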
Supervised learning could be used to learn a reward function that evaluates states as well as a human would; an agent trained on such a reward function could then outperform humans at actually creating good states. This would happen if humans are better at evaluating states than at creating good ones, which seems plausible.
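To make the evaluate-versus-create distinction concrete, here is a minimal sketch (hypothetical features and a simulated evaluator, not a real experiment): fit a reward model by supervised regression on human evaluations of states, then have the agent search for states that score highly under it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: states are 4-dim feature vectors, and a simulated
# "human" assigns each state a noisy scalar evaluation.
def human_evaluation(state):
    true_quality = state @ np.array([0.5, -1.0, 2.0, 0.0])
    return true_quality + rng.normal(scale=0.3)  # accurate on average, but noisy

states = rng.normal(size=(1000, 4))
labels = np.array([human_evaluation(s) for s in states])

# Supervised learning of the reward model: ordinary least squares.
w_hat, *_ = np.linalg.lstsq(states, labels, rcond=None)

# An agent optimizing the learned reward searches over many candidate
# states and proposes one the human would rate highly, even if the human
# could not have constructed such a state directly; human evaluation is
# being used as a training signal for creation.
candidates = rng.normal(size=(10000, 4))
best_state = candidates[np.argmax(candidates @ w_hat)]
print("learned weights:", np.round(w_hat, 2))
print("best proposed state:", np.round(best_state, 2))
```

Random search over candidates stands in for whatever optimizer the agent would actually use.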
However, I haven't seen a clear story for why the opposite mistake, attributing to the values part things that actually belong to the planner, would cause a catastrophe.
This is the default outcome of IRL; here IRL reduces to imitating a human. If you look at the posts that argue that value learning is hard, they all implicitly or explicitly agree with this point; they’re more concerned with how you get to superhuman performance (presumably because there will be competitive pressure to build superhuman AI systems). It is controversial whether imitating humans is safe (see the Human Models section).
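For concreteness, here is roughly what "IRL reduces to imitating a human" looks like in the simplest case (a hypothetical tabular sketch of my own): the learned policy just reproduces the empirical distribution of the human's actions, mistakes included, so its performance is capped at the demonstrator's.

```python
import numpy as np
from collections import Counter, defaultdict

# Hypothetical (state, action) demonstrations from a human.
demonstrations = [
    ("low_battery", "recharge"), ("low_battery", "recharge"),
    ("low_battery", "keep_working"),              # an occasional human mistake
    ("task_pending", "do_task"), ("task_pending", "do_task"),
]

# Imitation as a tabular policy: reproduce the empirical action
# distribution in each state. With no model of which demonstrations are
# mistakes, the imitator inherits the human's errors.
counts = defaultdict(Counter)
for state, action in demonstrations:
    counts[state][action] += 1

def imitation_policy(state, rng=np.random.default_rng(2)):
    actions, freqs = zip(*counts[state].items())
    probs = np.array(freqs, dtype=float)
    return rng.choice(actions, p=probs / probs.sum())

print(imitation_policy("low_battery"))  # usually "recharge", sometimes the mistake
```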
You could make a speed superintelligence that basically values behaving as similarly as possible to the humans it has observed.
Yeah, the iterated amplification agenda depends on (among other things) a similar hope that it is sufficient to train an AI system that quickly approximates the result of a human thinking for a long time.