The relatively easy problems:
The humans’ utility estimates will be wrong. And not “random noise” kind of wrong, but systematically and predictably wrong.
Applying lots of optimization pressure to the humans’ estimates will predictably Goodhart on the errors in those estimates: the search selects exactly the actions whose estimates are most inflated (see the toy sketch after this list).
… also, actions alone are not “good” or “bad”; tons and tons of context is relevant.
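A minimal sketch of that Goodhart point, under made-up assumptions: each candidate action has a true utility, and the humans’ estimate is off by a systematic error (modeled here by a hypothetical “sounds_impressive” feature that inflates the estimate without making the action any better). “Applying optimization pressure” just means picking the argmax of the estimate.

```python
import numpy as np

# Toy model: many candidate actions, each with a true utility and a
# human estimate. The estimate is wrong in a *systematic* way: a
# hypothetical "sounds_impressive" feature inflates the estimate
# without improving the action at all.
rng = np.random.default_rng(0)
n_actions = 100_000

true_utility = rng.normal(0.0, 1.0, n_actions)
sounds_impressive = rng.normal(0.0, 1.0, n_actions)
estimated_utility = true_utility + 2.0 * sounds_impressive

# Weak optimization pressure: take a random candidate.
random_pick = rng.integers(n_actions)

# Strong optimization pressure: take the candidate with the best estimate.
best_by_estimate = int(np.argmax(estimated_utility))

print(f"random pick         true={true_utility[random_pick]:+.2f}  "
      f"estimated={estimated_utility[random_pick]:+.2f}")
print(f"argmax of estimate  true={true_utility[best_by_estimate]:+.2f}  "
      f"estimated={estimated_utility[best_by_estimate]:+.2f}")
```

Running this, the estimated utility of the argmax-of-estimate action far exceeds its true utility: strong optimization mostly maximized the error term, which is the Goodhart failure the list item above is pointing at. The distributions and the “sounds_impressive” feature are invented for illustration; only the shape of the effect matters.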
The hard problem:
What exactly is the “list of actions”?
Natural-language descriptions of actions? Then what is going to make the humans’ interpretation of those natural-language symbols accurately represent the things the AI actually does?
Examples of actions taken by an AI in a simulation? What is going to make anything learned from those examples generalize well to the physical world during deployment?