[Eli’s personal notes. Feel free to ignore or engage.]
We suggest that an important factor in the answer to this question is whether the AGI system was built using human modelling or not. If it produced a solution to the transit design problem (that humans approve of) without human modelling, then we would more readily trust its outputs. If it produced a solution we approve of with human modelling, then although we expect the outputs to be in many ways about good transit system design (our actual preferences) and in many ways suited to being approved by humans, to the extent that these two targets come apart we must worry about having overfit to the human model at the expense of the good design. (Why not the other way around? Because our assessment of the sandboxed results uses human judgement, not an independent metric for satisfaction of our actual preferences.)
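A toy sketch of the worry, in code (everything here is my own made-up illustration, not anything from the post): suppose a search scores candidate designs with an imperfect model of human approval rather than with the true quality metric. The candidate that looks best to the approval model can then be substantially worse than the candidate that is actually best, and nothing inside the search flags the gap.

```python
# Toy illustration (my own invented example, not from the post): a search that
# optimizes an imperfect "human approval" model can diverge from one that
# optimizes true quality. The functions and constants below are arbitrary.
import random

random.seed(0)

def true_quality(design):
    # Stand-in for how good the design actually is (unknown to the optimizer).
    return -(design - 3.0) ** 2

def modeled_approval(design):
    # Imperfect model of human judgement: correlated with true quality,
    # but with a systematic bias that rewards a flashy-but-worse region.
    return true_quality(design) + 5.0 * max(0.0, design - 4.0)

candidates = [random.uniform(0.0, 10.0) for _ in range(10_000)]

best_by_approval = max(candidates, key=modeled_approval)
best_by_quality = max(candidates, key=true_quality)

print("picked by approval model:", round(best_by_approval, 2),
      "-> true quality", round(true_quality(best_by_approval), 2))
print("picked by true quality:  ", round(best_by_quality, 2),
      "-> true quality", round(true_quality(best_by_quality), 2))
# The approval-maximizing design lands near 5.5 (true quality about -6),
# while the genuinely best design is near 3 (true quality about 0).
```

The specific numbers don’t matter; the point is that the search only ever sees modeled_approval, so “looks good to the model of us” and “is actually good” can quietly come apart.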
Short summary: If an AI system is only modeling the problem that we want it to solve, and it produces a solution that looks good to us, we can be pretty confident that it is actually a good solution.
Whereas, if it is modeling both the problem and us, we can’t be sure where the solution lies on the spectrum from “actually good” solutions to “bad solutions that merely appear good to us.”
Great summary!