I intentionally left out the details of “what do we do with it” because it’s conceptually orthogonal to goal agnosticism and is a huge topic of its own. It comes down to the class of solutions enabled by having extreme capability that you can actually use without it immediately backfiring.
For example, I think this has a real shot at leading to a strong and intuitively corrigible system. I say “intuitively” here because the corrigibility doesn’t arise from a concise mathematical statement that solves the original formulation. Instead, it lets us aim the system at an incredibly broad and complex indirect specification that gets us all the human messiness we want.