I feel confused about the failure story from example 3 (the first three bullet points in that section).
It sounded like: We ask for a human-comprehensible way to predict X; the computer uses a very low-level simulation plus a small bridge that predicts only and exactly X; humans can’t use the model to predict any high-level facts besides X.
But I don’t see how that leads to egregious misalignment. Shouldn’t the humans be able to notice their inability to predict high-level things they care about and send the AI back to its model-search phase? (As opposed to proceeding to evaluate policies based on this model and being tricked into a policy that fails “off-screen” somewhere.)