Yup, this all sounds good to me. I think the trick is not to avoid alien concepts, but to make your alignment scheme also learn ways of representing the world that are close enough to how we want the world to be modeled.
I think I agree. To the extent that a ‘world model’ is an appropriate abstraction, I think the levers to pull for resolving world model mismatches seem to be:
Post-facto: train an already capable (prosaic?) AI, via a clever training mechanism, to explain itself in a way that accounts for world model mismatches, and hope that only accessible consequences matter for preserving human option value; or
Ex-ante: build AI systems in an architecturally transparent manner such that properties of their world model can be inspected and tuned, and hope that the training process makes these AI systems competitive.
I think you are advocating for the latter, or have I misrepresented the levers?
Maybe I don’t see a bright line between these things. Adding an “explaining module” to an existing AI and then doing more training is not so different from designing an AI that has an “explaining module” from the start. And training an AI with an “explaining module” isn’t so different from training an AI with a “making sure internal states are somewhat interpretable” module.
I’m probably advocating something close to “Ex-ante,” but with lots of learning, including learning that informs the AI what features of the world we want it to make interpretable to us.
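To make the "no bright line" point concrete, here is a minimal, purely illustrative sketch (not anyone's actual proposal) of a network trained with an auxiliary "explaining module": a shared representation feeds both a task head and an explanation head that is supervised toward human-legible features. Whether that head is bolted onto an existing model and trained further, or present from the start, the training signal is the same kind of thing. All names and dimensions here are hypothetical.

```python
# Hypothetical sketch of joint training with an auxiliary "explanation head".
import torch
import torch.nn as nn

class ExplainableModel(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, d_out=10, d_explain=16):
        super().__init__()
        # Shared internal representation (the "world model" in this toy picture).
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        # Main task head.
        self.task_head = nn.Linear(d_hidden, d_out)
        # Auxiliary head mapping internal state to human-legible target features
        # (standing in for "features of the world we want made interpretable to us").
        self.explain_head = nn.Linear(d_hidden, d_explain)

    def forward(self, x):
        h = self.encoder(x)
        return self.task_head(h), self.explain_head(h)

model = ExplainableModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch: inputs, task labels, and supervision targets for the explanation head.
x = torch.randn(8, 32)
y_task = torch.randint(0, 10, (8,))
y_explain = torch.randn(8, 16)

task_logits, explanation = model(x)
# Joint loss: task performance plus a (weighted) interpretability term.
loss = nn.functional.cross_entropy(task_logits, y_task) \
       + 0.1 * nn.functional.mse_loss(explanation, y_explain)
opt.zero_grad()
loss.backward()
opt.step()
```

Under this framing, "post-facto" and "ex-ante" differ mainly in when the auxiliary loss is introduced, not in what is being optimized.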