If the information won’t fit into human ways of understanding the world, then we can’t name what it is we’re missing. This always makes examples of “inaccessible information” feel weird to me—like we’ve cheated by even naming the thing we want as if it’s already somewhere in the computer, when instead our first step should be to design a system that represents the thing we want at all.
I think world model mismatches are possibly unavoidable with prosaic AGI, which might reasonably bias one against this AGI pathway. It seems possible that human and AGI world models would be largely similar by default if ‘tasks humans are optimised for’ is a similar set to ‘tasks AGI is optimised for’ and compute is not a performance-limiting factor, but I’m not at all confident that this is likely (e.g. maybe an AGI draws coarser- or finer-grained symbolic Markov blankets). Even if we build systems that represent the things we want, and the things we do to get them, as distinct symbolic entities in the same way humans do, they might fail to be competitive with systems that build their world models in an alien way (e.g. draw Markov blankets around symbolic entities that humans cannot factor into their world models due to processing or domain-specific constraints).
Depending on how one thinks AGI development will happen (e.g. whether the strategy-stealing assumption is important), resolving world model mismatches seems more or less of a priority for alignment. If near-term performance competitiveness heavily influences deployment, I think it’s reasonably likely that prosaic AGI is prioritised and world model mismatches occur by default because, for example, compute is likely a performance-limiting factor for humans on tasks we optimise AGI for, or the symbolic entities humans use are otherwise non-universal. I think AGI might generally require incorporating alien features into world models to be maximally competitive, but I’m very new to this field.
Yup, this all sounds good to me. I think the trick is not to avoid alien concepts, but to make your alignment scheme also learn ways of representing the world that are close enough to how we want the world to be modeled.
I think I agree. To the extent that a ‘world model’ is an appropriate abstraction, I think the levers to pull for resolving world model mismatches seem to be:
Post-facto: train an already capable (prosaic?) AI to explain itself in a way that accounts for world model mismatches via a clever training mechanism and hope that only accessible consequences matter for preserving human option value; or
Ex-ante: build AI systems in an architecturally transparent manner such that properties of their world model can be inspected and tuned, and hope that the training process makes these AI systems competitive.
I think you are advocating for the latter, or have I misrepresented the levers?
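To make the two levers above concrete, here is a minimal, purely illustrative PyTorch-style sketch. Everything in it is hypothetical and not taken from the discussion (the toy Agent, the dimensions, and the stand-in “human features” are all made up): the post-facto lever freezes an already-trained agent and fits an explainer head on top of its latents, while the ex-ante lever trains the agent and an interpretability objective jointly, so the world model itself is shaped to stay legible.

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    """Toy stand-in for a capable AI: observations -> latent world model -> actions."""
    def __init__(self, obs_dim=16, latent_dim=32, act_dim=4):
        super().__init__()
        self.world_model = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        self.policy = nn.Linear(latent_dim, act_dim)

    def forward(self, obs):
        z = self.world_model(obs)  # the agent's internal representation
        return self.policy(z), z

def task_loss(action_logits, obs):
    # Placeholder task objective: pretend the "right" action depends on the observation.
    targets = (obs.sum(dim=1) > 0).long()
    return nn.functional.cross_entropy(action_logits, targets)

def interpretability_loss(explainer, z, obs):
    # Placeholder for "does the latent decode into human-legible features?"
    # Here the 'human features' are just the first 8 observation dims, purely for illustration.
    return nn.functional.mse_loss(explainer(z), obs[:, :8])

obs = torch.randn(64, 16)

# Lever 1, post-facto: the agent is already trained; freeze it and fit only an explainer head.
agent = Agent()
for p in agent.parameters():
    p.requires_grad_(False)
explainer = nn.Linear(32, 8)
opt = torch.optim.Adam(explainer.parameters(), lr=1e-3)
_, z = agent(obs)
opt.zero_grad()
interpretability_loss(explainer, z, obs).backward()
opt.step()  # only the explainer moves; the frozen world model stays as it is

# Lever 2, ex-ante: train agent and interpretability objective together from the start,
# so gradients from the explainer also shape the world model itself.
agent2, explainer2 = Agent(), nn.Linear(32, 8)
opt2 = torch.optim.Adam(list(agent2.parameters()) + list(explainer2.parameters()), lr=1e-3)
action_logits2, z2 = agent2(obs)
loss = task_loss(action_logits2, obs) + 0.1 * interpretability_loss(explainer2, z2, obs)
opt2.zero_grad()
loss.backward()
opt2.step()  # both task performance and the legibility of z2 are optimised jointly
```

In this toy framing the only structural difference between the two levers is which parameters receive gradients from the interpretability term and when, which is roughly the “no bright line” point made in the reply below.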
Maybe I don’t see a bright line between these things. Adding an “explaining module” to an existing AI and then doing more training is not so different from designing an AI that has an “explaining module” from the start. And training an AI with an “explaining module” isn’t so different from training an AI with a “making sure internal states are somewhat interpretable” module.
I’m probably advocating something close to “Ex-ante,” but with lots of learning, including learning that informs the AI what features of the world we want it to make interpretable to us.
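One possible (again entirely hypothetical) reading of “learning that informs the AI what features of the world we want it to make interpretable to us”, continuing the toy setup above: rather than hard-coding which latent features the explainer must expose, fit a relevance weighting from human feedback and let that weighting decide where the interpretability pressure goes. The feedback labels, candidate features, and weighting scheme below are illustrative placeholders, not a proposal from the thread.

```python
import torch
import torch.nn as nn

# Continuing the toy setup: z is the agent's latent, and there are 8 candidate
# human-legible readouts of it. All names and data here are hypothetical.
latent_dim, n_candidates = 32, 8
explainer = nn.Linear(latent_dim, n_candidates)      # decodes latents into candidate features
relevance = nn.Parameter(torch.zeros(n_candidates))  # learned: which features do humans care about?

# Pretend humans labelled each candidate feature as "make this legible" (1) or not (0).
human_feedback = torch.tensor([1., 1., 0., 0., 1., 0., 0., 0.])

# Step 1: learn the relevance weights from the human feedback.
opt_rel = torch.optim.Adam([relevance], lr=1e-2)
for _ in range(200):
    opt_rel.zero_grad()
    nn.functional.binary_cross_entropy_with_logits(relevance, human_feedback).backward()
    opt_rel.step()

# Step 2: use the learned weights inside the ex-ante interpretability loss, so the
# pressure to be legible concentrates on the features humans actually asked for.
z = torch.randn(64, latent_dim)
true_features = torch.randn(64, n_candidates)  # stand-in for ground-truth human features
weights = torch.sigmoid(relevance).detach()
weighted_interp_loss = (weights * (explainer(z) - true_features) ** 2).mean()
```

The hard part, of course, is eliciting the candidate features and the feedback in the first place; the sketch only shows where that learned signal could slot into the ex-ante training objective.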