A sufficiently strong world model can answer the question “What would a very smart very good person think about X?” and then you can just pipe that to the decision output, but that won’t get you higher intelligence than what was present in the training material.
Shouldn’t human goals have to fall within the human-intelligence part, since humans have them? Or are we considering an AI of exactly human intelligence to be unsafe? Do you imagine a slightly dumber version of yourself failing to actualise your goals because it lacks good strategies, or failing even to embed them because its world model lacks definitions of the objects you care about?
Corrigibility has to be in the reachable part of goal space, because a well-trained dog genuinely wants to do what you want it to do, even when it doesn’t fully understand, and even when it can tell that following the command will get it less food than disobeying would. You clearly don’t need human intelligence to represent the terminal goal “Do what the humans want me to do”, although it’s not clear the goal will stay in place as intelligence rises above the human level.
Yes, I would consider humans to already be unsafe, as we already made a sharp left turn that left us unaligned relative to our outer optimiser.
Dogs are a good point, thank you for that example. I’m not sure dogs have our exact notion of corrigibility, but they definitely seem to be friendly in some relevant sense.