It’s plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven’t had much experience with before, but nonetheless, because they aren’t made up of two clean parts (mesa objective + capabilities), there isn’t a single obvious mesa objective that the AI system is optimizing for off distribution.
Coming back to this, can you give an example of the kind of thing you’re thinking of (in humans, animals, current ML systems)? Or is there some other reason you think this could be the case in the future?
Also, do you think this will be significantly more efficient than “two clean parts (mesa objective + capabilities)”? (If not, it seems like we can use inner alignment techniques, e.g., transparency and verification, to force the model into “two clean parts” if that’s better for safety.)
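To make the contrast under discussion concrete, here is a minimal illustrative sketch (not from the original thread; all names and details are hypothetical) of a “two clean parts” agent versus a “messy” agent with no separable mesa objective:

```python
from typing import Callable, Iterable, List, Optional


def clean_two_part_agent(
    mesa_objective: Callable[[List[str]], float],
    candidate_plans: Iterable[List[str]],
) -> List[str]:
    """Two clean parts: an explicit mesa objective plus general-purpose search.

    Off distribution, behavior is still whatever scores highest under
    mesa_objective, so there is a single obvious objective to point to.
    """
    return max(candidate_plans, key=mesa_objective)


def messy_agent(observation: str) -> str:
    """No clean separation: behavior emerges from a pile of learned heuristics.

    There is no single objective term to extract; what the system "wants"
    off distribution is just whatever these heuristics happen to add up to.
    """
    heuristics: List[Callable[[str], Optional[str]]] = [
        lambda obs: "explore" if "novel" in obs else None,
        lambda obs: "imitate" if "demonstration" in obs else None,
    ]
    for heuristic in heuristics:
        action = heuristic(observation)
        if action is not None:
            return action
    return "do_nothing"
```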
Coming back to this, can you give an example of the kind of thing you’re thinking of (in humans, animals, current ML systems)?
Humans don’t seem to have one mesa objective that we’re optimizing for. Even in this community, we tend to be uncertain about what our actual goal is, and most other people don’t even think about it. Humans do lots of things that look like “changing their objective”, e.g. maybe someone initially wants to have a family but then realizes they want to devote their life to public service because it’s more fulfilling.
Also, do you think this will be significantly more efficient than “two clean parts (mesa objective + capabilities)”?
I suspect it would be more efficient, but I’m not sure. (Mostly this is because humans and animals don’t seem to have two clean parts, but quite plausibly we’ll do something more interpretable than evolution and that will push towards a clean separation.) I also don’t know whether it would be better for safety to have it split into two clean parts.
Humans do lots of things that look like “changing their objective” [...]
That’s true, but unless the AI is doing something like human imitation or metaphilosophy (in other words, unless we have some reason to think that the AI will converge to the “right” values), it seems dangerous to let it “change its objective” on its own. Unless, I guess, it’s doing something like mild optimization or following norms, so that it can’t do much damage even if it switches to a wrong objective, and we can just shut it down and start over. But if it’s as messy as humans are, how would we know that it’s strictly following norms or doing mild optimization, and won’t “change its mind” about that too at some point (kind of like how a human who isn’t very strategic might suddenly have an insight, or read something on the Internet, and decide to become strategic)?
I think overall I’m still confused about your perspective here. Do you think this kind of “messy” AI is something we should try to harness and turn into a safety success story (if so, how?), or do you think it’s a danger that we should try to avoid (which may, for example, require global coordination, since it might be more efficient than safer AIs that do have a clean separation)?
Oh, going back to an earlier comment, I guess you’re suggesting some of each: try to harness it at lower capability levels, and coordinate to avoid it at higher capability levels.
In this entire comment thread I’m not arguing that mesa optimizers are safe, or proposing courses of action we should take to make mesa optimization safe. I’m simply trying to forecast what mesa optimizers will look like if we follow the default path. As I said earlier,
I’m not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.
It’s very plausible that the mesa optimizers I have in mind are even more dangerous, e.g. because they “change their objective”. It’s also plausible that they’re safer, e.g. because they aren’t full-blown explicit EU maximizers and we can “convince” them to adopt goals similar to ours.
Mostly I’m saying these things because I think the picture presented in this sequence is not fully accurate, and I would like it to be more accurate. Having an accurate view of what problems will arise in the future tends to help with figuring out solutions to those problems.