Humans do lots of things that look like “changing their objective” [...]
That’s true, but unless the AI is doing something like human imitation or metaphilosophy (in other words, unless we have some reason to think that the AI will converge to the “right” values), it seems dangerous to let it “change its objective” on its own. Unless, I guess, it’s doing something like mild optimization or following norms, so that it can’t do much damage even if it switches to a wrong objective, and we can just shut it down and start over. But if it’s as messy as humans are, how would we know that it’s strictly following norms or doing mild optimization, and won’t “change its mind” about that too at some point (kind of like how a human who isn’t very strategic might suddenly have an insight or read something on the Internet and decide to become strategic)?
I think overall I’m still confused about your perspective here. Do you think this kind of “messy” AI is something we should try to harness and turn into a safety success story (and if so, how), or do you think it’s a danger that we should try to avoid (which may, for example, have to involve global coordination, because such messy AI might be more efficient than safer AIs that do have a clean separation)?
Oh, going back to an earlier comment, I guess you’re suggesting some of each: try to harness it at lower capability levels, and coordinate to avoid it at higher capability levels.
In this entire comment thread I’m not arguing that mesa optimizers are safe, or proposing courses of action we should take to make mesa optimization safe. I’m simply trying to forecast what mesa optimizers will look like if we follow the default path. As I said earlier,
I’m not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.
It’s very plausible that the mesa optimizers I have in mind are even more dangerous, e.g. because they “change their objective”. It’s also plausible that they’re safer, e.g. because they are full-blown explicit EU maximizers and we can “convince” them to adopt goals similar to ours.
Mostly I’m saying these things because I think the picture presented in this sequence is not fully accurate, and I would like it to be more accurate. Having an accurate view of what problems will arise in the future tends to help with figuring out solutions to those problems.