In this entire comment thread I’m not arguing that mesa optimizers are safe, or proposing courses of action we should take to make mesa optimization safe. I’m simply trying to forecast what mesa optimizers will look like if we follow the default path. As I said earlier,
“I’m not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.”
It’s very plausible that the mesa optimizers I have in mind are even more dangerous, e.g. because they “change their objective”. It’s also plausible that they’re safer, e.g. because they are full-blown explicit EU maximizers and we can “convince” them to adopt goals similar to ours.
Mostly I’m saying these things because I think the picture presented in this sequence is not fully accurate, and I would like it to be more accurate. Having an accurate view of what problems will arise in the future tends to help with figuring out solutions to those problems.