I think we did. I agree current methods scaled up could make mesa-optimizers. See my discussion with Wei Dai here for more of my take on this.
I’m not sure I understand the example
I wasn’t trying to suggest the answer to
Could it try to ensure that small changes to its “values” would be relatively inconsequential to its behavior?
was no. As you suggest, it seems like the answer is yes, but it would have to be very careful about this. FWIW, I think it would have more of a challenge preserving any inclination to eventually turn treacherous, but I’m mostly musing here.