michaelcohen comments on Just Imitate Humans?

michaelcohen 29 Jul 2019 19:33 UTC
1 point
(Maybe we mean different things by that term.)
I think we did. I agree that current methods, scaled up, could produce mesa-optimizers. See my discussion with Wei Dai here for more of my take on this.
I’m not sure I understand the example
I wasn’t trying to suggest that the answer to
Could it try to ensure that small changes to its “values” would be relatively inconsequential to its behavior?
was no. As you suggest, it seems like the answer is yes, but it would have to be very careful about this. FWIW, I think it would have more of a challenge preserving any inclination to eventually turn treacherous, but I’m mostly musing here.