(Looking back on this, I’m now confused about why Rohin doesn’t think mesa-optimisers would end up being approximately optimal for some objective/utility function)
I predict that Rohin would say something like “the phrase ‘approximately optimal for some objective/utility function’ is basically meaningless in this context, because for any behaviour, there’s some function which it’s maximising”.
You might then limit yourself to the set of functions defining tasks that are interesting or relevant to humans. But that set includes a whole bunch of functions which define safe bounded behaviour as well as a whole bunch which define unsafe unbounded behaviour, and we’re back to being very uncertain about which case we’ll end up in.
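To make that triviality point concrete, here is a minimal sketch (purely illustrative; the names are mine, not anything from the original posts): for any policy whatsoever, you can write down a utility function that it maximizes by construction, so “maximizes some utility function” carries no information on its own.

```python
def make_trivial_utility(policy):
    """Return a utility function that the given policy maximizes by construction."""
    def utility(state, action):
        # Utility is 1 exactly when the action matches what the policy would do.
        return 1.0 if action == policy(state) else 0.0
    return utility

# Example: even a policy that just echoes its input is "optimal" for its own
# hand-crafted utility function, which is why "maximizes some utility function"
# is vacuous until we restrict attention to simple / human-relevant functions.
twitch_policy = lambda state: state
u = make_trivial_utility(twitch_policy)
assert u(3, twitch_policy(3)) == 1.0
assert u(3, "some other action") == 0.0
```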
That would probably be part of my response, but I think I’m also considering a different argument.
The thing that I was arguing against was “(c): agents that we build are optimizing some objective function”. This is importantly different from “mesa-optimisers [would] end up being approximately optimal for some objective/utility function” when you consider distributional shift.
It seems plausible that the agent could look like it is “trying to achieve” some simple utility function, and perhaps it would even be approximately optimal for that simple utility function on the training distribution. (Simple here is standing in for “isn’t one of the weird meaningless utility functions in Coherence arguments do not imply goal-directed behavior, and looks more like ‘maximize happiness’ or something like that”.) But if you then take this agent and place it in a different distribution, it wouldn’t do all the things that an EU maximizer with that utility function would do; it might only do some of them, because it isn’t internally structured as a search over sequences of actions that lead to high utility. Instead, it is structured as a bunch of heuristics that were selected for high utility on the training environment and that may or may not work well in the new setting.
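A rough sketch of the structural distinction I have in mind (purely illustrative; the environment and names are hypothetical): an explicit EU maximizer searches over action sequences under its utility function, while the learned agent just applies heuristics that happened to score well during training. The two can behave identically on the training distribution and come apart as soon as the regularities the heuristics rely on no longer hold.

```python
from itertools import product

def eu_maximizer(initial_state, utility, transition, actions, horizon=3):
    """Explicit search: evaluate every action sequence and return the best one."""
    best_plan, best_value = None, float("-inf")
    for plan in product(actions, repeat=horizon):
        state, value = initial_state, 0.0
        for a in plan:
            state = transition(state, a)
            value += utility(state)
        if value > best_value:
            best_plan, best_value = plan, value
    return best_plan

def heuristic_policy(state):
    """A bag-of-heuristics agent: rules kept because they scored well in training."""
    # "Move right when the goal is to the right" worked on every training episode,
    # but it encodes no notion of utility, so off-distribution it just keeps firing
    # whether or not that still leads anywhere good.
    return "right" if state.get("goal_is_right", True) else "left"
```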
(In my head, the Partial Agency sequence is meandering towards this conclusion, though I don’t think that’s actually true.)
(I think people have overupdated on “what Rohin believes” from the coherence arguments post. I do think that powerful AI systems will be agent-ish and EU-maximizer-ish; I just don’t think they are going to be 100% EU maximizers that choose actions by considering reasonable sequences of actions and doing the one with the best predicted consequences. With that post, I was primarily arguing against the position that EU maximization is required by math.)