That would probably be part of my response, but I think I’m also considering a different argument.
The thing that I was arguing against was “(c): agents that we build are optimizing some objective function”. This is importantly different from “mesa-optimisers [would] end up being approximately optimal for some objective/utility function” when you consider distributional shift.
It seems plausible that the agent could look like it is “trying to achieve” some simple utility function, and perhaps it would even be approximately optimal for that simple utility function on the training distribution. (Simple here is standing in for “isn’t one of the weird meaningless utility functions in Coherence arguments do not imply goal-directed behavior, and looks more like ‘maximize happiness’ or something like that”.) But if you then place this agent in a different distribution, it wouldn’t do all the things that an EU maximizer with that utility function would do; it might only do some of them. This is because it isn’t internally structured as a search process over sequences of actions that lead to high utility; it is instead structured as a bunch of heuristics that were selected for high utility on the training environment, and those heuristics may or may not work well in the new setting.
(In my head, the Partial Agency sequence is meandering towards this conclusion, though I don’t think that’s actually true.)
(I think people have overupdated on “what Rohin believes” from the coherence arguments post—I do think that powerful AI systems will be agent-ish, and EU maximizer-ish, I just don’t think that it is going to be a 100% EU maximizer that chooses actions by considering reasonable sequences of actions and doing the one with the best predicted consequences. With that post, I was primarily arguing against the position that EU maximization is required by math.)