I think it would be… AGI would be a mesa optimizer or inner optimizer, whichever term you prefer. And that that inner optimizer will just sort of have a mishmash of all of these heuristics that point in a particular direction but can’t really be decomposed into ‘here are the objectives, and here is the intelligence’, in the same way that you can’t really decompose humans very well into ‘here are the objectives and here is the intelligence’.
… but it leads to not being as confident in the original arguments. It feels like this should be pushing in the direction of ‘it will be easier to correct or modify or change the AI system’. Many of the arguments for risk are ‘if you have a utility maximizer, it has all of these convergent instrumental subgoals’ and, I don’t know, if I look at humans they kind of sort of pursue convergent instrumental subgoals, but not really.
DF Huh, I see your point as cutting the opposite way. If you have a clean architectural separation between intelligence and goals, I can swap out the goals. But if you have a mish-mash, then for the same degree of vNM rationality (which maybe you think is unrealistic), it’s harder to do anything like ‘swap out the goals’ or ‘analyse the goals for trouble’.
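(To make the ‘swap out the goals’ contrast concrete, here is a minimal sketch using a hypothetical planner interface of my own, not anything either of us has built: with a clean separation the goal is literally an argument you can replace or inspect, while a mish-mash gives you no such handle.)

```python
# Hypothetical illustration: with a clean intelligence/goals separation,
# the goal is an explicit, swappable argument.

def plan(world_model, goal, candidate_plans):
    """'Intelligence' module: return whichever candidate plan the world
    model predicts will best satisfy the supplied goal function."""
    return max(candidate_plans, key=lambda p: goal(world_model(p)))

# Swapping or inspecting goals is just passing a different function:
def maximize_paperclips(outcome):
    return outcome.get("paperclips", 0)

def maximize_happiness(outcome):
    return outcome.get("happiness", 0)

# By contrast, a mesa-optimizer that is a mish-mash of learned heuristics is
# more like a single policy(observation) -> action whose 'goals' are smeared
# across its weights: there is no goal argument to swap out or analyse.
```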
in general, I think the original arguments are:
(a) for a very wide range of objective functions, you can have agents that are very good at optimising them
(b) convergent instrumental subgoals are scary
I think ‘humans don’t have scary convergent instrumental subgoals’ is an argument against (b), but I don’t think (a) or (b) rely on a clean architectural separation between intelligence and goals.
RS I agree both (a) and (b) don’t depend on an architectural separation. But you also need (c): agents that we build are optimizing some objective function, and I think my point cuts against that
DF somewhat. I think you have a remaining argument of ‘if we want to do useful stuff, we will build things that optimise objective functions, since otherwise they randomly waste resources’, but that’s definitely got things to argue with.
(Looking back on this, I’m now confused why Rohin doesn’t think mesa-optimisers would end up being approximately optimal for some objective/utility function)
I predict that Rohin would say something like “the phrase ‘approximately optimal for some objective/utility function’ is basically meaningless in this context, because for any behaviour, there’s some function which it’s maximising”.
You might then limit yourself to the set of functions that defines tasks that are interesting or relevant to humans. But then that includes a whole bunch of functions which define safe bounded behaviour as well as a whole bunch which define unsafe unbounded behaviour, and we’re back to being very uncertain about which case we’ll end up in.
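(To spell out the ‘for any behaviour, there’s some function which it’s maximising’ step, here is one standard construction, roughly the sort of thing that post points at. For a deterministic policy π, define an indicator utility over trajectories:)

```latex
u(\tau) =
\begin{cases}
  1 & \text{if every action in } \tau \text{ is the action } \pi \text{ would take at that point,} \\
  0 & \text{otherwise.}
\end{cases}
```

(π then attains expected utility 1, the maximum possible, so it is ‘optimal for some utility function’ no matter what it does; the phrase only starts ruling things out once you restrict which utility functions count.)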
That would probably be part of my response, but I think I’m also considering a different argument.
The thing that I was arguing against was “(c): agents that we build are optimizing some objective function”. This is importantly different from “mesa-optimisers [would] end up being approximately optimal for some objective/utility function” when you consider distributional shift.
It seems plausible that the agent could look like it is “trying to achieve” some simple utility function, and perhaps it would even be approximately optimal for that simple utility function on the training distribution. (Simple here is standing in for “isn’t one of the weird meaningless utility functions in ‘Coherence arguments do not imply goal-directed behavior’, and looks more like ‘maximize happiness’ or something like that”.) But if you then take this agent and place it in a different distribution, it wouldn’t do all the things that an EU maximizer with that utility function would do; it might only do some of them, because it isn’t internally structured as a search process over sequences of actions that lead to high utility. It is instead structured as a bunch of heuristics that were selected for high utility on the training environment, and those heuristics may or may not work well in the new setting.
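(A toy sketch of the structural difference being described here, using my own made-up example rather than any real training setup: both agents below look approximately optimal on the training distribution, but only the explicit searcher keeps behaving like an EU maximizer once the distribution shifts.)

```python
from itertools import product

# Toy 1-D world: an agent at an integer position takes `horizon` moves;
# 'utility' is negative distance to a target position.
ACTIONS = (-1, 0, +1)

def utility(final_pos, target):
    return -abs(final_pos - target)

def eu_maximizer(start, target, horizon=4):
    """Explicitly searches over action sequences and returns the one with
    the best predicted utility -- the 'search process' structure."""
    return max(product(ACTIONS, repeat=horizon),
               key=lambda seq: utility(start + sum(seq), target))

def heuristic_agent(start, target, horizon=4):
    """A bundle of heuristics selected on a training distribution where the
    target always happened to be to the right: just keep moving right.
    It ignores `target`, yet looks optimal on the training distribution."""
    return (+1,) * horizon

# On-distribution (target to the right), the two are indistinguishable:
assert eu_maximizer(0, 4) == heuristic_agent(0, 4) == (1, 1, 1, 1)

# Off-distribution (target to the left), only the searcher does what an
# EU maximizer 'would do'; the heuristic bundle keeps marching right.
print(eu_maximizer(0, -4))      # (-1, -1, -1, -1)
print(heuristic_agent(0, -4))   # (1, 1, 1, 1)
```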
(In my head, the Partial Agency sequence is meandering towards this conclusion, though I don’t think that’s actually true.)
(I think people have overupdated on “what Rohin believes” from the coherence arguments post—I do think that powerful AI systems will be agent-ish, and EU maximizer-ish, I just don’t think that it is going to be a 100% EU maximizer that chooses actions by considering reasonable sequences of actions and doing the one with the best predicted consequences. With that post, I was primarily arguing against the position that EU maximization is required by math.)