I had a chat with Rohin about portions of this interview in an internal slack channel, which I’ll post as replies to this comment (there isn’t much shared state between different threads, I think).
I think it would be… AGI would be a mesa optimizer or inner optimizer, whichever term you prefer. And that that inner optimizer will just sort of have a mishmash of all of these heuristics that point in a particular direction but can’t really be decomposed into ‘here are the objectives, and here is the intelligence’, in the same way that you can’t really decompose humans very well into ‘here are the objectives and here is the intelligence’.
… but it leads to not being as confident in the original arguments. It feels like this should be pushing in the direction of ‘it will be easier to correct or modify or change the AI system’. Many of the arguments for risk are ‘if you have a utility maximizer, it has all of these convergent instrumental sub-goals’ and, I don’t know, if I look at humans they kind of sort of pursued convergent instrumental sub-goals, but not really.
DF Huh, I see your point as cutting the opposite way. If you have a clean architectural separation between intelligence and goals, I can swap out the goals. But if you have a mish-mash, then for the same degree of vNM rationality (which maybe you think is unrealistic), it’s harder to do anything like ‘swap out the goals’ or ‘analyse the goals for trouble’.
in general, I think the original arguments are:
(a) for a very wide range of objective functions, you can have agents that are very good at optimising them
(b) convergent instrumental subgoals are scary
I think ‘humans don’t have scary convergent instrumental subgoals’ is an argument against (b), but I don’t think (a) or (b) rely on a clean architectural separation between intelligence and goals.
RS I agree both (a) and (b) don’t depend on an architectural separation. But you also need (c): agents that we build are optimizing some objective function, and I think my point cuts against that
DF somewhat. I think you have a remaining argument of ‘if we want to do useful stuff, we will build things that optimise objective functions, since otherwise they randomly waste resources’, but that’s definitely got things to argue with.
(Looking back on this, I’m now confused why Rohin doesn’t think mesa-optimisers would end up being approximately optimal for some objective/utility function)
I predict that Rohin would say something like “the phrase ‘approximately optimal for some objective/utility function’ is basically meaningless in this context, because for any behaviour, there’s some function which it’s maximising”.
You might then limit yourself to the set of functions that defines tasks that are interesting or relevant to humans. But then that includes a whole bunch of functions which define safe bounded behaviour as well as a whole bunch which define unsafe unbounded behaviour, and we’re back to being very uncertain about which case we’ll end up in.
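(To make the claim “for any behaviour, there’s some function which it’s maximising” concrete, here is the standard trivial construction, with notation chosen here purely for illustration: for any deterministic policy \(\pi\), define a utility function over trajectories \(\tau\) by
\[
U_\pi(\tau) = \begin{cases} 1 & \text{if every action in } \tau \text{ is the one } \pi \text{ takes at that point,}\\ 0 & \text{otherwise,}\end{cases}
\]
so that \(\pi\) attains the maximum possible expected utility \(\mathbb{E}_\pi[U_\pi(\tau)] = 1\) and is therefore exactly optimal for \(U_\pi\), no matter how incoherent its behaviour looks. “Approximately optimal for some utility function” only constrains behaviour once the class of allowed utility functions is restricted.)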
RS That would probably be part of my response, but I think I’m also considering a different argument.
The thing that I was arguing against was “(c): agents that we build are optimizing some objective function”. This is importantly different from “mesa-optimisers [would] end up being approximately optimal for some objective/utility function” when you consider distributional shift.
It seems plausible that the agent could look like it is “trying to achieve” some simple utility function, and perhaps it would even be approximately optimal for that simple utility function on the training distribution. (“Simple” here is standing in for “isn’t one of the weird meaningless utility functions in ‘Coherence arguments do not imply goal-directed behavior’, and looks more like ‘maximize happiness’ or something like that”.) But if you then take this agent and place it in a different distribution, it wouldn’t do all the things that an EU maximizer with that utility function would do; it might only do some of them, because it isn’t internally structured as a search process over sequences of actions that lead to high utility, but instead as a bunch of heuristics that were selected for high utility on the training environment and may or may not work well in the new setting.
(In my head, the Partial Agency sequence is meandering towards this conclusion, though I don’t think that’s actually true.)
(I think people have overupdated on “what Rohin believes” from the coherence arguments post—I do think that powerful AI systems will be agent-ish, and EU maximizer-ish, I just don’t think that it is going to be a 100% EU maximizer that chooses actions by considering reasonable sequences of actions and doing the one with the best predicted consequences. With that post, I was primarily arguing against the position that EU maximization is required by math.)
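(A toy sketch of the distinction Rohin draws above, assuming nothing beyond a made-up one-step environment: a heuristic “policy” that was retained because it happened to score well in training can be indistinguishable from an explicit utility maximizer on the training distribution, and then come apart under distributional shift. All names and the environment below are invented for illustration.)

```python
# Toy illustration (not from the original discussion): a heuristic that was
# "selected" because it got high reward in training can match an explicit
# utility maximizer on the training distribution, then diverge off it.

# A 1-D world: the agent starts at position 0 and takes one step, -1 or +1.
# Utility is 1 if it ends up on the goal, else 0.

def utility(position, goal):
    return 1 if position == goal else 0

def planner(goal):
    """Explicit 'EU maximizer': searches over actions for the best outcome."""
    return max([-1, +1], key=lambda action: utility(0 + action, goal))

def heuristic(goal):
    """Heuristic 'mesa-policy': always steps right, ignoring the goal.
    It was retained because it scored perfectly on the training tasks."""
    return +1

training_goals = [+1, +1, +1]   # in training, the goal always happened to be at +1
shifted_goals = [-1, +1, -1]    # deployment: the goal is sometimes at -1

for name, policy in [("planner", planner), ("heuristic", heuristic)]:
    train_score = sum(utility(policy(g), g) for g in training_goals)
    test_score = sum(utility(policy(g), g) for g in shifted_goals)
    print(f"{name}: train {train_score}/{len(training_goals)}, "
          f"shifted {test_score}/{len(shifted_goals)}")

# Both score 3/3 in training; only the planner keeps scoring well after the
# shift. The heuristic "looks like" a maximizer of the goal-utility on the
# training distribution without being internally structured as one.
```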
I don’t know that MIRI actually believes that what we need to do is write a bunch of proofs about our AI system, but it sure sounds like it, and that seems like a too difficult, and basically impossible task to me, if the proofs that we’re trying to write are about alignment or beneficialness or something like that.
DF FYI: My understanding of what MIRI (or at least Buck) thinks is that you don’t need to prove your AI system is beneficial, but you should have a strong argument that stands up to strict scrutiny, and some of the sub-arguments will definitely have to be proofs.
RS Seems plausible, I think I feel similarly about that claim
A straw version of this, which isn’t exactly what I mean but sort of is the right intuition, would be like maybe if you run the same… What’s the input that maximizes the output of this neuron? You’ll see that this particular neuron is a deception classifier. It looks at the input and then based on something, does some computation with the input, maybe the input’s like a dialogue between two people and then this neuron is telling you, “Hey, is person A trying to deceive person B right now?” That’s an example of the sort of thing I am imagining.
DF Huh—plausible that I’m misunderstanding you, but I imagine this being insufficient for safety monitoring because (a) many non-deceptive AIs are going to have the concept of deception anyway, because it’s useful, (b) statically you can’t tell whether or not the network is going to aim for deception just from knowing that it has a representation of deception, and (c) you don’t have a hope of monitoring it online to check if the deception neuron is lighting up when it’s talking to you.
FWIW I believe in the negation of some version of my point (b), where some static analysis reveals some evaluation and planning model, and you find out that in some situations the agent prefers itself being deceptive, where of course this static analysis is significantly more sophisticated than current techniques
RS Yeah, I agree with all of these critiques. I think I’m more pointing at the intuition for why we should expect this to be easier than we might initially think, rather than saying that specific idea is going to work.
E.g. maybe this is a reason that (relaxed) adversarial training actually works great, since the adversary can check whether the deception neuron is lighting up
DF Seems fair, and I think this kind of intuition is why I research what I do.
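(A minimal sketch of the kind of interpretability tooling being gestured at in this exchange, assuming, hypothetically, that some unit in a trained network really did come to act as a “deception classifier”. The model, layer index, and unit index below are invented placeholders, not anything from a real system; part 1 is the standard “what input maximizes this neuron?” trick via gradient ascent on the input, and part 2 is the runtime check referred to above.)

```python
# Sketch only: assumes we have some trained network and a hypothesis that
# unit UNIT in a particular layer acts as a "deception classifier".
# The model is a stand-in; layer/unit indices are invented placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(          # stand-in for a real trained model
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2),
)
model.eval()

LAYER, UNIT = 2, 17             # hypothetical "deception neuron" location

activations = {}
def hook(_module, _inp, out):
    activations["layer"] = out
model[LAYER].register_forward_hook(hook)

def unit_activation(x):
    """Run the model and read the hypothesised deception unit's activation."""
    model(x)
    return activations["layer"][..., UNIT]

# 1) Feature visualisation: gradient-ascend an input to maximise that unit,
#    to see what kind of input the unit responds to.
x = torch.zeros(1, 64, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = -unit_activation(x).mean()   # maximise the activation
    loss.backward()
    opt.step()

# 2) Runtime monitoring / adversary check: flag inputs on which the unit
#    fires strongly (the threshold here is arbitrary).
def deception_alarm(batch, threshold=1.0):
    with torch.no_grad():
        return unit_activation(batch) > threshold

print(deception_alarm(torch.randn(8, 64)))
```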
DF From your AI Impacts interview:
And then I claim that conditional on that scenario having happened, I am very surprised by the fact that we did not know this deception in any earlier scenario that didn’t lead to extinction. And I don’t really get people’s intuitions for why that would be the case. I haven’t tried to figure that one out though.
I feel like I believe that people notice deception early on but are plausibly wrong about whether or not they’ve fixed it
RS After a few failures, you’d think we’d at least know to expect it?
DF Sure, but if your AI is also getting smarter, then that probably doesn’t help you that much in detecting it, and only one person has to be wrong and deploy (if actually fixing takes a significantly longer time than sort of but not really fixing it) [this comment was written with less than usual carefulness]
RS Seems right, but in general human society / humans seem pretty good at being risk-averse (to the point that it seems to me that on anything that isn’t x-risk the utilitarian thing is to be more risk-seeking), and I’m hopeful that the same will be true here. (Also I’m assuming that it would take a bunch of compute, and it’s not that easy for a single person to deploy an AI, though even in that case I’d be optimistic, given that smallpox hasn’t been released yet.)
DF sorry by ‘one person’ I meant ‘one person in charge of a big team’
RS The hope is that they are constrained by all the typical constraints on such people (shareholders, governments, laws, public opinion, the rest of the team, etc.) Also this significantly decreases the number of people who can do the thing, restricts it to people who are “broadly reasonable” (e.g. no terrorists), and allows us to convince each such person individually. Also I rarely think there is just one person — at the very least you need one person with a bunch of money and resources and another with the technical know-how, and it would be very difficult for these to be the same person
DF Sure. I guess even with those caveats my scenario doesn’t seem that unlikely to me.
RS Sure, I don’t think this is enough to say “yup, this definitely won’t happen”. I think we do disagree on the relative likelihood of it happening, but maybe not by that much. (I’m hesitant to write a number because the scenario isn’t really fleshed out enough yet for us to agree on what we’re writing a number about.)
And the concept of 3D space seems like it’s probably going to be useful for an AI system no matter how smart it gets. Currently, they might have a concept of 3D space, but it’s not obvious that they do. And I wouldn’t be surprised if they don’t.
DF Presumably at some point they start actually using the concept of 4D locally-Minkowski spacetime instead (or quantum loops or whatever)
and in general—if you have things roughly like human notions of agency or cause, but formalised differently and more correctly than we would, that makes them harder to analyse.
RS I suspect they don’t use 4D spacetime, because it’s not particularly useful for most tasks, and takes more computation.
But I agree with the broader point that abstractions can be formalized differently, and that there can be more alien abstractions. But I’d expect that this happens quite a bit later
DF I mean maybe once you’ve gotten rid of the pesky humans and need to start building Dyson spheres… anyway I think curved 4D spacetime does require more computation than standard 3D modelling, but I don’t think that using Minkowski spacetime does.
RS Yeah, I think I’m often thinking of the case where AI is somewhat better than humans, rather than building Dyson spheres. Who knows what’s happening at Dyson sphere level. Probably should have said that in the conversation. (I think about it this way because it seems more important to align the first few AIs, and then have them help with aligning future ones.)
DF Sure. But even when you have AI that’s worrying about signal transmission between different cities and the GPS system, SR is not that much more computationally intensive than Newtonian 3D space, and critical for accuracy.
Like I think the additional computational cost is in fact very low, but nonzero.
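(Rough numbers behind the GPS point, as a back-of-the-envelope check rather than anything from the conversation: a GPS satellite moves at roughly \(v \approx 3.9\ \mathrm{km/s}\), so the special-relativistic time dilation of its clock is about
\[
\frac{\Delta t}{t} \approx \frac{v^2}{2c^2} \approx \frac{(3.9\times 10^3)^2}{2\,(3\times 10^8)^2} \approx 8\times 10^{-11},
\]
i.e. roughly \(7\ \mu\mathrm{s}\) of clock drift per day (the general-relativistic gravitational effect is larger, around \(45\ \mu\mathrm{s}\) per day in the other direction). Since light covers about \(300\ \mathrm{m}\) per microsecond, ignoring these corrections would wreck positioning within hours, while applying them is computationally almost free.)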
RS So like in practice if robots end up doing tasks like the ones we do, they develop intuitive physics models like ours, rather than Newtonian mechanics. SR might be only a bit more expensive than Newtonian, but I think most of the computational cost is in switching from heuristics / intuitive physics to a formal theory
(If they do different tasks than what we do, I expect them to develop their own internal physics which is pretty different from ours that they use for most tasks, but still not a formal theory)
DF Ooh, I wasn’t accounting for that but it seems right.
I do think that plausibly in some situations ‘intuitive physics’ takes place in Minkowski spacetime.
I also don’t think there’s a discrete point at which you can say, “I’ve won the race.” I think it’s just like capabilities keep improving and you can have more capabilities than the other guy, but at no point can you say, “Now I have won the race.”
DF I think that (a) this isn’t a disanalogy to nuclear arms races and (b) it’s a sign of danger, since at no point do people feel free to slow down and test safety.
RS I’m confused by (a). Surely you “win” the nuclear arms race once you successfully make a nuke that can be dropped on another country?
(b) seems right, idr if I was arguing for safety or just arguing for disanalogies and wanting more research
DF re (a), if you have nukes that can be dropped on me, I can then make enough nukes to destroy all your nukes. So you make more nukes, so I make more nukes (because I’m worried about my nukes being destroyed) etc. This is historically how it played out, see mid-20th C discussion of the ‘missile gap’.
re (b) fair enough
(it doesn’t actually necessarily play out as clearly as I describe: maybe you get nuclear submarines, I get nuclear submarine detection skills...)
RS (a) Yes, after the first nukes are created, the remainder of the arms race is relatively similar. I was thinking of the race to create the first nuke. (Arguably the US should have used their advantage to prevent all further nukes.)
DF I guess it just seems more natural to me to think of one big long arms race, rather than a bunch of successive races—like, I think if you look at the actual history of nuclear armament, at no point before major powers have tons of nukes are they in a lull, not worrying about making more. But this might be an artefact of me mostly knowing about the US side, which I think was unusual in its nuke production and worrying.
RS Seems reasonable, I think which frame you take will depend on what you’re trying to argue, I don’t remember what I was trying to argue with that. My impression was that when people talk about the “nuclear arms race”, they were talking about the one leading to the creation of the bomb, but I’m not confident in that (and can’t think of any evidence for it right now)
DF
My impression was that when people talk about the “nuclear arms race”, they were talking about the one leading to the creation of the bomb
ah, I did not have that impression. Makes sense.
FWIW I think I’ve only ever heard “nuclear arms race” used to refer to the buildup of more and more weapons, more advancements, etc., not a race to create the first nuclear weapon. And the Wikipedia article by that name opens with:
This page uses the phrase ‘A “Race” for the bomb’ (rather than “nuclear arms race”) to describe the US and Nazi Germany’s respective efforts to create the first nuclear weapon. My impression is that this “race” was a key motivation in beginning the Manhattan Project and in the early stages, but I’m not sure to what extent that “race” remained “live” and remained a key motivation for the US (as opposed to the US just clearly being ahead, and now being motivated by having invested a lot and wanting a powerful weapon to win the war sooner). That page says “By 1944, however, the evidence was clear: the Germans had not come close to developing a bomb and had only advanced to preliminary research.”
RS Yeah I think I was probably wrong about this (including what other people were talking about when they said “nuclear arms race”).