I hold this view; none of those are reasons for my view. The reason is much simpler—before x-risk-level failures, we’ll see less catastrophic (but still potentially very bad) failures for the same underlying reason. We’ll notice this, understand it, and fix the issue.
(A crux I expect people to have is whether we’ll actually fix the issue or “apply a bandaid” that is only a superficial fix.)
Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don’t see why it makes sense to expect smaller problems to arise first which indicate the potential larger problems and allow people to avert them before they occur.
If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.
Can you elaborate on what kinds of problems you expect to arise pre vs. post discontinuity?
E.g. will we see “sinister stumbles” (IIRC this was Adam Gleave’s name for half-baked treacherous turns)? I think we will, FWIW.
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn’t emphasize the “optimization” part.)
Jessica’s posts about MIRI vs. Paul’s views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and would also be a case where I’d expect, unless ML becomes “woke” to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), we’d see something that *looks* like a discontinuity, but is *actually* more like “the same reason”.
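For what it’s worth, the worry can be shown in a toy sketch. Everything here (the candidate policies, the objectives, the state ranges) is invented for illustration and is not a claim about real training: an outer selection process picks whichever policy scores best on the training objective, and the winner happens to be a policy that itself runs an inner search for a goal that only agrees with the outer objective on the training distribution.

```python
# Toy sketch of the mesa-optimization worry. All specifics (the candidate
# policies, the objectives, the state ranges) are invented for illustration.

def outer_objective(state, action):
    # The base objective the outer search selects for: "output the state".
    return -abs(action - state)

def make_direct_policy(offset):
    # An ordinary policy: a fixed rule, no inner optimization.
    return lambda state: state + offset

def make_inner_optimizer(inner_goal):
    # A "mesa-optimizer": a policy that itself searches for the action
    # best serving ITS OWN goal, which need not be the outer objective.
    def policy(state):
        return max(range(25), key=lambda a: -abs(a - inner_goal(state)))
    return policy

TRAIN_STATES = range(10)  # the distribution the outer search evaluates on

def train_score(policy):
    return sum(outer_objective(s, policy(s)) for s in TRAIN_STATES)

# The inner optimizer's goal agrees with the outer objective on
# TRAIN_STATES, but defaults to 0 everywhere else.
candidates = [make_direct_policy(o) for o in (-2, -1, 1, 2)]
candidates.append(make_inner_optimizer(lambda s: s if s < 10 else 0))

best = max(candidates, key=train_score)

assert train_score(best) == 0  # the mesa-optimizer wins outer selection...
assert best(15) == 0           # ...but off-distribution it pursues its own
                               # goal (the outer objective wanted 15)
```

The point of the toy is only that outer selection sees training behavior, not the inner goal, so "looks like a discontinuity, is actually the same reason" is easy to reproduce even at this tiny scale.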
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
This in particular doesn’t match my model. Quoting some relevant bits from Embedded Agency:
So I’m not talking about agents who know their own actions because I think there’s going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.
[...]
But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.
What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.
I also predict that there will be types of failure we will not notice, or will misinterpret. It seems fairly likely to me that proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don’t become aware of (e.g. because of accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc… I guess this only seems plausible if the proto-AGI has a hospitable environment (like the internet, human brains/memes) and/or means of reproduction in the real world.
A very similar problem would be a form of longer-term “seeding”, where an AI (at any stage) with a sufficiently advanced model of the world and long horizons discovers strategies for increasing the chances (“at the margin”) that its values dominate in the long-term future. With my limited knowledge of physics, I imagine there might be ways of doing this just by beaming signals into space in a way calculated to influence/spur the development of life/culture in other parts of the galaxy.
I notice a lot of what I said above makes less sense if you think of AIs as having a similar skill profile to humans, but I think we agree that AIs might be much more advanced than people in some respects while still falling short of AGI because of weaknesses in other areas.
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI’s (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy. To pull an example (not meant to be realistic) out of a hat: we might have AIs that can’t carry on a conversation, but can implement a very sophisticated covert world-domination strategy.
It seems fairly likely to me proto-AGI (i.e. AI that could autonomously learn to become AGI within <~10yrs of acting in the real world) is deployed and creates proto-AGI subagents, some of which we don’t become aware of (e.g. because accidental/incidental/deliberate steganography) and/or are unable to keep track of. And then those continue to survive and reproduce, etc…
Now I’m wondering if it makes sense to model past or present cognitive-cultural information processes in a similar fashion. Memetic and cultural evolution are a thing, and any agent-like processes that spawn could piggyback on our existing general intelligence architecture.
Yeah, I think it totally does! (and that’s a very interesting / “trippy” line of thought :D)
However, it does seem to me somewhat unlikely, since it does require fairly advanced intelligence, and I don’t think evolution is likely to have produced such advanced intelligence with us being totally unaware, whereas I think something about the way we train AI is more strongly selecting for “savant-like” intelligence, which is sort of what I’m imagining here. I can’t think of why I have that intuition OTTMH.
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI’s (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy.
Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many “easy” things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.)
I also predict that there will be types of failure we will not notice, or will misinterpret. [...]
All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I’d be more worried about scenarios like these.
So I don’t take EY’s post as about AI researchers’ competence, as much as their incentives and levels of rationality and paranoia. It does include significant competitive pressures, which seems realistic to me.
I don’t think I’m underestimating AI researchers, either, but for a different reason… let me elaborate a bit: I think there are waaaaay too many skills for us to hope to have a reasonable sense of what an AI is actually good at. By skills I’m imagining something more like options, or having accurate generalized value functions (GVFs), than tasks.
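To ground the GVF framing, here is a minimal sketch in the spirit of Sutton et al.’s Horde architecture; the class shape and the toy chain environment are my own invented example, not anything from this thread. A GVF is just a value function whose “reward” is an arbitrary cumulant signal, so an agent can carry many small predictions without anything resembling a full world model:

```python
# Minimal sketch of a generalized value function (GVF), in the spirit of
# the Horde architecture. All names and the toy chain are illustrative
# assumptions: a GVF is a value function whose "reward" is an arbitrary
# cumulant signal, learned here by tabular TD(0).

class GVF:
    def __init__(self, n_states, cumulant, gamma=0.9, alpha=0.1):
        self.v = [0.0] * n_states  # predicted discounted cumulant sum
        self.cumulant = cumulant   # maps a state to the signal to predict
        self.gamma = gamma         # prediction horizon (discount)
        self.alpha = alpha         # step size

    def update(self, s, s_next):
        # Standard TD(0) update toward cumulant(s_next) + gamma * V(s_next).
        target = self.cumulant(s_next) + self.gamma * self.v[s_next]
        self.v[s] += self.alpha * (target - self.v[s])


# Toy usage: predict proximity to state 4 on a 5-state chain that the
# agent repeatedly walks left to right. The cumulant is not a task reward,
# just a signal worth predicting.
gvf = GVF(n_states=5, cumulant=lambda s: 1.0 if s == 4 else 0.0)
for _ in range(200):
    for s in range(4):
        gvf.update(s, s + 1)

# States nearer 4 end up with higher predicted values.
assert gvf.v[3] > gvf.v[2] > gvf.v[1] > gvf.v[0] > 0.0
```

A “savant” in the sense above would be an agent with a handful of very accurate GVFs like this one, rather than a broad, human-legible world model.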
Regarding long-term planning, I’d factor this into 2 components:
1) having a good planning algorithm
2) having a good world model
I think the way long-term planning works is that you do short-term planning in a good hierarchical world model. I think AIs will have vastly superhuman planning algorithms (arguably, they already do), so the real bottleneck is the world-model.
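To make that concrete, here is a toy sketch of “short-term planning in a hierarchical world model”; the two-level graph and all the helper names are made up for illustration. The same cheap short-horizon search (breadth-first search) is run once over a tiny abstract graph and once within each abstract edge, and composing the results yields a long concrete plan no single search ever had to find:

```python
# Toy sketch: long-horizon plans from short-horizon search inside a
# hierarchical world model. The two-level graph and helper names are
# invented for illustration.

from collections import deque

def bfs(graph, start, goal):
    """Shortest path by breadth-first search; stands in for any short-horizon planner."""
    frontier, parents = deque([start]), {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = [node]
            while parents[node] is not None:
                node = parents[node]
                path.append(node)
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                frontier.append(nxt)
    return None

# Abstract level: coarse "regions"; each abstract edge hides many concrete steps.
abstract = {"A": ["B"], "B": ["C"], "C": []}
# Concrete level: fine-grained states; the short-horizon planner only ever
# searches within one abstract edge at a time.
concrete = {
    ("A", "B"): {"a0": ["a1"], "a1": ["b0"]},
    ("B", "C"): {"b0": ["b1"], "b1": ["c0"]},
}
entry = {"A": "a0", "B": "b0", "C": "c0"}  # concrete entry point per region

def hierarchical_plan(start_region, goal_region):
    plan = []
    regions = bfs(abstract, start_region, goal_region)  # long-horizon, tiny search
    for r, r_next in zip(regions, regions[1:]):
        # short-horizon refinement of one abstract edge
        leg = bfs(concrete[(r, r_next)], entry[r], entry[r_next])
        plan.extend(leg if not plan else leg[1:])
    return plan

print(hierarchical_plan("A", "C"))  # prints ['a0', 'a1', 'b0', 'b1', 'c0']
```

The design point is that each call to the planner stays short-horizon; the apparent long-horizon competence lives entirely in how good (and how hierarchical) the world model is, which is why the world model looks like the bottleneck.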
I don’t think it’s necessary to have a very “complete” world-model (i.e. enough knowledge to look smart to a person) in order to find “steganographic” long-term strategies like the ones I’m imagining.
I also don’t think it’s even necessary to have anything that looks very much like a world-model. The AI can just have a few good GVFs… (i.e. be some sort of savant).
This is also the topic of The Rocket Alignment Problem.
Interesting. Your crux seems good; I think it’s a crux for us. I expect things play out more like Eliezer predicts here: https://www.facebook.com/jefftk/posts/886930452142?comment_id=886983450932&comment_tracking=%7B%22tn%22%3A%22R%22%7D&hc_location=ufi