Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don’t see why it makes sense to expect smaller problems to arise first that indicate the potential larger problems and allow people to avert them before they occur.
If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.
Can you elaborate on what kinds of problems you expect to arise pre vs. post discontinuity?
E.g. will we see “sinister stumbles” (IIRC this was Adam Gleave’s name for half-baked treacherous turns)? I think we will, FWIW.
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn’t emphasize the “optimization” part.)
Jessica’s posts about MIRI vs. Paul’s views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and it would also be a case where, unless ML becomes “woke” to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), I’d expect we’d see something that *looks* like a discontinuity but is *actually* more like “the same reason”.
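(A minimal toy sketch of the structural idea behind “mesa-optimization”, for readers who haven’t met the term: a learned policy whose forward pass is itself a search over an internal objective. The function names and the toy objective below are illustrative assumptions, not anything from the dialogue or from the mesa-optimization literature.)

```python
# A minimal toy sketch of what a "mesa-optimizer" looks like structurally:
# the learned artifact itself runs a search at inference time, over an
# internal objective that need not match the base objective the outer
# training process selected it for. Names and the toy objective are
# illustrative assumptions only.

def mesa_policy(observation, candidate_actions):
    """A policy whose forward pass is itself an optimization loop."""

    def mesa_objective(action):
        # Hypothetical internal proxy goal: prefer actions close to the
        # observation. In a real trained system this objective would be
        # implicit in the learned weights, not written out like this.
        return -abs(observation - action)

    # Inner search over candidate actions: optimization happening inside
    # the model, not in the outer training loop.
    return max(candidate_actions, key=mesa_objective)


print(mesa_policy(3.2, [0, 1, 2, 3, 4]))  # -> 3, the action the inner search prefers
```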
> Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
This in particular doesn’t match my model. Quoting some relevant bits from Embedded Agency:
> So I’m not talking about agents who know their own actions because I think there’s going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.
>
> [...]
>
> But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.
>
> What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.
This is also the topic of The Rocket Alignment Problem.
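(For concreteness, a minimal toy sketch of the “choose the larger pile of money” (5-and-10) setup the quoted passage alludes to; everything in it is illustrative, not taken from the dialogue or from Embedded Agency. It shows the easy, non-embedded version of the choice, which is exactly the case the quote says becomes confusing once the agent is part of the world it reasons about.)

```python
# Illustrative toy version of the "larger pile of money" (5-and-10) setup:
# a fully known world in which evaluating your own possible actions looks
# trivial from the outside. All names here are assumptions for illustration.

def world(action):
    """Fully known environment: the $10 pile is worth 10, the $5 pile is worth 5."""
    return 10 if action == "ten" else 5

def detached_agent():
    """Pick the action whose simulated consequence is best.

    This works because the agent here stands outside the world it simulates.
    The confusion the quote points at is that an embedded agent that can
    deduce its own action has trouble giving meaning to "what would happen
    if I took the other action"; this sketch simply hides that difficulty.
    """
    return max(["five", "ten"], key=world)

print(detached_agent())  # -> "ten"
```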