Can you elaborate on what kinds of problems you expect to arise pre- vs. post-discontinuity?
E.g., will we see “sinister stumbles” (IIRC this was Adam Gleave’s name for half-baked treacherous turns)? I think we will, FWIW.
Or do you think the discontinuity will be more in the realm of embedded-agency-style concerns? (And how does this make it less safe, rather than just dysfunctional?)
How about mesa-optimization? (I think we already see qualitatively similar phenomena, but my idea of this doesn’t emphasize the “optimization” part.)
Jessica’s posts about MIRI’s vs. Paul’s views made it seem like MIRI might be quite concerned about the first AGI arising via mesa-optimization. This seems likely to me, and it would also be a case where, unless ML becomes “woke” to mesa-optimization (which seems likely to happen, and not too hard to make happen, to me), I’d expect us to see something that *looks* like a discontinuity but is *actually* more like “the same reason”.
> Or do you think the discontinuity will be more in the realm of embedded-agency-style concerns? (And how does this make it less safe, rather than just dysfunctional?)
This in particular doesn’t match my model. Quoting some relevant bits from *Embedded Agency*:
> So I’m not talking about agents who know their own actions because I think there’s going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.
>
> [...]
>
> But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.
>
> What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.
This is also the topic of *The Rocket Alignment Problem*.