I don’t see this as worst-case thinking. I do see it as speaking from a model that many locals don’t share (without any particular attempt to argue for that model).
In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.
AFAICT, our degree of disagreement here turns on what you mean by “pointed”. Depending on that, I expect I’d either say “yeah maybe, but that kind of pointing is hard” or “yep, my highest-credence models have pretty high probability on this thing failing to optimize X once it’s sorted out”.
For instance, the latter response obtains if the “pointing” is done by naive training.
(Though I also have some sense that I see the situation as more fragile than you—there’s lots of ways for reflection to ruin your day, if the wrong kludge is pointed the wrong way. So maybe we have a broader disagreement about that, too.)
Also, as a reminder, my high credence in doom doesn’t come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.
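To put toy numbers on the “disjunctive” point (an illustrative sketch, not part of the original exchange; it treats the failure modes as independent, an assumption the comment itself doesn’t make): suppose doom follows if any one of $n$ failure modes fires, and mode $i$ is avoided with probability $q_i$. Then

$$P(\text{doom}) \;=\; 1 - \prod_{i=1}^{n} q_i.$$

With “one nine” of confidence per claim ($q_i = 0.9$), this is already $\approx 0.41$ at $n = 5$ and $\approx 0.65$ at $n = 10$: a high overall doom credence without three nines of confidence in any single claim.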
Oh, sorry. I’m “uncertain” assuming Model-Based RL with the least-doomed plan that I feel like I more-or-less know how to implement right now. If we’re talking about “naive training”, then I’m probably very pessimistic, depending on the details.
That’s helpful, thanks!
UPDATE: The “least-doomed plan” I mentioned is now described in a simpler & more self-contained post, for readers’ convenience. :)