I wish that everyone (including OP) would be clearer about whether or not we’re doing worst-case thinking, and why.
In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”. I don’t have a strong reason to expect that to happen, and I also don’t have a strong reason to expect that to not happen. I mostly feel uncertain and confused.
So if the debate is “Are Eliezer & Nate right about ≳99% (or whatever) chance of doom?”, then I find myself on the optimistic side (at least, leaving aside the non-technical parts of the problem), whereas if the debate is “Do we have a strong reason to believe that thus-and-such plan will actually solve technical alignment?”, then I find myself on the pessimistic side.
Separately, I don’t think it’s true that reflectively-stable hard superintelligence needs to have a particular behavioral goal, for reasons here.
I don’t see this as worst-case thinking. I do see it as speaking from a model that many locals don’t share (without any particular attempt made to argue that model).
AFAICT, our degree of disagreement here turns on what you mean by “pointed”. Depending on that, I expect I’d either say “yeah maybe, but that kind of pointing is hard” or “yep, my highest-credence models have pretty high probability on this thing failing to optimize X once it’s sorted out”.
For instance, the latter response obtains if the “pointing” is done by naive training.
(Though I also have some sense that I see the situation as more fragile than you do: there are lots of ways for reflection to ruin your day if the wrong kludge is pointed the wrong way. So maybe we have a broader disagreement about that, too.)
Also, as a reminder, my high credence in doom doesn’t come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.
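A rough way to see the arithmetic behind the "disjunctive" point is sketched below. This is not the commenter's own calculation: the route probabilities are made up for illustration, and treating the failure routes as independent is a simplifying assumption. It only shows that several moderately likely routes can combine into a high overall probability even when no single route reaches "one nine" (90%).

```python
from math import prod


def p_any(route_probs):
    """Probability that at least one route occurs, ASSUMING independence."""
    return 1.0 - prod(1.0 - p for p in route_probs)


# Hypothetical numbers, chosen only to illustrate the arithmetic.
single_route = [0.9]        # one claim held at "one nine" of confidence
several_routes = [0.6] * 5  # five separate routes, each only 60% likely

print(p_any(single_route))    # a single route contributes just its own probability
print(p_any(several_routes))  # 1 - 0.4**5, i.e. roughly 0.99
```

If the routes were strongly correlated rather than independent, the combined figure would be lower, so the sketch is an upper-end illustration rather than an estimate.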
Oh, sorry. I’m “uncertain” assuming Model-Based RL with the least-doomed plan that I feel like I more-or-less know how to implement right now. If we’re talking about “naïve training”, then I’m probably very pessimistic, depending on the details.
That’s helpful, thanks!
UPDATE: The “least-doomed plan” I mentioned is now described in a simpler, more self-contained post, for readers’ convenience. :)
Given a sufficiently kludgy pile of heuristics, it won’t make another AI unless it has a heuristic towards making AI. (In which case the kind of AI it makes depends on its AI-making heuristics.) GPT5 won’t code an AI to minimize predictive error on text. It will code some random AI that looks like something in the training dataset, and it will care more about what the variable names are than about what the AI actually does.
Big piles of kludges usually arise from training a kludge-finding algorithm (like deep learning). So the only ways agents could get AI-building kludges are from making dumb AIs or from reading human writings.
Alternately, maybe the AI has sophisticated self-reflection. It is looking at its own kludges and trying to figure out what it values. In which case, does the AI’s metaethics contain a simplicity prior? With a strong simplicity prior, an agent with a bunch of kludges that mostly maximized diamond could turn into an actual crystalline diamond maximizer. If it doesn’t have that simplicity prior, I would guess it ends up optimizing some complicated utility function. (But it would probably produce a lot of diamond as it did so: diamond isn’t the only component of its utility, but it is a big one.)
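As a toy illustration of the simplicity-prior point above (purely hypothetical: the candidate utility functions, their "complexity" and "fit" numbers, and the scoring rule are all invented here, and nothing below models how a real agent would reflect), one can picture the agent choosing whichever candidate best trades off fit to its existing kludges against complexity:

```python
# Hypothetical candidates: (name, complexity in arbitrary units,
# fit to the kludges' existing behavior on a 0-to-1 scale).
CANDIDATES = [
    ("pure diamond maximizer", 10, 0.80),
    ("diamond plus many special cases", 200, 0.99),
]


def pick_utility(candidates, prior_strength):
    """Return the name of the candidate maximizing fit - prior_strength * complexity."""
    best = max(candidates, key=lambda c: c[2] - prior_strength * c[1])
    return best[0]


# Strong simplicity prior: the complexity penalty dominates.
print(pick_utility(CANDIDATES, prior_strength=0.01))
# Weak simplicity prior: the better-fitting, more complicated option wins.
print(pick_utility(CANDIDATES, prior_strength=0.0001))
```

With these made-up numbers, the strong prior selects the simple diamond maximizer and the weak prior selects the complicated utility function in which diamond is still a large component.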
For my part, I expect a pile of kludges (learned via online model-based RL) to eventually guide the AI into doing self-reflection. (Self-reflection is, after all, instrumentally convergent.) If I’m right, then it would be pretty hard to reason about what will happen during self-reflection in any detail. Likewise, it would be pretty hard to intervene in how the self-reflection will work.
E.g. we can’t just “put in” or “not put in” a simplicity prior. The closest thing that we could do is try to guess whether or not a “simplicity kludge” would have emerged, and to what extent that kludge would be active in the particular context of self-reflection, etc.—which seems awfully fraught.
To be clear, while I think it would be pretty hard to intervene on the self-reflection process, I don’t think it’s impossible. I don’t have any great ideas right now but it’s one of the things I’m working on.