Yeah, but if you generalize from humans another way (“they tend not to destroy the world and tend to care about other humans”), you’ll come to a wildly different conclusion.
Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That’s not the main issue.
Here’s how the whole situation looks from my perspective:
We don’t know how generally-intelligent entities like humans work, or what the capability for general intelligence is entangled with.
Our only reference point is humans. Humans exhibit a lot of dangerous properties, like deceptiveness and consequentialist-like reasoning that seems able to disregard contextually-learned values.
There are some gears-level models that suggest intelligence is necessarily entangled with deception-ability (e.g., mine), and some gears-level models that suggest it’s not (e.g., yours). Overall, we have no definitive evidence either way. We have not reverse-engineered any generally-intelligent entities.
We have some insight into how SOTA AIs work. But SOTA AIs are not generally intelligent. Whatever safety assurances our insights into SOTA AIs give us do not necessarily generalize to AGI.
SOTA AIs are, nevertheless, superhuman at some of the tasks we’ve managed to get them working on so far. By volume, GPT-4 can outperform teams of coders, and Midjourney is putting artists out of business. Hallucinations are a problem, but if they were solved, these systems would plausibly wipe out whole industries.
An AI that outperforms humans at deception and strategy by the same margin as GPT-4/Midjourney outperform them at writing/coding/drawing would plausibly be an extinction-level threat.
The AI industry leaders are purposefully trying to build a generally-intelligent AI.
The AI industry leaders are not rigorously checking every architectural tweak or cute AutoGPT setup to ensure that it’s not going to give their model room to develop deceptive alignment and other human-like issues.
Summing up: There’s reasonable doubt regarding whether AGIs would necessarily be deception-capable. Highly deception-capable AGIs would plausibly be an extinction risk. The AI industry is currently trying to blindly-but-purposefully wander in the direction of AGI.
Even shorter: There’s a plausible case that, on its current course, the AI industry is going to generate an extinction-capable AI model.
There are no ironclad arguments against that, unless you buy into your inside-view model of generally-intelligent cognition as hard as I buy into mine.
And what you effectively seem to be saying is “until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures”.
What I’m saying is “until you can rigorously prove that a given scale-up plus architectural tweak isn’t going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically”.
Yes, “prove that this technological advance isn’t going to kill us all or you’re not allowed to do it” is a ridiculous standard to apply in the general case. But in this one case, there’s a plausible-enough argument that it might, and that argument has not actually been soundly refuted by our getting some insight into how LLMs work and coming up with a theory of their cognition.
And what you effectively seem to be saying is “until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures”.
No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won’t be within-forward-pass deceptively aligned, and instead their agency will e.g. come from scaffolding-like structures. I think that’s important. I think it’s important that we not, e.g., anchor on old speculation about AIXI or within-forward-pass deceptive alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn’t mean it’s fine and dandy to keep scaling with no concern at all.
The reason my percentage is “only 5 to 15” is that I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc.
(Hopefully this comment of mine clarifies; it feels kinda vague to me.)
What I’m saying is “until you can rigorously prove that a given scale-up plus architectural tweak isn’t going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically”.
But I do think this is way too high of a bar.
No, I am in fact quite worried about the situation
Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.
I think these AGIs won’t be within-forward-pass deceptively aligned, and instead their agency will e.g. come from scaffolding-like structures
Would you outline your full argument for this and the reasoning/evidence backing that argument?
To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs’ internals, until we have either an AGI we’ve empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid, maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanistic theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).
Would you disagree? If yes, how so?