It seems to me that you have very high confidence in being able to predict the “eventual” architecture / internal composition of AGI. I don’t know where that apparent confidence is coming from.
The “canonical” views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.
I would instead say:
The canonical views dreamed up systems which don’t exist, which have never existed, and which might not ever exist.[1] Given those assumptions, some people have drawn strong conclusions about AGI risk.
We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized). And so rather than justifying “does current evidence apply to ‘superintelligences’?”, I’d like to see justification of “under what conditions does speculation about ‘superintelligent consequentialism’ merit research attention at all?” and “why do we think ‘future architectures’ will have property X, or whatever?!”.
The views might have, for example, fundamentally misunderstood how cognition and motivation work (anyone remember worrying about ‘but how do I get an AI to rescue my mom from a burning building, without specifying my whole set of values’?).
We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized).
I disagree that it is actually “first-principles”. It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.
As I’d tried to outline in the post, I think “what are AIs that are known to exist, and what properties do they have?” is just the wrong question to focus on. The shared “AI” label is a red herring. The relevant question is “what are scarily powerful generally-intelligent systems that exist, and what properties do they have?”, and the only relevant data point seems to be humans.
And as far as omnicide risk is concerned, the question shouldn’t be “how can you prove these systems will have the threatening property X, like humans do?” but “how can you prove these systems won’t have the threatening property X, like humans do?”.
I disagree that it is actually “first-principles”. It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.
Yeah, but if you generalize from humans another way (“they tend not to destroy the world and tend to care about other humans”), you’ll come to a wildly different conclusion. The conclusion should not be sensitive to poorly motivated reference classes and frames, unless it’s really clear why we’re using one frame. This is a huge peril of reasoning by analogy.
Whenever attempting to draw conclusions by analogy, it’s important that there be shared causal mechanisms which produce the outcome of interest. For example, I can simulate a spring using an analog computer because both systems are roughly governed by similar differential equations. In shard theory, I posited that there’s a shared mechanism of “local updating via self-supervised and TD learning on ~randomly initialized neural networks” which leads to things like “contextually activated heuristics” (or “shards”).
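The spring example can be made concrete: the analogy works because any two systems governed by the same differential equation trace the same trajectories, so one can stand in for the other. A minimal sketch of numerically simulating the undamped spring equation x'' = -(k/m)x (the function name and parameter choices are mine, for illustration only):

```python
import math

def simulate_spring(x0, v0, k=1.0, m=1.0, dt=0.001, steps=10_000):
    """Semi-implicit (symplectic) Euler integration of x'' = -(k/m) x."""
    x, v = x0, v0
    for _ in range(steps):
        a = -(k / m) * x   # Hooke's law acceleration
        v += a * dt        # update velocity first (symplectic Euler)
        x += v * dt        # then position, using the new velocity
    return x

# With k = m = 1 the analytic solution is x(t) = x0 * cos(t), so after
# t = 10 s the simulation should closely track cos(10).
approx = simulate_spring(x0=1.0, v0=0.0)
exact = math.cos(10.0)
assert abs(approx - exact) < 0.05
```

The point of the analogy is exactly this shared governing mechanism: the claim is that human-to-AI generalization needs an analogous shared mechanism to carry any evidential weight.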
Here, it isn’t clear what the shared mechanism is supposed to be, such that both (future) AI and humans have it. Suppose I grant that if a system is “smart” and has “goals”, then bad things can happen. Let’s call that the “bad agency” hypothesis.
But how do we know that future AI will have the relevant cognitive structures for “bad agency” to be satisfied? How do we know that the AI will have internal goal representations which chain into each other across contexts, so that the AI reliably pursues one or more goals over time? How do we know that the mechanisms are similar enough for the human->AI analogy to provide meaningful evidence on this particular question?
I expect there to be “bad agency” systems eventually, but it really matters what kind we’re talking about. If you’re thinking of “secret deceptive alignment that never externalizes in the chain-of-thought” and I’m thinking about “scaffolded models prompted to be agentic and bad”, then our interventions will be wildly different.
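The difference between the two threat models can be seen in where the goal-pursuit machinery lives. In the scaffolded case, the agency sits in an inspectable outer loop rather than inside a forward pass. A hypothetical minimal sketch (all names are mine; `call_model` is a stub standing in for any LLM API, not a real one):

```python
def call_model(prompt: str) -> str:
    # Stub: a real scaffold would query an LLM here; this one halts immediately.
    return "FINISH: done"

def run_agent(goal: str, max_steps: int = 5) -> list[str]:
    history: list[str] = []  # persistent state lives in the scaffold, not the weights
    for _ in range(max_steps):
        prompt = f"Goal: {goal}\nHistory: {history}\nNext action?"
        action = call_model(prompt)
        history.append(action)
        if action.startswith("FINISH"):  # the scaffold, not the model, decides to halt
            break
    return history

transcript = run_agent("book a flight")
```

Interventions differ accordingly: a loop like this can be monitored, rate-limited, or denied tools, whereas a within-forward-pass deceptive planner could not be caught by reading the transcript at all.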
Yeah, but if you generalize from humans another way (“they tend not to destroy the world and tend to care about other humans”), you’ll come to a wildly different conclusion
Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That’s not the main issue.
Here’s how the whole situation looks from my perspective:
We don’t know how generally-intelligent entities like humans work, or what the general-intelligence capability is entangled with.
Our only reference point is humans. Humans exhibit a lot of dangerous properties, like deceptiveness and consequentialist-like reasoning that seems able to disregard contextually learned values.
There are some gears-level models that suggest intelligence is necessarily entangled with deception-ability (e.g., mine), and some that suggest it’s not (e.g., yours). Overall, we have no definitive evidence either way. We have not reverse-engineered any generally-intelligent entities.
We have some insight into how SOTA AIs work. But SOTA AIs are not generally intelligent. Whatever safety assurances our insights into SOTA AIs give us, do not necessarily generalize to AGI.
SOTA AIs are, nevertheless, superhuman at some of the tasks at which we’ve managed to get them working so far. By volume, GPT-4 can outperform teams of coders, and Midjourney is putting artists out of business. Hallucinations are a problem, but if they were gone, these systems would plausibly wipe out whole industries.
An AI that outperforms humans at deception and strategy by the same margin as GPT-4/Midjourney outperform them at writing/coding/drawing would plausibly be an extinction-level threat.
The AI industry leaders are purposefully trying to build a generally-intelligent AI.
The AI industry leaders are not rigorously checking every architectural tweak or cute AutoGPT setup to ensure that it’s not going to give their model room to develop deceptive alignment and other human-like issues.
Summing up: There’s reasonable doubt regarding whether AGIs would necessarily be deception-capable. Highly deception-capable AGIs would plausibly be an extinction risk. The AI industry is currently trying to blindly-but-purposefully wander in the direction of AGI.
Even shorter: There’s a plausible case that, on its current course, the AI industry is going to generate an extinction-capable AI model.
There are no ironclad arguments against that, unless you buy into your inside-view model of generally-intelligent cognition as hard as I buy into mine.
And what you effectively seem to be saying is “until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures”.
What I’m saying is “until you can rigorously prove that a given scale-up plus architectural tweak isn’t going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically”.
Yes, “prove that this technological advance isn’t going to kill us all or you’re not allowed to do it” is a ridiculous standard to apply in the general case. But in this one case, there’s a plausible-enough argument that it might, and that argument has not actually been soundly refuted by our getting some insight into how LLMs work and coming up with a theory of their cognition.
And what you effectively seem to be saying is “until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures”.
No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won’t be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures. I think that’s important. I think it’s important that we not eg anchor on old speculation about AIXI or within-forward-pass deceptive-alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn’t mean it’s fine and dandy to keep scaling with no concern at all.
The reason my percentage is “only 5 to 15” is that I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc.
(Hopefully this comment of mine clarifies; it feels kinda vague to me.)
What I’m saying is “until you can rigorously prove that a given scale-up plus architectural tweak isn’t going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically”.
But I do think this is way too high of a bar.
No, I am in fact quite worried about the situation
Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.
I think these AGIs won’t be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures
Would you outline your full argument for this and the reasoning/evidence backing that argument?
To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs’ internals, until we have either an AGI we’ve empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanistic theories of cognition (shard theory vs. coherence theorems and their ilk stitched into something like my model).
Would you disagree? If yes, how so?
How about diffusion planning as a model? Or DreamerV3? If LLMs are the only model you’ll consider, you have blinders on. The core of the threat model is easily demonstrated with RL-first models, and while LLMs are certainly in the lead right now, there’s no strong reason to believe the humans trying to make the most powerful AI will continue to use architectures limited by the slow speed of RLHF.
Certainly I don’t think the original foom expectations were calibrated; it should have been obvious since at least 2015 that deep learning was going to win. But that doesn’t mean there’s no place for a threat model that looks like long-term agency models; all it takes to model that is long-horizon diffusion planning. Agency also comes up more the more RL you do. You added an eye-roll react to my comment that RLHF is safety-washing, but do you really think we’re in a place where the people providing the RL feedback can goalcraft AI in a way that will prevent humans from getting gentrified out of the economy? That’s just the original threat model, but a little slower. So yeah, maybe there’s stuff to push back on. But don’t make your conceptual brush size too big when you push back. Predictable architectures are enough to motivate this line of reasoning.
“under what conditions does speculation about ‘superintelligent consequentialism’ merit research attention at all?”
Under the conditions where the relevant concepts, and the future itself, are confusing. Using real systems (both AIs and humans) to anchor theory is valuable, but so is blue-sky theory that doesn’t care about currently available systems and investigates whatever hasn’t yet been investigated and seems to make sense, when there are ideas to formulate or problems to solve, regardless of their connection to reality. A lot of math doesn’t care about applications, and it might take decades to stumble on a use for a small fraction of it (even though that’s not usually the point).
FWIW I did not interpret Thane as necessarily having “high confidence” in “architecture / internal composition” of AGI. It seemed to me that they were merely (and ~accurately) describing what the canonical views were most worried about. (And I think a discussion about whether or not being able to “model the world” counts as a statement about “internal composition” is sort of beside the point/beyond the scope of what’s really being said)
It’s fair enough if you would say things differently(!) but in some sense isn’t it just pointing out: ‘I would emphasize different aspects of the same underlying basic point’. And I’m not sure if that really progresses the discussion? I.e. it’s not like Thane Ruthenis actually claims that “scarily powerful artificial agents” currently exist. It is indeed true that they don’t exist and may not ever exist. But that’s just not really the point they are making so it seems reasonable to me that they are not emphasizing it.
----
I’d like to see justification of “under what conditions does speculation about ‘superintelligent consequentialism’ merit research attention at all?” and “why do we think ‘future architectures’ will have property X, or whatever?!”.
I think I would also like to see more thought about this. In some ways, after first getting into the general area of AI risk, I was disappointed that the alignment/safety community was not more focussed on questions like this. Like a lot of people, I’d been originally inspired by Superintelligence—significant parts of which relate to these questions imo—only to be told that the community had ‘kinda moved away from that book now’. And so I sort of sympathize with the vibe of Thane’s post (and worry that there has been a sort of mission creep)
“why do we think ‘future architectures’ will have property X, or whatever?!”.
This is the biggest problem with a lot of AI risk work: the gleeful assuming that AIs will have certain properties. It’s one of my biggest issues with the post, in that, with a few exceptions, it assumes that real AGIs or future AGIs will confidently have certain properties, when there is not much reason to make the strong assumptions that Thane Ruthenis does about AI safety, and I’m annoyed by how often this occurs.
it assumes that real AGIs or future AGIs will confidently have certain properties like having deceptive alignment
The post doesn’t claim AGIs will be deceptively aligned; it claims that AGIs will be capable of implementing deceptive alignment due to internally doing large amounts of consequentialist-y reasoning. This seems like a very different claim. This claim might also be false (for reasons I discuss in the second bullet point of this comment), but it’s importantly different and IMO much more defensible.
I was just wrong here, apparently, I misread what Thane Ruthenis is saying, and I’m not sure what to do with my comment up above.
I’d like to see justification of “under what conditions does speculation about ‘superintelligent consequentialism’ merit research attention at all?” and “why do we think ‘future architectures’ will have property X, or whatever?!”.
One of my mental models for alignment work is “contingency planning”. There are a lot of different ways AI research could go. Some might be dangerous. Others less so. If we can forecast possible dangers in advance, we can try to steer towards safer designs, and generate contingency plans with measures to take if a particular forecast for AI development ends up being correct.
The risk here is “person with a hammer” syndrome, where people try to apply mental models from thinking about superintelligent consequentialists to other AI systems in a tortured way, smashing round pegs into square holes. I wish people would look at the territory more, and do a little bit more blue sky security thinking about unknown unknowns, instead of endlessly trying to apply the classic arguments even when they don’t really apply.
A specific research proposal would be: Develop a big taxonomy or typology of how AGI could work by identifying the cruxes researchers have, then for each entry in your typology, give it an estimated safety rating, try to identify novel considerations which apply to it, and also summarize the alignment proposals which are most promising for that particular entry.
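The proposed taxonomy has a natural record shape. A hypothetical sketch of what one entry might look like (the field names and example values are mine, chosen only to illustrate the proposal's structure):

```python
from dataclasses import dataclass, field

@dataclass
class AGIDesignEntry:
    name: str
    cruxes: list[str]                  # researcher disagreements this entry hinges on
    safety_rating: int                 # e.g. 1 (most dangerous) to 5 (safest)
    novel_considerations: list[str] = field(default_factory=list)
    promising_proposals: list[str] = field(default_factory=list)

taxonomy = [
    AGIDesignEntry(
        name="scaffolded LLM agent",
        cruxes=["does agency live in the outer loop or in the weights?"],
        safety_rating=3,
        promising_proposals=["chain-of-thought monitoring"],
    ),
]
```

Even a toy schema like this forces the cruxes to be stated explicitly, which is most of the proposal's value.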