At first I thought oh no, Connor is planning to Start Over and Do It Right. And people will follow him because Connor is awesome. And then we’re more likely to all die, because there isn’t time to start over and nobody has a plan to stop progress toward ASI on the current route.
Then I saw the section on Voyager. Great, I thought: Connor is going to make a better way to create language model agents with legible and faithful chains of thought (in structured code). Those seem like our best route to survival, since they’re as alignable as we’re going to get, and near the default trajectory. Conjecture’s previous attempt to make LMA CoEms seemed like a good idea. Hopefully everyone else is hitting walls just as fast, but an improved technique can still beat training for complex thought in ways that obscure the chain of thought.
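For concreteness, here is a rough sketch of the kind of thing I mean by a legible chain of thought in structured code. None of these names are Voyager’s or Conjecture’s actual interfaces; the point is just that every step of the agent’s reasoning is emitted as an explicit, inspectable object rather than buried in opaque activations.

```python
# Illustrative sketch only: all names (PlanStep, propose_next_step, is_task_complete)
# are hypothetical, not any real library's API.
from dataclasses import dataclass

@dataclass
class PlanStep:
    goal: str           # what this step is meant to achieve, in plain language
    code: str           # the concrete action, written as reviewable, executable code
    justification: str  # why the agent claims this step serves the task

def run_agent(llm, execute, task: str, max_steps: int = 10) -> list[PlanStep]:
    """Run the agent and return the full structured trace of its reasoning."""
    trace: list[PlanStep] = []
    for _ in range(max_steps):
        # The model proposes its next step as structured data, not free-form
        # hidden reasoning; this is what makes the chain of thought legible.
        step = llm.propose_next_step(task, trace)      # hypothetical call
        result = execute(step.code)
        trace.append(step)
        if llm.is_task_complete(task, trace, result):  # hypothetical call
            break
    return trace  # an auditable record of what was done and why
```

A human or an automated monitor can then read the trace step by step; the record is the reasoning rather than a summary of it, which is roughly what I mean by faithfulness here.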
Then I saw implications that those steps are going to take a lot of time, and the note that of course we die if we go straight to AGI. Oh dang, back to thought #1: Connor will be building perfect AGI while somebody else rushes to get there first, and a bunch of other capable, motivated people are going to follow him instead of trying to align the AGI we are very likely going to get.
The narrow path you reference is narrow indeed. I have not heard a single plan for pause that comes to grips with how difficult it would be to enforce internationally, and the consequences of Russia and China pushing for AGI while we pause. It might be our best chance, but we haven’t thought it through.
So I for one really wish Connor and Conjecture would put their considerable talents toward what seems to me like a far better “out”: the possibility that foundation model-based AGI can be aligned well enough to follow instructions even without perfect alignment or a perfectly faithful chain of thought.
I realize you currently think this isn’t likely to work, but I can’t find a single place where this discussion is carried to its conclusion. It really looks like we simply don’t know yet. All of the discussions break down into appeals to intuition and frustration with those from the “opposing camp” (of the nicest, most rational sort when they’re on LW). And how likely would it have to be to beat the slim chance we can pause ASI development for long enough?
This is obviously a longer discussion, but I’ll make just one brief point about why that might be more likely than many assume. You appear to be assuming (I’m sure with a good bit of logic behind it) that we need our AGI to be highly aligned for success—that if network foundation models do weird things sometimes, that will be our doom.
Making a network into real AGI that’s reflective, agentic, and learns continuously introduces some new problems. But it also introduces a push toward coherence. Good humans can have nasty thoughts and not act on them or be corrupted by them. A coherent entity might need to be only 51% aligned, not the 99.99% you’re shooting for. Particularly if that alignment is strictly toward following instructions, so there’s corrigibility and a human in the loop.
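To make the instruction-following-with-a-human-in-the-loop picture concrete, here is a purely illustrative sketch (the agent.plan call and every other name here is made up, not anyone’s real system): the human issues instructions, sees each proposed action before it runs, and can revise the instruction or shut the loop down at any point.

```python
# Purely illustrative: a corrigible, instruction-following loop where the human
# retains approval and an off-switch at every step. All names are hypothetical.
def instruction_following_loop(agent, get_human_input, execute):
    instruction = get_human_input("Instruction (or 'stop'): ")
    while instruction != "stop":
        proposed_action = agent.plan(instruction)        # hypothetical planning call
        print(f"Proposed action:\n{proposed_action}")
        decision = get_human_input("approve / revise / stop: ")
        if decision == "approve":
            print(f"Result: {execute(proposed_action)}")
            instruction = get_human_input("Next instruction (or 'stop'): ")
        elif decision == "revise":
            instruction = get_human_input("Revised instruction: ")
        else:
            break  # the human keeps a hard off-switch: corrigibility as a veto
```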
Some of this is in Internal independent review for language model agent alignment and Instruction-following AGI is easier and more likely than value aligned AGI, but I haven’t made that coherence point clearly. I think it’s another crux of disagreement on alignment difficulty that I missed in that writeup, and one that hasn’t been resolved.
Edit: it seems like a strategy could split the difference by doing what you’re describing, but accelerating much faster if you thought agent coherence could take care of some alignment slop.
I for one don’t want to die while sticking to principles and saying “I told you so” when we’re close to doom; I want to take our best odds of survival, which seems to include really clarifying which problems we need to solve.
Thanks for the comment! I agree that we live in a highly suboptimal world, and I do not think we are going to make it, but it’s worth taking our best shot.
I don’t think of the CoEm agenda as “doing AGI right” (for one, it is not even an agenda for building AGI/ASI, but for bounding ourselves below that). Doing AGI right would involve solving problems like P vs PSPACE, developing a vastly deeper understanding of Algorithmic Information Theory, and much more advanced formal verification of programs. If I had infinite budget and 200 years, the plan would look very different, and I would feel very secure in humanity’s future.
Alas, I consider CoEm an instance of a wider class of possible alignment plans that represent the “bare minimum for Science to work.” I generally think any plans more optimistic than this require some other external force making things go well, which might be empirical facts about reality (LLMs are just nice because of some deep pattern in physics) or metaphysics (there is an actual benevolent creator god intervening specifically to make things go well, or Anthropic Selection is afoot). Many of the “this is what we will get, so we have to do this” type arguments just feel like cope to me, rather than first-principles thinking of “if my goal is a safe AI system, what is the best plan I can come up with that actually outputs safe AI at the end?” It’s reactive versus constructive planning. Of course, in the real world, it’s tradeoffs all the way down, and I know this. You can read some of my thoughts about why I think alignment is hard and current plans are not on track here.
I don’t consider this agenda to be maximally principled or aesthetically pleasing; quite the opposite, it feels like a grubby engineering compromise that simply meets the minimum requirement for actually doing science in a non-insane way. There are of course various even more compromising positions, but I think those simply don’t work in the real world. I think the functionalist/teleological/agent-based frameworks currently being applied to alignment work on LW are just too confused to ever really work in the real world, in the same way that I think the models of alchemy can just never actually get you to a safe nuclear reactor: you need to at least invent calculus (or hell, at least better metallurgy!) and do actual empiricism and stuff.
As for pausing and governance, I think governance is another mandatory ingredient of a good outcome; most of the work I am involved with there happens through ControlAI and their plan “A Narrow Path”. I am under no illusion that these political questions are easy to solve, but I do believe they are possible and necessary to solve, and I have a lot of illegible inside info and experience here that doesn’t fit into a LW comment. If there is no mechanism by which reckless actors are prevented from killing everyone else by building doomsday machines, we die. All the technical alignment research in the world is irrelevant to this point. (And “pivotal acts” are an immoral pipe dream.)