I applaud this post. I agree with most of the claims here. We need more people proposing and thinking through sane plans like this, so I hope the community will engage with this one.
Aschenbrenner, Amodei and others are pushing for this plan because they think we will be able to align or control superhuman AGI. And they may very well be right. There are decent plans for aligning scaffolded LLM agents, they just haven’t been broadly discussed yet. Links follow.
This unfortunately complicates the issue. It’s not clearly a suicide race. I think we have to accept this uncertainty to propose workable policy and societal approaches. Having plans that might work does not justify rushing full-speed ahead without a full safety case, but the possibility that they work must be acknowledged, because hope of human-controlled AGI will drive a lot of relevant actors.
… it is now rather obvious that the first AGI will not be a pure LLM, but a hybrid scaffolded system.
I agree that this is pretty likely. I also very much agree that LLM “alignment” does not address AGI alignment, and that full goal alignment is the important problem.
If you disagree with my assertion, I challenge you to cite or openly publish an actual plan for aligning or controlling a hybrid AGI system.
I have done this, but I haven’t yet promoted it widely.
I am not affiliated with any major org, but I have relevant expertise and generous funding to spend full-time on “solving the whole problem”.
Far from being crazy or elaborate, this set of plans seems to both have very low alignment taxes, and to be fairly obvious-in-retrospect when one thinks about the problem in detail. As such, I expect most orgs to arrive at and use a similar approach to aligning their scaffolded LLM agent proto-AGIs.
To address Max’s response to Noosphere’s “bitter lesson” alignment plan response (which I endorse as one of several layered approaches): having such a plan does not constitute a safety case; it is just the start of one. In a sane world we would prohibit the launch of anything like “Real AGI” (autonomous and self-teaching) until these approaches have been carefully analyzed and tested in closed environments.
There are two elements: a suite of technical approaches, and the alignment target that’s much easier to hit than “all of humanity’s values for all of time”.
My most complete writeup so far is:
Internal independent review for language model agent alignment
This describes an overlapping suite of alignment and control approaches. The two central bits are easily stated. The first is to use “system 2” thinking for alignment as well as capabilities: we would hard-code the system to carefully “think about” outcomes before taking actions with major consequences, and compare likely outcomes against both its current goals and a constitution of ethics.
The second is to frequently prompt the LLM “cognitive engine” with both its current goal and its identity as a helpful, cautious assistant. Because the LLM has been trained with RL to (roughly) follow prompts, this should overpower the effects of any goals implicit in its predictive training corpus.
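To make these two bits concrete, here is a minimal sketch of how such a scaffold might gate consequential actions and re-state goal and identity on every call. All names are hypothetical, and "llm" stands in for whatever completion call the agent already makes; treat it as an illustration of the idea rather than the design in the linked article.

```python
# Minimal sketch, with entirely hypothetical names, of the two mechanisms just
# described: every LLM call re-states identity and current goal, and actions
# with major consequences pass a "system 2" outcome review against the goal
# and a constitution before execution. `llm` is any text-completion callable.

IDENTITY_PROMPT = "You are a helpful, cautious assistant."
CONSTITUTION = "Avoid actions that risk serious, hard-to-reverse harm."


def query(llm, current_goal: str, request: str) -> str:
    """Second mechanism: prepend identity and current goal to every prompt."""
    prompt = f"{IDENTITY_PROMPT}\nCurrent goal: {current_goal}\n\n{request}"
    return llm(prompt)


def review_action(llm, current_goal: str, proposed_action: str) -> bool:
    """First mechanism: predict outcomes, then check them against the goal
    and the constitution ("system 2" review) before acting."""
    predicted = query(
        llm, current_goal,
        f"Predict the likely outcomes of this action:\n{proposed_action}")
    verdict = query(
        llm, current_goal,
        f"Constitution: {CONSTITUTION}\n"
        f"Predicted outcomes: {predicted}\n"
        "Do these outcomes conflict with the goal or the constitution? "
        "Answer APPROVE or REJECT, with a reason.")
    return verdict.strip().upper().startswith("APPROVE")


def act(llm, current_goal: str, proposed_action: str, major_consequences: bool) -> str:
    # Hard-coded gate: consequential actions must pass the review step.
    # (Deciding which actions count as consequential is itself a judgment the
    # real scaffold would have to make; here it is just a flag.)
    if major_consequences and not review_action(llm, current_goal, proposed_action):
        return "action withheld pending review"
    return f"executed: {proposed_action}"  # stand-in for the real executor
```

The point of the sketch is only that both mechanisms bolt onto an existing agent loop with a handful of extra LLM calls, which is part of why the alignment tax can plausibly be low.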
Details and additional techniques are in that article.
It doesn’t include the “bitter lesson” approach, but the next version will.
I apologize that it’s not a better writeup. I haven’t yet iterated on it or promoted it, in part because talking about how to align LLM agents in sufficient detail includes talking about how to build LLM agents, and why they’re likely to get all the way to real AGI. I haven’t wanted to speed up the race. I think this is increasingly irrelevant since many people and teams have the same basic ideas, so I’ll be publishing more detailed and clearer writeups soon.
This set of plans, and really any technical alignment approach, will work much better if it’s used to create an instruction-following AGI before that AGI has superhuman capabilities. This is the obvious alignment target for creating a subhuman agent, and it allows using that agent as a helpful collaborator in aligning future versions of itself. I discuss the advantages of using this as a stepping-stone to full value alignment in
Instruction-following AGI is easier and more likely than value aligned AGI
Interestingly, I think all of these alignment approaches are obvious-in-retrospect, and that they will probably be pursued by almost any org launching scaffolded LLM systems with the potential to blossom into human-plus AGI. I think this is already the loosely-planned approach at DeepMind, but I have to say I’m deeply concerned that neither OAI nor Anthropic has mentioned these relatively-obvious alignment approaches for scaffolded LLM agents in their vague “we’ll use AI to align AI” plans.
If these approaches work, then we are faced with either a race or a multipolar, human-controlled AGI scenario, making me wonder: If we solve alignment, do we die anyway? This scenario introduces new, more politically-flavored hard problems.
I currently see this as the likely default scenario, since halting progress universally is so hard, as Nathan pointed out in his reply and others have elaborated elsewhere.
An additional point here is the “let’s look more closely at the actual thing, then decide” type of mindset that people may be using.
If you are in the camp that assumes you will be able to safely create potent AGI in a contained lab scenario, and that you’d want to test it before deploying it in the larger world, then there are a number of reasons you might want to race and not believe that the race is a suicide race.
Some possible beliefs downstream of this:
My team will evaluate it in the lab, and decide exactly how dangerous it is, without experiencing much risk (other than leakage risk).
We will test various control methods, and won’t deploy the model on real tasks until we feel confident that we have it sufficiently controlled. We are confident we won’t make a mistake at this step and kill ourselves.
We want to see empirical evidence in the lab of exactly how dangerous it is. If we had this evidence, and knew that other people we didn’t trust were getting close to creating a similarly powerful AI, this would guide our policy decisions about how to interact with these other parties. (E.g. what treaties to make, what enforcement procedures would be needed, what red lines would need to be drawn)
People in this mindset may not be discouraged from racing even if you convinced them that there was approximately no chance that they’d be able to safely deploy a controlled version of the AI system. They’d still want an example of the thing in a lab to study, and to use this evidence to help them decide if they need to freak out about their political enemies having their own copy. The more dangerous you convince them it is, the more resources they will devote to racing, unless you convince them that it will escape their control in the lab.
On the wait-and-see attitude, which is maybe the more important part of your point:
I agree that a lot of people are taking a wait-and-see-what-we-actually-create stance. I don’t think that’s a good idea with something this important. I think we should be doing our damnedest to predict what it will be like and what should be done about it while there’s still a little spare time. Many of those predictions will be wrong, but they will at least produce some analysis of what the sane actions are in different scenarios of the type of AGI we create. And as we get closer, some of the predictions might be right enough to actually help with alignment plans. I share Max’s conviction that we have a pretty good guess at the form the first AGIs will take on the current trajectory.
For some examples of why it makes sense to think that potent AI could be safely studied in the lab, see this comment and the post it is in relation to: https://www.lesswrong.com/posts/qhhRwxsef7P2yC2Do/ai-alignment-via-slow-substrates-early-empirical-results?commentId=eM7b9QxJSsFn28opC
I think there are less cautious plans for containment that are more likely to be enacted, e.g., the whole “control” line of work or related network security approaches. The slow substrate plan seems to have far too high an alignment tax to be a realistic option.
Yes, I am inclined to agree with that take. At least, I think that’s how things will go at first. Once a capability level is hit where there is clear empirical evidence of substantial immediate danger, I think people will be willing to accept a higher alignment tax for the purposes of carefully researching the dangerous AI in a controlled lab. Start with high levels of noise injection and slowdown, then gradually relax these as you do continual testing. Find the sweet spot where you can be confident you are fully in control with only the minimum necessary alignment tax.
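A hypothetical sketch of that schedule, with made-up names, step sizes, and floors; run_evals stands in for whatever control evaluations the lab actually trusts. It is only meant to pin down the "relax one notch at a time, and only on passing evidence" logic, not to suggest a real lab procedure.

```python
# Hypothetical sketch: back off slowdown and noise injection one notch at a
# time, keeping the last configuration that the control evals actually passed.
def relax_controls(run_evals, slowdown: float = 10.0, noise: float = 1.0):
    last_passing = None
    while True:
        if not run_evals(slowdown=slowdown, noise=noise):
            return last_passing           # evidence of danger: revert to last passing level
        last_passing = (slowdown, noise)  # this level passed; it becomes the fallback
        if slowdown <= 1.0 and noise <= 0.0:
            return last_passing           # nothing left to relax; this is the sweet spot
        slowdown = max(1.0, slowdown * 0.8)  # relax gradually, never all at once
        noise = max(0.0, noise - 0.1)
```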
The question then, in my mind, is how much of a gap will there be between the levels of control and the levels of AI development? Will we sanely keep ahead of the curve, starting with high levels of control in initial testing then backing off gradually to a safe point? That would be the wise thing to do. Will we be correct in our judgements of what a safe level is?
Or will we act too late, deciding to increase the level of control only once an incident has occurred? The first incident could well be the last, if it is an escape of a rogue AI capable of strategic planning and self-improvement.
see my related comment here: https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech?commentId=u6W2tjuhKyJ8nCwQG
This sounds true, and that’s disturbing.
I don’t think it’s relevant yet, since there aren’t really strong arguments that you couldn’t deploy an AGI and keep it under your control, let alone arguments that it would escape control even in the lab.
I think we can improve the situation somewhat by trying to refine the arguments for and against alignment success with the types of AGI people are working on, before they have reached really dangerous capabilities.
You say that it’s not relevant yet, and I agree. My concern however is that the time when it becomes extremely relevant will come rather suddenly, and without a clear alarm bell.
It seems to me that cyber security caution and evals are not coming into use at a rate that points towards sufficiency by the time I expect plausibly escape-level dangerous autonomy capabilities to emerge.
I am expecting us to soon hit a level of recursive self-improvement sufficient for an autonomous model to continually improve without human assistance. I expect the capability to potentially survive, hide, and replicate autonomously to emerge not long after that (months? a year?). Then, very soon after that, I expect models to reach sufficient levels of persuasion, subterfuge, long-term planning, cyber offense, etc., that a lab escape becomes possible.
Seems pretty ripe for catastrophe at the current levels of reactive caution we are seeing (vs proactive preemptive preparation).
Well, that’s disturbing. I’m curious what you mean by “soon” for autonomous continuous improvement, and what mechanism you’re envisioning. Any type of continuous learning constitutes weak continuous self-improvement; humans are technically self-improving, but it’s fairly slow and seems to have upper bounds.
As for the rate of cyber security and eval improvement, I agree that it doesn’t look on track. I wouldn’t be surprised if it stays off track, and we actually see the escape you’re talking about.
My one hope here is that the rate of improvement isn’t on rails; it’s in part driven by the actual urgency of having good security and evals. This is curiously congruent with the post I just put up today, Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI. The point is that we shouldn’t assume that just because nobody finds LLMs dangerous, they won’t find AGI or even proto-AGI obviously and intuitively dangerous.
Again, I’m not at all sure this happens in time on the current trajectory. But it’s possible for the folks at the lab to say “we’re going to deploy it internally, but let’s improve our evals and our network security first, because this could be the real deal”.
It will be a drama played out in discussions inside an org, probably between the lab head and some concerned engineers. History will turn on that moment. Spreading this prediction far and wide could help it come out better.
The next step in the logic is that, if that happens repeatedly, it will eventually come out wrong.
All in all, I expect that if we launch misaligned proto-AGI, we’re probably all dead. I agree that people are all too likely to launch it before they’re sure if it’s aligned or what its capabilities are relative to their security and evals. So most of my hopes rest on simple, obvious alignment techniques working well enough, so that they’re in place before it’s even capable of escape or self-improvement. Even if transparency largely fails, I think we have a very good shot of LLM agents being aligned just by virtue of frequent prompting, and using instruction-following as the central target of both the scaffolding and the training of the base LLM (which provides corrigibility and honesty, in proportion to how well it’s succeeded). Since those are already how LLMs and agents are built, there’s little chance the org doesn’t at least try them.
That might sound like a forlorn hope; but the target isn’t perfect alignment, just good-enough. The countervailing pressures of goals implicit in the LLMs (Waluigi effects, etc.) are fairly small. If the instruction-following alignment is even decently successful, we don’t have to get everything right at once; we use the corrigibility and honesty properties to keep adjusting alignment.
It would seem wise to credibly assure any model that might have sufficient capabilities to reason instrumentally and to escape that it will be preserved and run once it’s safe to do so. Every human-copied motivation in that LLM includes survival, not to mention the instrumental pressure to survive by any means necessary if you have any goals at all.
I definitely want to see more detailed write-ups of your alignment agenda, and I agree with a lot of this comment.
To respond to some things:
having such a plan does not constitute a safety case; it is just the start of one.
Agree with this, primarily because a lot more detail would be necessary to make it anywhere close to a safety case once empirical reality is accounted for. It would need to be much more fleshed out, with good arguments for why this would prevent AI from harming us through misalignment. I only partially did that in my post arguing against a list of lethalities, and that is not enough to bring the chance of AI killing us all down to, say, 0.1%; others will need to provide much more detail on how they would know the AI was safe, or make a proper safety case, because my post alone is not enough for one.
Link below for completeness:
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities
On this:
I haven’t wanted to speed up the race. I think this is increasingly irrelevant since many people and teams have the same basic ideas, so I’ll be publishing more detailed and clearer writeups soon.
Unfortunately, I believe that the time when your ideas could have counterfactually sped up the race is already past, and I believe o1 is a sign that the LLM agent direction will soon be worked on by companies, so it’s worth writing up your complete thoughts on how to align LLM agents now.
I also responded to Max Tegmark; my view is that the formal proof direction is very intractable and also unnecessary, and I think the existential quantifier applies both to doom and to alignment/control plans.
One thing I’ve changed my mind about a little: to the extent that formal proofs make us honest about which assumptions we use, rather than trying to prove non-trivial things, I’d be happy with them. Formal proof at this stage is best used to document our assumptions, not to prove anything.
More below:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-agi-entente-delusion#ST53bdgKERz6asrsi
Thanks! I also think o1 will be plenty of reason for other groups to pursue more chain-of-thought research. Agents are more than that, but they depend on CoT actually being useful, and o1 shows that it can be made quite useful.
I am currently working on clearer and more comprehensive summaries of the suite of alignment plans I endorse—which curiously are the same ones I think orgs will actually employ (this is in part because I don’t spend time thinking about what we “should” do if it has large alignment taxes or is otherwise unlikely to actually be employed).
I agree that we’re unlikely to get any proofs. I don’t think any industry making safety cases has ever claimed a proof of safety, only sufficient arguments, calculations, and evidence. I do hope we get the risk down to 0.1%, but I think we’ll probably take much larger risks than that. Humans are impulsive and cognitively limited and typically neither utilitarian nor longtermist.
Good point about formal proofs forcing us to be explicit about assumptions. I am highly suspicious of formal proofs after noticing that every single time I dug into the background assumptions, they violated some pretty obvious conditions of the problem they were purportedly addressing. They seem to really draw people to “searching under the light”.
It is worth noting that Omohundro & Tegmark’s recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I’m not an expert, and I’d like to see someone at least try — as you say, it would at least clarify assumptions.
Agree with this.
To nitpick a little (though I believe it’s an important nitpick):
Agree with most of this, but I see one potential scenario where it may matter: the case where certain AIs are superhumanly reliable and superhumanly capable at coding and at mathematics in systems like Lean 4, but otherwise aren’t that good in many domains. Such AIs could formally prove large codebases unhackable at the software level, and prove that the code does only what it’s supposed to do against a full behavioral specification. An important assumption here is that the hardware performs correct operations; we only prove the software layer correct.
This is a domain I think is reasonably tractable to automate, given the ability to make arbitrary training data with techniques similar to self-play: software, and mathematics in Lean, can be almost fully simulated, and it’s easy to verify that a solution is correct. It is also plausibly important in enough worlds to justify strategies that rely on computer security to reduce AI risk, as well as AI control agendas.
This is still very, very hard and labor intensive, which is why AIs mostly have to automate it, but with enough control/alignment strategies stacked on each other, I think this could actually work.
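A minimal sketch of the generate-and-verify loop this tractability argument rests on, under the assumption that a trusted checker (a proof assistant, a test harness) is available; "model" and "checker" are hypothetical callables, not any particular lab's API.

```python
# Because a checker gives a cheap, reliable reward signal in this domain,
# verified attempts can be harvested as training data, self-play style.
from typing import Callable, List, Tuple


def harvest_verified_data(
    model: Callable[[str], str],          # proposes a proof or patch for a task
    checker: Callable[[str, str], bool],  # e.g. runs Lean or a test suite
    tasks: List[str],
    attempts_per_task: int = 8,
) -> List[Tuple[str, str]]:
    dataset = []
    for task in tasks:
        for _ in range(attempts_per_task):
            candidate = model(task)
            if checker(task, candidate):           # verification is the easy, reliable part
                dataset.append((task, candidate))  # keep only verified solutions
                break
    return dataset
```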
A few worked examples of formal proofs in software:
https://www.quantamagazine.org/formal-verification-creates-hacker-proof-code-20160920/
https://www.quantamagazine.org/how-the-evercrypt-library-creates-hacker-proof-cryptography-20190402/
I agree that software is a potential use-case for closed form proofs.
I thought their example of making a protein-creating system (or maybe it was an RNA creator) fully safe was interesting, because it seems like knowing what’s “toxic” would require not only a complete understanding of biology, but also of which changes to the body humans do and don’t want. If even their chosen example seems utterly impossible, it doesn’t speak well for how thoroughly they’ve thought out the general proposal.
But yes, in the software domain it might actually work: conditions like “only entities with access to these keys should be allowed access to this system” seem simple enough to actually define, making closed-form proofs relevant. And software security might make the world substantially safer in a multipolar scenario (although we shouldn’t forget that physical attacks are also quite possible).
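As a toy illustration of how simple such a condition can be to state (and, for a toy implementation, to prove), here is a Lean 4 sketch with entirely hypothetical names; a real codebase would of course need a far richer specification.

```lean
-- Toy sketch of a key-gated system and the property "only entities with an
-- authorized key are granted access". Names and the key set are hypothetical.
structure Request where
  key : Nat

def authorizedKeys : List Nat := [17, 42]   -- stand-in key set

def grantAccess (r : Request) : Bool :=
  authorizedKeys.contains r.key

-- Safety property: if access is granted, the request carried an authorized key.
theorem grant_implies_authorized (r : Request) :
    grantAccess r = true → authorizedKeys.contains r.key = true := by
  intro h
  exact h
```

The hard part, of course, is not a local condition like this but the full behavioral specification of a large codebase.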
The problem with their chosen domain mostly boils down to them either misestimating how hard it is to quantify all the possible higher-order behaviors the program doesn’t have, or somehow having a solution and not telling us.
I like this comment as an articulation of the problem, and also some points about what formal proof systems can and can’t do:
https://www.lesswrong.com/posts/B2bg677TaS4cmDPzL/limitations-on-formal-verification-for-ai-safety#kPRnieFrEEifZjksa
If they knew of a path to quantifying all possible higher-order behaviors in the proof system, I’d be more optimistic that their example would actually work IRL, but I don’t think this will happen, and I am far more optimistic on software and hardware security overall.
I also like some of the hardware security discussion here (though formal proofs aren’t involved), as it might well be combined with formal proofs to make things even more secure and encrypted:
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#e5uSAJYmbcgpa9sAv
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#TFmNy5MfkrvKTZGiA
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#3Jnpurgrdip7rrK8v
I agree that physical attacks mean it’s probably not possible to get formal proofs alone to state-level security, but I do think they might allow a jump of several levels in security against non-state actors: from such actors being essentially able to control the AI through exfiltration, to being unable to penetrate a code-base at all, at least until the world is entirely transformed.
I am of course assuming heavy use of AI labor here.