First problem: a lot of future gains may come from RL-style self-play (i.e., letting the AI play around solving open-ended problems). That’s not safe in the way you outline above.
The other standard objection is that even if the initial AGI is safe, people will do their best to jailbreak the hell out of that safety, and they will succeed.
That’s a problem when put together with selection pressure for bad agentic AGIs, since they can use sociopathic strategies good AGIs will not use (scamming, hacking, violence, etc.). In other words, natural selection goes to work and the results blow up in our face.
Short of imposing very stringent unnatural selection on the initial AGIs to come, the default outcome is something nasty emerging. Do you trust the AGI to stay aligned when faced with all the bad actors out there?
Note: my P(doom) = 30%. The remaining 70% P(~doom) depends on either a good AGI executing one of the immoral strategies to pre-empt a bad AGI (50%) or scaling somehow just fixing alignment (20%).
>First problem, A lot of future gains may come from RL style self play (IE:let the AI play around solving open ended problems)
How do people see this working? I understand the value of pointing to AI dominance in Chess/Go as illustrating how we should expect AI to recursively exceed humans at tasks, but I can’t see how RL would be similarly applied to “open-ended problems” to promote similar explosive learning. What kind of open problems with a clear and instantly discernible reward function would promote AGI growth, rather than a narrower type of growth geared towards solving the particular problem well?
Note: This is an example of how to do the bad thing (extensive RL fine tuning/training). If you do it the result may be misalignment, killing you/everyone.
To name one good and very relevant example: programming, specifically having the AI complete small, easy-to-verify tasks.
The general pattern is to take existing, horribly bloated software/data and extract useful subproblems from it (e.g., find the parts of this code that take the most time), then turn those into problems for the AI to solve (e.g., here is a function plus examples of it being called; make it faster). Ground-truth metrics would be simple things that are easy to measure (e.g., execution time, code quality/smallness, code coverage, whether the output is the same). Credit assignment for sub-task usefulness can then be handled by an expected-value estimator trained on that ground truth, as is done in traditional game-playing RL. Possibly it’s just one AI with different prompts.
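To make the "here is a function plus examples of it being called, make it faster" task concrete, here is a minimal sketch of what a ground-truth reward could look like. Everything in it is an assumption for illustration: the name `speedup_reward`, the idea of gating on identical outputs before measuring time, and the toy `slow_sum_to`/`fast_sum_to` functions are not from any real training pipeline.

```python
import timeit
from typing import Callable, Sequence


def speedup_reward(
    reference: Callable,
    candidate: Callable,
    recorded_calls: Sequence[tuple],
    repeats: int = 5,
) -> float:
    """Hypothetical reward: 0.0 if any output differs, else reference_time / candidate_time."""
    # Correctness gate ("is the output the same?"): any mismatch or crash scores zero.
    for args in recorded_calls:
        try:
            if candidate(*args) != reference(*args):
                return 0.0
        except Exception:
            return 0.0

    def best_time(fn: Callable) -> float:
        # Replay every recorded call once per run; keep the fastest run to reduce noise.
        def run() -> None:
            for args in recorded_calls:
                fn(*args)
        return min(timeit.repeat(run, number=1, repeat=repeats))

    # A value above 1.0 means the candidate was faster than the reference.
    return best_time(reference) / max(best_time(candidate), 1e-9)


# Toy "extracted subproblem": sum the integers 0..n.
def slow_sum_to(n: int) -> int:
    total = 0
    for i in range(n + 1):
        total += i
    return total


def fast_sum_to(n: int) -> int:
    return n * (n + 1) // 2  # closed form, same output


def wrong_sum_to(n: int) -> int:
    return n * n  # fast but incorrect, so it should earn nothing


calls = [(20_000,), (50_000,)]
print(speedup_reward(slow_sum_to, fast_sum_to, calls))
print(speedup_reward(slow_sum_to, wrong_sum_to, calls))
```

Note that wall-clock timing is noisy and the recorded calls may not cover every input, so a reward like this is exactly the kind of metric a strong optimizer could Goodhart.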
Basically, Microsoft takes all the repositories on GitHub that build successfully and have some unit tests, and builds an AI-augmented pipeline to extract problems from that software. Alternatively, a large company that runs lots of code takes snapshots plus IO traces of production machines and derives examples from those. You need code in the wild doing its thing.
Some example sub-tasks in the domain of software engineering:
make a piece of code faster
make this pile of code smaller
is f(x) == g(x)? If not, find a counterexample (useful for grading the above)
find a vulnerability and write an exploit
fix the bug while preserving functionality
identify invariants/data structures/patterns in memory (e.g., linked lists, reference counts); this is useful as a building block for further tasks (e.g., finding use-after-free bugs)
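The "is f(x) == g(x)? If not, find a counterexample" sub-task is the easiest of these to sketch. The version below uses plain random testing, which is my assumption, not something the list specifies; it can find counterexamples but can never prove equivalence. The helper names (`find_counterexample`, `small_int_pair`) and the toy `max` implementations are hypothetical.

```python
import random
from typing import Callable, Optional, Tuple


def find_counterexample(
    f: Callable,
    g: Callable,
    gen_input: Callable[[random.Random], Tuple],
    trials: int = 10_000,
    seed: int = 0,
) -> Optional[Tuple]:
    """Return an argument tuple where f and g disagree, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = gen_input(rng)
        if f(*args) != g(*args):
            return args
    return None  # no disagreement found; this is evidence, not a proof


def small_int_pair(rng: random.Random) -> Tuple[int, int]:
    # Hypothetical input generator: pairs of small integers.
    return (rng.randint(-50, 50), rng.randint(-50, 50))


def reference_max(a: int, b: int) -> int:
    return a if a >= b else b


def good_max(a: int, b: int) -> int:
    return (a + b + abs(a - b)) // 2  # branch-free rewrite, same result


def buggy_max(a: int, b: int) -> int:
    return a if a > 0 else b  # wrong whenever 0 < a < b or b < a <= 0


print(find_counterexample(reference_max, good_max, small_int_pair))
print(find_counterexample(reference_max, buggy_max, small_int_pair))
```

A grader built this way only gives a trustworthy "not equal" signal; a "no counterexample found" result still leaves room for the model to slip in behaviour changes on inputs the generator never produces.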
GPT-4 can already use a debugger to solve a dead-simple reverse-engineering problem, albeit stupidly[1]: https://arxiv.org/pdf/2303.12712.pdf#page=119
Larger problems could be approached by identifying useful instrumental subgoals once the model can actually perform them reliably.
The finished system should be able to extend shoggoth tentacles into a given computer, identify what that computer is doing and make it do it better or differently.
The finished system might be able to extend shoggoth tentacles into other things too (e.g., embedded systems, FPGAs)! Capability limitations would stem from the need for fast feedback, so software, electronics and programmable hardware should be solvable. For other domains, simulation can help (limited by simulation fidelity and Goodharting). The eventual result is a general-purpose engineering AI.
Tasks heavily dependent on human judgement (e.g., is this a good book? is this action immoral?) have obviously terrible feedback cost/latency and so scale poorly. This is a problem if we want the AI to not do things a human would disapprove of.
[1] RL training could lead to a less grotesque solution, i.e., just reading the password from memory using the debugger rather than writing a program to repeatedly run the executable and brute-force the password.
>The finished system should be able to extend shoggoth tentacles into a given computer, identify what that computer is doing and make it do it better or differently.
Sure. GPT-X will probably help optimize a lot of software. But I don’t think having more resource efficiency should be assumed to lead to recursive self-improvement beyond where we’d be given a “perfect” use of current software tools. Will GPT-X be able to break out of that current set of tools, having only been trained to complete text and not to actually optimize systems? I don’t take this for granted, and my view is that LLMs are unlikely to devise radically new software architectures on their own.
<rant>It really pisses me off that the dominant “AI takes over the world” story is more or less “AI does technological magic”: nanotech assemblers, superpersuasion, basilisk hacks and more. Skeptics who doubt this are met with “well, if it can’t, it just improves itself until it can”. The skeptics’ obvious rebuttal that RSI seems like magic too is not usually addressed.</rant>
Note: RSI is, in my opinion, an unpredictable black swan. My belief is that RSI will yield somewhere between a 1.5x and 5x speed improvement to a nascent AGI from improvements in GPU utilisation and sparsity/quantisation, with significant cognition spent to achieve those speedups. AI is still dangerous in worlds where RSI does not occur.
Self-play generally gives superhuman performance (Go, chess, etc.), even in more complicated imperfect-information games (Dota, StarCraft). Turning a field of engineering into a self-playable game likely leads to superhuman (80%), top-human-equivalent (18%), or unchanged (2%) capabilities in that field. Superhuman or top-human software engineering (vulnerability discovery and programming) is one relatively plausible path to AI takeover.
https://googleprojectzero.blogspot.com/2023/03/multiple-internet-to-baseband-remote-rce.html
Can an AI take over the world if it can:
do end to end software engineering
find vulnerabilities about as well as the researchers at project zero
generate reasonable plans on par with a +1 SD intelligence human (i.e., not the Hollywood-style movie plots GPT-4 seems fond of)
AI does not even need to be superhuman to be an existential threat. Hack >95% of devices, extend shoggoth tentacles, hold all the data/tech hostage, present as not Skynet so humans grudgingly cooperate, build robots to run the economy (some humans will even approve of this), kill all humans, done.
That’s one of the easier routes, assuming the AI can scale vulnerability discovery. With just software engineering and a bit of real-world engineering (potentially outsourceable), other violent/coercive options could work, albeit with more failure risk.
Math problems, physical problems, doing stuff in simulations, playing games.
RL isn’t magic though. It works in the Go case because we can simulate Go games quickly, easily score the results, and then pit adversarial AIs against each other in order to iteratively learn.
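For readers who have not seen it spelled out, the loop described above (simulate quickly, score the result, let the agent play itself, repeat) fits in a few lines for a toy game. This is only a sketch under my own assumptions: the game is single-pile Nim (take 1 to 3 stones, whoever takes the last stone wins) rather than Go, the learner is a plain lookup table rather than a neural network, and all names and hyperparameters are made up for illustration.

```python
import random
from collections import defaultdict
from typing import DefaultDict, List, Tuple

QTable = DefaultDict[Tuple[int, int], float]


def legal_moves(pile: int) -> List[int]:
    return [take for take in (1, 2, 3) if take <= pile]


def best_move(q: QTable, pile: int) -> int:
    # Greedy choice for the player to move, according to the current estimates.
    return max(legal_moves(pile), key=lambda take: q[(pile, take)])


def train_self_play(
    start_pile: int = 10,
    episodes: int = 5000,
    epsilon: float = 0.2,
    alpha: float = 0.1,
    seed: int = 0,
) -> QTable:
    rng = random.Random(seed)
    # q[(pile, take)] estimates the outcome (+1 win, -1 loss) for the player making that move.
    q: QTable = defaultdict(float)
    for _ in range(episodes):
        pile = start_pile
        history: List[Tuple[int, int]] = []
        # Both sides are played by the same table: this is the self-play part.
        while pile > 0:
            if rng.random() < epsilon:
                take = rng.choice(legal_moves(pile))  # explore
            else:
                take = best_move(q, pile)  # exploit
            history.append((pile, take))
            pile -= take
        # Scoring is trivial: whoever made the last move took the last stone and won.
        reward = 1.0
        for state_action in reversed(history):
            q[state_action] += alpha * (reward - q[state_action])
            reward = -reward  # the previous move belonged to the other player
    return q


q_table = train_self_play()
for pile in range(1, 11):
    print(pile, best_move(q_table, pile))
```

The whole thing works only because a full game takes microseconds and the winner is unambiguous, which is the property in question for open-ended tasks.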
I don’t think this sort of process lends itself to the sort of tasks that we can only see an AGI accomplishing. You can’t train it to, say, write a better version of Winds of Winter than GRRM could, because you don’t have a good algorithm to score each iteration.
So what I’m really trying to ask is what specific sort of open ended problems do we see being particularly conducive to fostering AGI, as opposed to a local maximizer that’s highly specialized towards the particular problem?
A generality maximizer, where the machine has a large set of “skills” it has learned on many different tasks, can perform well on zero-shot, untrained tasks. This was seen in PaLM-E and GPT-4.
A machine that can do a very large number of tasks that are evaluatable, and at least do ok by mimicking the average human or by weighting the text it learned from by scoring estimates is still an AGI.
I think you moved the goalposts from “machine as capable as an average human”, or even “capable as a top 1 percent human and superintelligent in any task with a narrow metric”, to “beats humans at EVERYTHING”. That is an unreasonable goal, and high-performing ASIs may not be able to write better than GRRM either.
I’m asking specifically about the assertion that “RL style self play” could be used to iterate to AGI. I don’t see what sort of game could lead to this outcome. You can’t have this sort of self-play with “solve this math problem” as far as I can tell, and even if you could I don’t see why it would promote AGI as opposed to something that can solve a narrow class of math problems.
Obviously LLMs have amazing generalist capabilities. But as far as I can tell you can’t iterate on the next version of these models by hooking them up to some sort of API that provides useful, immediate feedback… we’re not on the cusp of removing the HF part of the RLHF loop. I think understanding this is key to whether we should expect slow takeoff or fast takeoff.
Anyways, here’s how to get an AGI this way: https://www.lesswrong.com/posts/Aq82XqYhgqdPdPrBA/full-transcript-eliezer-yudkowsky-on-the-bankless-podcast?commentId=Mvyq996KxiE4LR6ii
This will work; the only reason it won’t get used is that it is possibly not the computationally cheapest option. (This proposal is incredibly expensive in compute unless we reuse a lot of components between iterations.)
Whether you consider a machine an “AGI” if it has a score heuristic that forces generality (by negatively weighting complex specialized architectures and heavily weighting zero-shot multimodal/multi-skill tasks) and is able to do hundreds of thousands of tasks is up to your definition.
Since the machine would be self replicating and capable of all industrial, construction, driving, logistics, software writing tasks - all things that conveniently fall into the scope of ‘can be objectively evaluated’ I say it’s an AGI. It’s capable of everything needed to copy itself forever and to self improve, it’s functionally a sentient new civilization. The things you mentioned—like beating GRRM at writing a good story—do not matter.
Sure, this is useful. To your other posts, I don’t think we’re really disagreeing about what AGI is. I think we’d agree that if you took a model with GPT-4-like capabilities and hooked it up to a chess API to reinforce it, you would end up with a GPT-4 model that’s very good at playing chess, not something that has strongly improved its general underlying world model and thus would also be able to, say, improve its LSAT score. And this is what I’m imagining most self-play training would accomplish… but I’m open to being wrong. To your point about having a “benchmark of many tasks”, I guess maybe I could imagine hooking it up to like 100 different self-playing games which are individually easy to run but require vastly different skills to master, but I could also see this just… not working as well. Teams have been trying this for a decade or so already, right? A breakthrough is possible though for sure.
I’m just trying to underscore that there are lots of tasks which we hope that AGIs would be able to accomplish (e.g., solving open math problems) but we probably cannot use RL to directly iterate a model to accomplish this task because we can’t define a gradient of reward that would help define the AGI.
>To your point about having a “benchmark of many tasks”, I guess maybe I could imagine hooking it up to like 100 different self-playing games which are individually easy to run but require vastly different skills to master, but I could also see this just… not working as well. Teams have been trying this for a decade or so already, right? A breakthrough is possible though for sure.
No, nobody has been trying anything that matters for decades. As it turns out, the only thing that matters was scale. So there are 3 companies that had enough money for scale, and they are the only efforts that count, and all combined they have done a small enough number of full-scale experiments that you can count them on two hands. @gwern has expressed the opinion that we probably didn’t even need the transformer; other neural networks likely would have worked at these scales.
As for the rest of it, no. We’re saying that at massive scales we abdicate trying to understand AGI architectures, since they are enormously complex and coupled machines, and just iteratively find some that work by trial and error.
“Work” includes generality. The architecture that can play 100 games and does extremely well at game 101 the first try gets way more points than one that doesn’t. The one that has never read a book on the topic of the LSAT but still does well on the exam is exactly what we are looking for. (Though this can be tough to filter, since obviously it’s simply easier to train on all text in existence.)
One that has controlled a robot to manipulate fine wire and do many object-manipulation tasks, and has passed the exams for a course on electronics, and then on the first try builds a working circuit in a simulated world is what we’re looking for. So more points for that.
That’s the idea. Define what we want the machine to do and what we mean by “generality”, iterate over the search space a very large number of times. In an unbiased way, pick the most distinct n winners and have those winners propose the next round of AGI designs and so on.
And most of the points for the winners are explicitly for the generality behavior we are seeking.
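A rough sketch of what "most of the points are for generality" plus "pick the most distinct n winners" could look like in code. Every detail here is an assumption made for illustration: the `Candidate` fields, the weights, the use of a numeric descriptor vector to measure how distinct two designs are, and the greedy selection rule. The step where winners propose the next round of designs is not modelled at all.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    name: str
    trained_scores: Dict[str, float]    # tasks the design was trained on, scored 0..1
    zero_shot_scores: Dict[str, float]  # held-out tasks attempted "first try", scored 0..1
    complexity: float                   # hypothetical penalty term for specialized machinery
    descriptor: Tuple[float, ...]       # hypothetical embedding of the architecture


def generality_score(
    c: Candidate, zero_shot_weight: float = 3.0, complexity_penalty: float = 0.5
) -> float:
    # Held-out performance is weighted most heavily; complexity is penalized.
    return (
        mean(list(c.trained_scores.values()))
        + zero_shot_weight * mean(list(c.zero_shot_scores.values()))
        - complexity_penalty * c.complexity
    )


def distance(a: Candidate, b: Candidate) -> float:
    return sum((x - y) ** 2 for x, y in zip(a.descriptor, b.descriptor)) ** 0.5


def pick_distinct_winners(
    pool: List[Candidate], n: int, min_distance: float = 1.0
) -> List[Candidate]:
    # Greedy: walk candidates from best to worst score, skipping near-duplicates of earlier picks.
    winners: List[Candidate] = []
    for cand in sorted(pool, key=generality_score, reverse=True):
        if all(distance(cand, w) >= min_distance for w in winners):
            winners.append(cand)
        if len(winners) == n:
            break
    return winners


pool = [
    Candidate("specialist", {"chess": 0.99, "go": 0.98}, {"game_101": 0.10}, 0.8, (0.0, 0.0)),
    Candidate("generalist", {"chess": 0.80, "go": 0.75}, {"game_101": 0.70}, 0.4, (3.0, 1.0)),
    Candidate("generalist_clone", {"chess": 0.78, "go": 0.74}, {"game_101": 0.68}, 0.4, (3.1, 1.0)),
]
print([c.name for c in pick_distinct_winners(pool, n=2)])
```

With these made-up numbers the generalist outranks the specialist despite lower per-game scores, and the near-identical clone is skipped in favour of a different design.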
>As it turns out, the only thing that matters was scale.
I mean, in some sense yes. But AlphaGo wasn’t trained by finding a transcript of every Go game that had ever been played, but instead was trained via self-play RL. But attempts to create general game-playing agents via similar methods haven’t worked out very well, in my understanding. I don’t assume that if we just threw 10x or 100x data at them that this would change...
>The architecture that can play 100 games and does extremely well at game 101 the first try gets way more points than one that doesn’t. The one that has never read a book on the topic of the LSAT but still does well on the exam is exactly what we are looking for.
Yes, but the latter exists and is trained via reinforcement learning from human feedback, which can’t be translated to self-play. The former doesn’t exist as far as I can tell. I don’t see anyone proposing to improve GPT-4 by switching from RLHF to self-play RL.
Ultimately I think there’s a possibility that the improvements to LLMs from further scaling may not be very large, and instead we’ll need to find some sort of new architecture to create dangerous AGIs.
GPT-4 got RL feedback in the form of self-evaluation across all the inputs users fed to ChatGPT.
Self-play would be having it practice LeetCode problems, with the score as the RL feedback.
The software support is there and the RL feedback worked. Why do you think it is even evidence to say that an “obvious thing that works well hasn’t been done yet, or maybe it has and OpenAI won’t say”?
There is also a tremendous amount of self play possible now with the new plugin interface.
You can connect them to such an API. It’s not hard, we already have the things needed to make the API, and you can start with LLMs. It’s a fairly simple, obvious recursive bench.
Main limit is just money.
I think you need to define what you think AGI is first.
I think with a reasonable, grounded, and measurable version of AGI it is trivial to do with self play. Please tell me what you think AGI means. I don’t think it matters if there are subjective things the AGI can’t do well.
Still, offline learning is very useful, and so long as you do enough offline learning, you don’t have problems in the online learning phase.
Next, jailbreaking. I’ll admit, this isn’t something I initially covered. But if we admit that alignment is achievable, and the only remaining question is whether alignment is stable, then in my model we’ve won almost all the value, as my threat model is closer to “We want good, capable AGI, but we can’t get it because aligning it is very difficult.”
So I think alignment was the load-bearing part of my model, and thus we have much lower p(Doom), more like 0.1-10% probability.