AI strategy consideration. We won't know which AI run will be The One. Therefore, the amount of care taken on the training run which produces the first AGI will, on average, be less than intended.
It's possible for a team to be totally blindsided. Maybe they thought they would just take a really big multimodal init, finetune it with some RLHF on the quality of its physics reasoning, have it play some video games with realistic physics, and then try to get it to do new physics research. And it takes off. Oops!
It's possible the team suspected, but had a limited budget. Maybe you can't pull out all the stops for every run: you can't be as careful with labeling, checkpointing, interpretability, and boxing.
No team is going to run a training run with more care than they would have used for the AGI Run, and they will use less care on runs they don't even expect to produce AGI. So the care actually taken on the real AGI Run will, on average, be strictly less than intended.
Teams which try to be more careful on each run will take longer to iterate on AI designs, thereby lowering the probability that they (the relatively careful teams) will be the first to do an AGI Run.
Upshots:
1. The alignment community should strive for anytime performance on their alignment recommendations, such that we make a difference even if AGI comes by "surprise." We will not necessarily observe a bunch of externally visible fanfare and anticipation before AGI comes. We should not count on "and then the bigshot researchers have time to hash out final disagreements and then hand over a dossier of their alignment recommendations."
2. We should, each year, have a set of practical recommendations for any lab which thinks it might build an AGI soon and is going to go ahead with it anyway (even though that's extremely unwise).
These recommendations should not be too onerous. Instead, they should be stratified, comprising multiple levels of "alignment tax" which an AGI team can pay according to their suspicion that this run will be It. For example:
Low/no tax tips:
If you're going to include non-IID finetuning, that may lead to agentic cognition. In every run where you do this, also finetune on a few human-approval-related tasks, such as [A, B, C; I haven't actually worked out my best guesses here].
Otherwise, if the run surprisingly hits AGI, you may not have included any human-relevant value-formation data, and there will have been no chance for the AGI to be aligned, even under relatively optimistic worldviews.
Scrub the training corpus of mentions of Roko's-basilisk-type entities. [This one might cost weirdness points, depending on the lab.] Including such entities might enable relatively dumb agents to model infohazardous entities which blackmail them, while the agent is too dumb to realize it shouldn't think about those entities at all. Otherwise these entities are probably not a big deal, as long as the AI doesn't abstractly infer their existence until it is relatively smart. (A minimal sketch of such a corpus filter appears after these tips.)
More taxing tips:
Run interpretability tools A, B, C and look out for concept and capability formation D, E, F.
Use boxing precautions G and H.
High-tax runs:
Use labeling techniques as follows… Be careful with X and Y forms of data augmentation.
Keep reward sources (like buttons) out of sight of the agent, and don't mention how the agent is being rewarded, so as to decrease P(agent reinforced for getting reward in and of itself). In interactions, emphasize that the agent is reinforced for doing what we want. (A toy illustration of this appears after these tips.)
(Fancier alignment techniques, if we deem those wise)
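To make the corpus-scrubbing tip concrete, here is a minimal sketch of a keyword-based document filter, assuming a jsonl corpus with a "text" field. The blocklist patterns, file paths, and field names are illustrative assumptions rather than worked-out recommendations; a real pipeline would need a curated blocklist and probably fuzzier matching.

```python
# Minimal sketch of the corpus-scrubbing tip above. The blocklist, file
# paths, and jsonl format are illustrative assumptions.
import json
import re
from pathlib import Path

# Hypothetical blocklist; a real one would be curated by the lab.
BLOCKED_PATTERNS = [
    r"roko'?s\s+basilisk",
    r"acausal\s+blackmail",
]
BLOCKED_RE = re.compile("|".join(BLOCKED_PATTERNS), flags=re.IGNORECASE)


def scrub_corpus(src: Path, dst: Path) -> int:
    """Copy a jsonl corpus, dropping any document that mentions a blocked entity.

    Returns the number of documents removed.
    """
    removed = 0
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            doc = json.loads(line)
            if BLOCKED_RE.search(doc.get("text", "")):
                removed += 1
                continue  # drop the whole document rather than redact in place
            fout.write(line)
    return removed


if __name__ == "__main__":
    n = scrub_corpus(Path("corpus.jsonl"), Path("corpus_scrubbed.jsonl"))
    print(f"Removed {n} documents containing blocked mentions")
```

Dropping whole documents rather than redacting in place is a deliberate simplification: redaction preserves more data but risks leaving recoverable context around the removed mention.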
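Similarly, here is a toy illustration of the "keep reward sources out of sight" tip, sketched as a Gymnasium-style environment wrapper that strips reward-revealing fields from what the agent observes. The environment setting and field names (e.g. "reward_display") are hypothetical; the training loop still receives the scalar reward, it just isn't surfaced to the agent.

```python
# Toy sketch of hiding reward information from the agent's observations.
# Assumes a Gymnasium environment with dict observations; the key names
# below are hypothetical examples.
from typing import Any

import gymnasium as gym


class HideRewardInfo(gym.Wrapper):
    """Strip reward-revealing fields from observations and step info.

    The training code still receives the scalar reward; the agent's
    observations just no longer display it. A full implementation would
    also narrow self.observation_space to match the scrubbed dict.
    """

    HIDDEN_OBS_KEYS = ("reward_display",)            # e.g. an on-screen score counter
    HIDDEN_INFO_KEYS = ("reward", "episode_return")  # common bookkeeping keys

    def step(self, action: Any):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._scrub_obs(obs), reward, terminated, truncated, self._scrub_info(info)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._scrub_obs(obs), self._scrub_info(info)

    def _scrub_obs(self, obs):
        if isinstance(obs, dict):
            return {k: v for k, v in obs.items() if k not in self.HIDDEN_OBS_KEYS}
        return obs  # non-dict observations pass through unchanged

    def _scrub_info(self, info: dict) -> dict:
        return {k: v for k, v in info.items() if k not in self.HIDDEN_INFO_KEYS}


# Usage (hypothetical): env = HideRewardInfo(gym.make("SomePhysicsEnv-v0"))
```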
I think this framing is accurate and important. The implications are, of course, "undignified," to put it lightly...
Broadly agree on upshot (1), though of course I hope we can do even better. (2) is also important, though IMO it's way too weak. (Rule zero: ensure that it's never your lab that ends the world.)
As usual, opinions my own.