The second is reliability. Yes, there is the issue that often consumer expectations are for 100% reliability, and sometimes you actually do need 100% reliability (or at least 99.99% or what not). I see no reason you can’t, for well-defined tasks like computers traditionally do, get as many 9s as you want to pay for, as long as you are willing to accept a cost multiplier.
The problem is that AI is intelligence, not a deterministic program, yet we are holding it to deterministic standards. Whereas the other intelligences available, humans, are not reliable at all, outside of at most narrow particular contexts. Your AI personal assistant will soon be at least as reliable as a human assistant would be.
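To make the ‘as many 9s as you want to pay for’ point concrete, here is a toy sketch (the 90%-per-attempt figure is assumed purely for illustration, and it leans on two strong assumptions: attempts fail independently, and a reliable checker catches every failed attempt):

```python
import math

def attempts_for(p_fail: float, max_fail_prob: float) -> int:
    """Independent attempts k needed so that p_fail**k <= max_fail_prob,
    assuming failures are independent and a reliable checker rejects
    every failed attempt (both strong, simplifying assumptions)."""
    return math.ceil(math.log(max_fail_prob) / math.log(p_fail))

p_fail = 0.10  # assumed: the model gets a single attempt right 90% of the time
for nines in (2, 3, 4, 5):
    k = attempts_for(p_fail, 10.0 ** -nines)
    print(f"{nines} nines of reliability -> {k} attempts, ~{k}x cost")
```

Each extra nine costs roughly one more attempt here, which is the ‘cost multiplier’ being referred to; real tasks with correlated failures or imperfect checkers will do worse.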
I disagree that humans are unreliable; I would go the opposite way and say that humans are very reliable agents. A lot of why AI hasn’t been used as much as the tech world expected is that reliability matters far more than LWers thought it would in many domains.
This is a problem scaling will eventually solve, by the 2030s-2040s at the latest, but if Leopold Aschenbrenner turns out to be wrong about the timelines in Situational Awareness, this is likely to be why.
See this post, which, while exaggerating the reliability of humans, is IMO the best writeup on just how reliable humans are, and its numbers are IMO quite robust to 1-3 OOM of error: https://www.lesswrong.com/posts/28zsuPaJpKAGSX4zq/humans-are-very-reliable-agents
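The post’s basic arithmetic is easy to reproduce. A toy version (the error rates below are assumed purely for illustration, not taken from the post): per-step reliability compounds over long tasks, and the qualitative conclusion survives shifting the assumed rates by an order of magnitude or three.

```python
# Toy illustration: the chance of completing a long task is
# per_step_reliability ** n_steps, so per-step error rates compound fast.
# The error rates here are assumed for illustration only.
n_steps = 1_000
for per_step_error in (1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    p_task = (1 - per_step_error) ** n_steps
    print(f"per-step error {per_step_error:.0e} -> "
          f"P(finish a {n_steps}-step task) = {p_task:.3f}")
```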
Sam Bowman literally outlines the exact plan Eliezer Yudkowsky constantly warns not to use, and which the Underpants Gnomes know well.
1. Preparation (You are Here)
2. Making the AI Systems Do Our Homework (?????)
3. Life after TAI (Profit)
If I were Sam Bowman and Anthropic, I’d replace the question marks with synthetic data, and with relying on the fact that alignment generalizes further than capabilities, not the other way around.
The worry is that this is essentially saying ‘we do our jobs, solve alignment, it all works out.’ That doesn’t really tell us how to solve alignment, and has the implicit assumption that this is a ‘do your job’ or ‘row the boat’ (or even ‘play like a champion today’) situation. Whereas I see a very different style of problem. You do still have to execute, or you automatically lose. And if we execute on Bowman’s plan, we will be in a vastly better position than if we do not do that. But there is no script.
Agree that the checklist is unfortunately not a complete plan, but I think it can be turned into a complete, working plan if we add several more details.
That’s the thing. To me this does not make sense. How can you create machines that are smarter than humans, and not be at least ‘somewhat’ concerned that it ‘might’ pose a threat to humanity? What?
To steelman the Chinese view that AI isn’t an existential threat, I think it’s helpful to notice several things:
Instrumental convergence is not as strong as we feared, and there are real reasons to think that whatever instrumental convergence we do get out of powerful AIs is highly steerable, thanks to their having very densely defined utility/reward functions. There’s a rather good chance that we can just do Retargeting the Search (a toy sketch appears after these points), though we’d need to get better at interpretability to make the most use of it.
Synthetic data lets us completely control what the AI learns, and in particular lets us instill values into AIs very deeply, before they have a chance to deceive us. There are also strong reasons to expect synthetic data loops to become the main source of training data, because real human data is both fairly low quality and likely not enough for AI by 2028 (at least for text data). See here for more: https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
Alignment likely generalizes further than capabilities: verification is way, way easier than generation, we can afford to explore less in the space of values, and in practice reward models are easier to get working than capabilities. Together these strongly point to alignment generalizing further than capabilities. See here for a bit more on this: https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
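As a minimal sketch of the verification-is-easier-than-generation asymmetry (generic best-of-n selection, not anything from the linked post; `sample_model` and `reward_model` below are hypothetical placeholders):

```python
from typing import Callable, List

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Draw n candidates and keep the one the verifier scores highest.
    The quality of what you keep is bounded by the verifier, which is the
    point: checking a candidate is assumed to be much easier than
    producing a good one directly."""
    candidates: List[str] = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical usage -- neither name is a real API:
# answer = best_of_n(sample_model, reward_model, n=64)
```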
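And for the ‘Retargeting the Search’ point above, a cartoon version of the idea, with the goal representation made explicit as a vector argument; in the real proposal the goal representation would have to be located inside a trained network via interpretability, which is the hard part:

```python
import numpy as np

def general_purpose_search(predicted_outcomes: np.ndarray,
                           goal: np.ndarray) -> int:
    """Toy goal-agnostic search: pick the action whose predicted outcome
    is closest to the goal representation."""
    distances = np.linalg.norm(predicted_outcomes - goal, axis=1)
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
world_model = rng.normal(size=(5, 8))   # predicted outcomes of 5 actions

learned_goal = world_model[2] + 0.01    # whatever goal training instilled
desired_goal = world_model[4] + 0.01    # the goal we actually want

print(general_purpose_search(world_model, learned_goal))   # -> action 2
# "Retargeting the search": keep the search machinery, swap only the goal.
print(general_purpose_search(world_model, desired_goal))   # -> action 4
```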
I get your frustration with a lot of the discourse, and I agree that many people who disagree about AI as a threat are really disagreeing about how far AI capabilities can go.
But that is not all of the criticism (my own criticism of AI as a threat, for example, is not of that kind), and it pays to listen to the nuanced arguments against AI doom.
But this is untrue in practice (observe that models do not become suddenly useless after they’re jailbroken) and unlikely in practice (since capabilities come by default, when you learn to predict reality, but alignment does not; why would predicting reality lead to having preferences that are human-friendly? And the post-training “alignment” that AI labs are performing seems like it’d be quite unfriendly to me, if it did somehow generalize to superhuman capabilities). Also, whether or not it’s true, it is not something I’ve heard almost any employee of one of the large labs claim to believe (minus maybe TurnTrout? not sure if he’d endorse it or not).
verification is way, way easier than generation, we can afford to explore less in the space of values, and in practice reward models are easier to get working than capabilities. Together these strongly point to alignment generalizing further than capabilities
This is not what “generalizes further” means. “Generalizes further” means “you get more of it for less work”.
why would predicting reality lead to having preferences that are human-friendly?
LLMs are not trained to predict reality — they’re trained to predict human-generated text, i.e. we’re distilling human intelligence into them. This gets you something that uses human ontologies, understands human preferences and values in great detail, acts agentically, and works more sloppily in August.
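For concreteness, the training objective being described is ordinary next-token prediction over a human-written corpus (standard notation, not something specific to this exchange):

$$\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim D_{\text{human}}}\Big[\sum_{t} \log p_{\theta}(x_t \mid x_{<t})\Big]$$

The loss is minimized by matching the conditional distribution of human-written continuations given the context, which is why a prompt picks out a distribution over possible authors rather than a single fixed agent.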
The problem here for ASI is that while humans understand human values well, not all (perhaps not even many) humans are extremely moral or kindly or wise, or safe to be handed godlike intelligence, enormous power, and the ability to run rings around law enforcement. The same is by default going to be true of an artificial intelligence distilled from humans. As for “having preferences”, an LLM doesn’t simulate a single human (or their preferences); for each request it simulates a new, randomly selected member of a prompt-dependent distribution of possible humans (and their preferences).
The problem here for ASI is that while humans understand human values well, not all (perhaps not even many) humans are extremely moral or kindly or wise, or safe to be handed godlike intelligence, enormous power, and the ability to run rings around law enforcement.
This is why I think synthetic data, as well as not open-sourcing/open-weighting ASI, is likely to be necessary, at least for a few years: we cannot settle for merely human-level alignment of an ASI. The good news is that synthetic data is a very natural path to increasing capabilities for AI in general, not just LLMs, and I’m more hopeful than you that we can get instruction-following AGI/ASI to automate alignment research.

Completely agreed (and indeed currently looking for employment where I could work on just that).
But this is untrue in practice (observe that models do not become suddenly useless after they’re jailbroken)
Note that alignment generalizing further than capabilities doesn’t mean a model becomes useless once someone jailbreaks it.
I’m just saying it’s harder to optimize in the world than to learn human values, not that capability is suddenly lost if you jailbreak the model.
Also, a lot of the jailbreaking attempts are very clear examples of misuse problems, not misalignment problems, at least in my view. The fact that the humans’ goal is often just to produce offensive/sexual content is enough to show that these are very minor misuse problems.
and unlikely in practice (since capabilities come by default, when you learn to predict reality, but alignment does not; why would predicting reality lead to having preferences that are human-friendly? And the post-training “alignment” that AI labs are performing seems like it’d be quite unfriendly to me, if it did somehow generalize to superhuman capabilities).
Basically, because there is a lot of data on human values, and because models are heavily, heavily influenced by both the sources and the quantity of their data, learning and having human values comes mostly for free via unsupervised learning.
More generally, my big thesis is that if you want to understand how an AI is aligned, or what capabilities it has, the data matter a lot, at least compared to the prior.
Re the labs’ alignment methods, the good news is that we likely have better methods than post-training RLHF and RLAIF, because synthetic data lets us shape the AI model’s values early in training, before it knows how to deceive, and thus can probably shape the incentives away from deception. More on that here: https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
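A minimal sketch of the kind of pipeline this points at (all names here are hypothetical placeholders, and the real recipes in the linked post are more involved): generate documents that exhibit the target values, keep only those an automated checker accepts, and put them into the pretraining mix rather than bolting them on in post-training.

```python
from typing import Callable, Iterable, List

def build_value_shaped_corpus(generate_example: Callable[[str], str],
                              passes_check: Callable[[str], bool],
                              prompts: Iterable[str],
                              samples_per_prompt: int = 4) -> List[str]:
    """Toy synthetic-data loop: sample candidate documents meant to exhibit
    the target values, keep only those a (cheaper) automated checker accepts,
    and return them for inclusion in the pretraining mix."""
    corpus: List[str] = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            doc = generate_example(prompt)
            if passes_check(doc):
                corpus.append(doc)
    return corpus

# Hypothetical usage -- `teacher_sample` and `values_check` stand in for a
# generator model and an automated values/quality filter:
# pretraining_mix = human_data + build_value_shaped_corpus(
#     teacher_sample, values_check, value_prompts)
```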
Also, whether or not it’s true, it is not something I’ve heard almost any employee of one of the large labs claim to believe (minus maybe TurnTrout? not sure if he’d endorse it or not).
Yep, this is absolutely a hot take of mine (though it originally comes from Beren Millidge), and one that I do think has a reasonable chance of being correct.
This is not what “generalizes further” means. “Generalizes further” means “you get more of it for less work”.
Yeah, this is also what I meant. My point is that you get more alignment for less work, both because of the sheer amount of data on values (and AIs will almost certainly be influenced quite heavily by their data), and because it’s easier to learn and implement values than it is to get the equivalent amount of capability.
I’m just saying it’s harder to optimize in the world than to learn human values
Learning what human values are is of course a subset of learning about reality, but it also doesn’t really have anything to do with alignment (as describing an agent’s tendency to optimize for states of the world that humans would find good).
I think where I disagree is that I do think value learning, and learning about human values, is quite obviously very important for alignment, both because a lot of alignment approaches depend on value learning working, and because the data heavily influences what the AI tries to optimize.
Another way to say it is that the data strongly influence the optimization target, because a large portion of both capabilities and alignment is strongly downstream of the data the model is trained on, so I don’t see why learning what human values are is unrelated to this:
(as describing an agent’s tendency to optimize for states of the world that humans would find good).