But this is untrue in practice (observe that models do not become suddenly useless after they’re jailbroken)
Note that alignment generalizing further than capabilities doesn't imply that a jailbroken model becomes useless.
I'm just saying that it's harder to optimize in the world than to learn human values, not that capability is suddenly lost when you jailbreak a model.
Also, a lot of jailbreaking attempts are, at least in my view, clear examples of misuse problems, not misalignment problems. The fact that the humans' goal is so often just to produce offensive/sexual content is enough to show that these are fairly minor misuse problems.
and unlikely in practice (since capabilities come by default, when you learn to predict reality, but alignment does not; why would predicting reality lead to having preferences that are human-friendly? And the post-training “alignment” that AI labs are performing seems like it’d be quite unfriendly to me, if it did somehow generalize to superhuman capabilities).
Basically, because there is a lot of data on human values, and because models are heavily influenced by both the content and the quantity of their training data, learning and internalizing human values comes mostly for free via unsupervised learning.
More generally, my big thesis is that if you want to understand how an AI is aligned, or what capabilities it has, the data matter a lot, at least compared to the prior.
Re the labs' alignment methods, the good news is that we likely have better methods than post-training RLHF and RLAIF, because synthetic data lets us shape a model's values early in training, before it knows how to deceive, and thus can probably shape its incentives away from deception. More on that here:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
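To make the "shape values early in training" idea concrete, here is a minimal sketch, purely my own illustration rather than anything from the linked post, with hypothetical names (synthetic_fraction, sample_batch): synthetic, values-focused documents make up a relatively large share of early pretraining batches, and that share is annealed down as training proceeds, so the values-relevant data arrives before most of the capabilities data.

```python
import random

def synthetic_fraction(step: int, total_steps: int,
                       start: float = 0.3, end: float = 0.02) -> float:
    """Linearly anneal the per-batch share of synthetic values-focused data."""
    progress = step / max(total_steps, 1)
    return start + (end - start) * progress

def sample_batch(web_docs: list[str], synthetic_docs: list[str],
                 step: int, total_steps: int, batch_size: int = 8) -> list[str]:
    """Mix ordinary web text with synthetic values-focused text for one batch."""
    frac = synthetic_fraction(step, total_steps)
    return [
        random.choice(synthetic_docs) if random.random() < frac
        else random.choice(web_docs)
        for _ in range(batch_size)
    ]

# Toy usage: early batches are ~30% synthetic "values" documents, late ones ~2%.
web = ["<ordinary web document>"] * 1000
synth = ["<synthetic document demonstrating honest, cooperative behaviour>"] * 100
for step in (0, 500, 999):
    batch = sample_batch(web, synth, step, total_steps=1000)
    print(step, sum(doc.startswith("<synthetic") for doc in batch))
```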
Also, whether or not it's true, it is not something I've heard almost any employee of one of the large labs claim to believe (minus maybe TurnTrout? Not sure if he'd endorse it or not).
Yep, this is absolutely a hot take of mine (though the true origin is Beren Millidge), and one that I do think has a reasonable chance of being correct.
This is not what “generalizes further” means. “Generalizes further” means “you get more of it for less work”.
Yeah, this is also what I meant. My point is that you get more alignment for less work, both because of the sheer amount of relevant data (and AIs will almost certainly be heavily influenced by their data) and because it's easier to learn and implement values than it is to gain the equivalent amount of capability.
I’m just saying it’s harder to optimize in the world than to learn human values
Learning what human values are is of course part of a subset of learning about reality, but also doesn't really have anything to do with alignment (as describing an agent's tendency to optimize for states of the world that humans would find good).
Where I disagree: I do think value learning and learning about human values are quite obviously very important for alignment, both because a lot of alignment approaches depend on value learning working, and because the data heavily influence what the model tries to optimize.
Another way to say it is that the data strongly influence the optimization target: a large portion of both capabilities and alignment is downstream of the data the model is trained on, so I don't see why learning what human values are would be unrelated to this:
(as describing an agent’s tendency to optimize for states of the world that humans would find good).