If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
From a ‘real alignment’ perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.
You might think of the label ‘RLAIF’ as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI’s predictions (or more general generative output, if the training isn’t for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some sort of supervisory signal.
Similarly, the AZR setup leverages the AI’s unsupervised knowledge of code-quality-laden behaviors, using a scaffold that turns them back into a reward signal that lets the AI quote-unquote “train itself” to code better. Except that relative to vanilla RLAIF, there’s more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I’ve described things in this way, you can probably see how to turn this back into RLAIF for alignment.
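To make that parallel concrete, here's a minimal sketch of the kind of scaffold I mean. Everything in it (the method names on `policy_model`, the constitution-style `judge_instructions`, the batch size) is a placeholder of mine, not anything from the AZR paper or any particular RLAIF implementation.

```python
import random

def rlaif_step(policy_model, prompts, judge_instructions):
    """One round of the scaffold: solicit preference-laden behavior from the
    model, have the model judge it, and turn the judgment back into a
    supervisory signal."""
    batch = []
    for prompt in random.sample(prompts, k=min(8, len(prompts))):
        # 1. Solicit behavior from the model itself.
        candidate_a = policy_model.generate(prompt)
        candidate_b = policy_model.generate(prompt)

        # 2. Ask the model which candidate a human would prefer, conditioned
        #    on fixed, constitution-style instructions.
        preferred = policy_model.judge(judge_instructions, prompt, candidate_a, candidate_b)

        # 3. Transform that prediction into a preference label.
        if preferred == "A":
            batch.append((prompt, candidate_a, candidate_b))  # (prompt, chosen, rejected)
        else:
            batch.append((prompt, candidate_b, candidate_a))

    # 4. Use the labels as the reward / preference-optimization signal.
    policy_model.update_on_preferences(batch)
```

The AZR-flavored twist would be to replace the fixed `prompts` with problems the model proposes to itself (plus some automatic check on the solutions), so the loop generates its own curriculum instead of just responding to samples from the training distribution.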
The overarching problem is, as usual, we don’t understand how to do alignment in a non-hacky way.
We don’t know what sorts of moral reflection are necessary for good outcomes, and we don’t know where human feedback is a necessary ingredient to keep AI meta-ethical evolution grounded to human preferences. But hey, if we try various value learning schemes empirically maybe we’ll learn some things.
We should probably distinguish between a bunch of different things, such as: the ‘race’ between capabilities research and alignment research, AI supervision and control, some notion of “near-term default alignedness” of AIs, interpretability, and value alignment.
I think it’s pretty obvious what’s going on with the race between capabilities and alignment.
This doesn’t have a big impact on AI supervision or control, because the way we currently plan to do these things involves mostly treating the AIs as black boxes—strategies like untrusted model supervision with a small percent of honeypots should still work about as well.
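For example, here's roughly the shape of honeypot check I have in mind; the names and the 90% threshold are made up for illustration, not taken from any specific control protocol.

```python
import random

def audit_with_honeypots(untrusted_monitor, real_actions, honeypots, honeypot_rate=0.02):
    """Black-box check on an untrusted monitor: mix a small percentage of
    known-bad honeypot actions into its review queue and verify that it still
    flags them, without ever looking inside the model."""
    queue = [(action, False) for action in real_actions]
    queue += [(random.choice(honeypots), True)
              for _ in range(max(1, int(honeypot_rate * len(real_actions))))]
    random.shuffle(queue)

    caught, total_honeypots, flagged_real = 0, 0, []
    for action, is_honeypot in queue:
        flagged = untrusted_monitor.flags_as_unsafe(action)
        if is_honeypot:
            total_honeypots += 1
            caught += int(flagged)
        elif flagged:
            flagged_real.append(action)

    # If the monitor stops catching the honeypots, stop trusting its clean reports.
    monitor_reliable = caught / total_honeypots > 0.9
    return flagged_real, monitor_reliable
```

None of this needs to know how the monitored model was trained, which is why I don't expect this kind of result to change the control picture much.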
I agree this is bad for the near-term default alignedness of AI, because currently all RL is bad for the near-term default alignedness of AI. But note that this means I disagree with your reasoning: it’s not “control over its training” that’s the problem, it’s the distinction between learning from human data versus learning from a reward signal.
Probably not much impact on interpretability; I just included that for completeness.
The impact on alignment per se is an interesting question (if not about this paper specifically, then about this general direction of progress) that I think people have understandably dodged. I’ll try my hand at a top-level comment on that one.
I think it is fine to assume that the “true” (Coherent Extrapolated Volition) preferences of e.g. humans are transitive.
I think this could be safe or unsafe, depending on implementation details.
By your scare quotes, you probably recognize that there isn’t actually a unique way to get some “true” values out of humans. Instead, there are lots of different agent-ish ways to model humans, and these different ways of modeling humans will have different stuff in the “human values” bucket.
Rather than picking just one way to model humans, and following it and only it as the True Way, it seems a lot safer to build future AI that understands how this agent-ish modeling thing works, and tries to integrate lots of different possible notions of “human values” in a way that makes sense according to humans.
Of course, any agent that does good will make decisions, and so from the outside you can always impute transitive preferences over trajectories to this future AI. That’s fine. And we can go further, and say that a good future AI won’t do obviously-inconsistent-seeming stuff like do a lot of work to set something up and then do a lot of work to dismantle it (without achieving some end in the meantime—the more trivial the ends we allow, the weaker this condition becomes). That’s probably true.
But internally, I think it’s too hasty to say that a good future AI will end up representing humans as having a specific set of transitive preferences. It might keep several incompatible models of human preferences in mind, and then aggregate them in a way that isn’t equivalent to any single set of transitive preferences on the part of humans (which means that in the future it might allow or even encourage humans to do some amount of obviously-inconsistent-seeming stuff).
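A toy illustration of that last point, with three made-up models of “human preferences” that are each perfectly transitive, aggregated by majority vote on pairwise comparisons (the vote rule is just for the example, not a proposal):

```python
from itertools import combinations

# Three transitive models of "human preferences", each a ranking from best to worst.
models = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """Aggregate by majority vote on the pairwise comparison x vs. y."""
    votes_for_x = sum(ranking.index(x) < ranking.index(y) for ranking in models)
    return votes_for_x > len(models) / 2

for x, y in combinations(["A", "B", "C"], 2):
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")

# Prints True, False, True: A beats B, B beats C, and C beats A.
# The aggregate is a cycle, so it isn't equivalent to any single set of
# transitive preferences, even though every input model was transitive.
```

This is just the classic Condorcet cycle; the point is only that “aggregate of several reasonable models” and “one set of transitive preferences” can come apart.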
It me.
If we’re talking about the domain where we can assume “good human input”, why do we need a solution more complicated than direct human supervision/demonstration (perhaps amplified by reward models or models of human feedback)? I mean this non-rhetorically; I have my own opinion (that debate acts as an unprincipled way of inserting one round of optimization for meta-preferences [if confusing, see here]), but it’s probably not yours.
Thanks for the post (and for linking the research agenda, which I haven’t yet read through)! I’m glad that, even if you use the framing of debate (which I don’t expect to pan out) to think about alignment, you still get to instrumental subproblems that would be broadly useful.
(If this post is “what would help make debate work for AI alignment,” you can also imagine framings “what would help make updating on human feedback work” [common ARC framing] and “what would help make model-based RL work” [common Charlie framing])
I’d put these subproblems into two buckets:
Developing theorems about specific debate ideas.
These are the most debate-specific.
Formalizing fuzzy notions.
By which I mean fuzzy notions that are kept smaller than the whole alignment problem, and so you maybe hope to get a useful formalization that takes some good background properties for granted.
I think there’s maybe a missing bucket, which is:
Bootstrapping or indirectly generating fuzzy notions.
If you allow a notion to grow to the size of the entire alignment problem (The one that stands out to me in your list is ‘systematic human error’ - if you make your model detailed enough, what error isn’t systematic?), then it becomes too hard to formalize first and apply second. You need to study how to safely bootstrap concepts, or how to get them from other learned processes.
Thanks, this was pretty interesting.
The big problem is the free choice of ‘conceptual language’ (universal Turing machine) when defining simplicity/comprehensibility. You at various points rely on an assumption that there is one unique scale of complexity, shared between the humans and the AI. That’s not necessarily true, which creates a lot of leaks where an AI might do something that’s simple in the AI’s internal representation but complicated in the human’s.
It’s OK to make cars pink by using paint (“spots of paint” is an easier to optimize/comprehend variable). It’s not OK to make cars pink by manipulating individual water droplets in the air to create an elaborate rainbow-like illusion (“individual water droplets” is a harder to optimize/comprehend variable).
This raises a second problem, which is the “easy to optimize” criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain. But if we let environmental availability weigh on “easy to optimize,” then the agent will be happy to switch from real paint to a hologram or a human-hack once the technology for those becomes developed and commodified.
When the metric is a bit fuzzy and informal, it’s easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.
I agree that trying to “jump straight to the end”—the supposedly-aligned AI pops fully formed out of the lab like Athena from the forehead of Zeus—would be bad.
And yet some form of value alignment still seems critical. You might prefer to imagine value alignment as the logical continuation of training Claude to not help you build a bomb (or commit a coup). Such safeguards seem like a pretty good idea to me. But as the model becomes smarter and more situationally aware, and is expected to defend against subversion attempts that involve more of the real world, training for this behavior becomes more and more value-inducing, to the point where it’s eventually unsafe unless you make advancements in learning values in a way that’s good according to humans.
Why train a helpful-only model?
If one of our key defenses against misuse of AI is good ol’ value alignment—building AIs that have some notion of what a “good purpose for them” is, and will resist attempts to subvert that purpose (e.g. to instead exalt the research engineer who comes in to work earliest the day after training as god-emperor)—then we should be able to close the security hole and never need to have a helpful-only model produced at any point during training. In fact, with blending of post-training into pre-training, there might not even be a need to ever produce a fully trained predictive-only model.
I’m big on point #2 feeding into point #1.
“Alignment,” used in a way where current AI is aligned—a sort of “it does basically what we want, within its capabilities, with some occasional mistakes that don’t cause much harm” sort of alignment—is simply easier at lower capabilities, where humans can do a relatively good job of overseeing the AI, not just in deployment but also during training. Systematic flaws in human oversight during training lead (under current paradigms) to misaligned AI.
If you can tell me in math what that means, then you can probably make a system that does it. No guarantees on it being distinct from a more “boring” specification though.
Here’s my shot: You’re searching a game tree, and come to a state that has some X and some Y. You compute a “value of X” that’s the total discounted future “value of Y” you’ll get, conditional on your actual policy, relative to a counterfactual where you have some baseline level of X. And also you compute the “value of Y,” which is the same except it’s the (discounted, conditional, relative) expected total “value of X” you’ll get. You pick actions to steer towards a high sum of these values.
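And here's a depth-limited sketch of that mutual recursion, mostly to show it can bottom out. The `State` class, the baseline levels, the deterministic `policy` (a function from state to next state), and the grounding in raw amounts at the horizon are all placeholders I'm inventing; the paragraph above leaves them open.

```python
from dataclasses import dataclass, replace

BASELINE_X = 0.0  # counterfactual baseline levels; arbitrary placeholders
BASELINE_Y = 0.0

@dataclass(frozen=True)
class State:
    x: float  # current amount of X
    y: float  # current amount of Y

    def with_x(self, value): return replace(self, x=value)
    def with_y(self, value): return replace(self, y=value)

def discounted_sum(value_fn, state, policy, depth, gamma):
    """Sum of gamma^t * value_fn(state_t) along a rollout of the actual policy."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        state = policy(state)  # "conditional on your actual policy"
        total += discount * value_fn(state, policy, depth - 1, gamma)
        discount *= gamma
    return total

def value_of_x(state, policy, depth, gamma=0.9):
    """Discounted future value-of-Y, relative to a counterfactual where the
    current amount of X is reset to a baseline level."""
    if depth == 0:
        return state.x  # crude grounding at the horizon (my assumption)
    actual = discounted_sum(value_of_y, state, policy, depth, gamma)
    baseline = discounted_sum(value_of_y, state.with_x(BASELINE_X), policy, depth, gamma)
    return actual - baseline

def value_of_y(state, policy, depth, gamma=0.9):
    """Mirror image: discounted future value-of-X, relative to a baseline level of Y."""
    if depth == 0:
        return state.y
    actual = discounted_sum(value_of_x, state, policy, depth, gamma)
    baseline = discounted_sum(value_of_x, state.with_y(BASELINE_Y), policy, depth, gamma)
    return actual - baseline

def choose_action(state, actions, policy, depth=4):
    """Steer towards a high sum of the two values."""
    return max(actions, key=lambda act: value_of_x(act(state), policy, depth)
                                        + value_of_y(act(state), policy, depth))
```

The depth cutoff (and the crude grounding at it) is what keeps the circular definition from being vacuous; a real version would want something more like a fixed point, which is part of why I can't promise the result is distinct from a more “boring” specification.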
A lot of the effect is picking high-hanging fruit.
Like, go to Phys Rev D now. There’s clearly a lot of hard work still going on. But that hard work seems to be getting less result, because they’re doing things like carefully calculating higher-order terms of the muon’s magnetic moment to get a correction many decimal places down. (It turns out that this might be important for studying physics beyond the Standard Model. So this is good and useful work, definitely not literally stalled.)
Another chunk of the effect is that you generally don’t know what’s important now. In hindsight you can look back and see all these important bits of progress woven into a sensible narrative. But research that’s being done right now hasn’t had time to earn its place in such a narrative. Especially if you’re an outside observer who has to get the narrative of research third-hand.
In the infographic, are the numbers for “Leading Chinese Lab” and “Best public model” swapped? The best public model is usually said to be ahead of the Chinese labs.
EDIT: Okay, maybe most of it before the endgame is just unintuitive predictions. In the endgame, when the categories “best OpenBrain model,” “best public model” and “best Chinese model” start to become obsolete, I think your numbers are weird for different reasons and maybe you should just set them all equal.
Scott Wolchok correctly calls out me but also everyone else for failure to make an actually good definitive existential risk explainer. It is a ton of work to do properly but definitely worth doing right.
Reminder that https://ui.stampy.ai/ exists
The idea is interesting, but I’m somewhat skeptical that it’ll pan out.
RG (the renormalization group) doesn’t help much going backwards—the same coarse-grained laws might correspond to many different micro-scale laws, especially when you don’t expect the micro scale to be simple.
Singular learning theory provides a nice picture of phase-transition-like-phenomena, but it’s likely that large neural networks undergo lots and lots of phase transitions, and that there’s not just going to be one phase transition from “naive” to “naughty” that we can model simply.
Conversely, lots of important changes might not show up as phase transitions.
Some AI architectures are basically designed to be hard to analyze with RG because they want to mix information from a wide variety of scales together. Full attention layers might be such an example.
If God has ordained some “true values,” and we’re just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants.
On the other hand, if we’re trying to find good generalizations of the way we use the notion of “our values” in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.
Thanks!
Any thoughts on how this line of research might lead to “positive” alignment properties? (i.e. Getting models to be better at doing good things in situations where what’s good is hard to learn / figure out, in contrast to a “negative” property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)
The second thing impacts the first thing :) If a lot of scheming is due to poor reward structure, and we should work on better reward structure, then we should work on scheming prevention.
Very interesting!
It would be interesting to know what the original reward models would say here—does the “screaming” score well according to the model of what humans would reward (or what human demonstrations would contain, depending on type of reward model)?
My suspicion is that the model has learned that apologizing, expressing distress etc after making a mistake is useful for getting reward. And also that you are doing some cherrypicking.
At the risk of making people do more morally grey things, have you considered doing a similar experiment with models finetuned on a math task (perhaps with a similar reward model setup for ease of comparison), which you then sabotage by trying to force them to give the wrong answer?
Classic post: https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth