EVERYONE, CALM DOWN!
Meaning Alignment Institute just dropped their first post in basically a year and it seems like they’ve been up to some cool stuff.
Their perspective on value alignment really grabbed my attention because it reframes our usual technical alignment conversations around rules and reward functions into something more fundamental—what makes humans actually reliably good and cooperative?
I really like their frame of a moral graph and locally maximally good values as another way of imagining alignment; it’s much closer to what actually happened during cultural evolution, as explored in, for example, The Secret of Our Success. It seems like they’re taking results from evolutionary psychology, morality research, and group selection and applying them to how we align models, and I’m all for it.
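To check that I’m picturing it right, here’s a toy version of how I imagine the moral graph working (my own construction for illustration, not MAI’s actual methodology): values as nodes, directed edges recording that one value was judged wiser than another, and the locally maximal values being the ones nothing else was judged wiser than.

```python
# Toy sketch of a "moral graph": nodes are values; an edge a -> b records that
# value b was judged wiser than value a in some context.
# Names and structure are illustrative only, not MAI's actual data model.
from collections import defaultdict

class MoralGraph:
    def __init__(self):
        self.values = set()
        self.wiser_than = defaultdict(set)  # value -> values judged wiser than it

    def add_judgement(self, value, wiser_value):
        """Record that `wiser_value` was judged wiser than `value`."""
        self.values.update((value, wiser_value))
        self.wiser_than[value].add(wiser_value)

    def locally_maximal(self):
        """Values that nothing else was judged wiser than (the local 'peaks')."""
        return {v for v in self.values if not self.wiser_than[v]}

g = MoralGraph()
g.add_judgement("follow the rules", "understand why the rules exist")
g.add_judgement("understand why the rules exist", "care about the people the rules protect")
print(g.locally_maximal())  # {'care about the people the rules protect'}
```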
It could be especially relevant for thorny problems like multi-agent coordination—just as humans with shared values can cooperate effectively even without explicit rules, AI systems might achieve more robust coordination through genuine internalization of values rather than pure game theory or rule-following.
This is part of my take nowadays: we need more work on things that work in grayer, multi-agent scenarios, since we’re likely heading into a multi-polar future with some degree of slower takeoff.
I’m just going to leave my generic comment on value alignment: working on it is largely a waste of time, because Instruction-following AGI is easier and more likely than value aligned AGI.
Instruction-following includes issuing new instructions to correct mistaken interpretations of previous instructions (corrigibility). That alone makes it vastly safer than taking a swing at value alignment and hoping you got it right on the first try: that agent will try to pursue its values, whether you got the right ones in there or not.
Even if instruction-following weren’t a lot safer (which I’m pretty sure it is, at least in the short term), the people in charge of actually creating and deploying AGIs would use that logic as an excuse to make AGI follow their instructions and therefore their values, rather than the average values of all of humanity (whatever that would mean when you extrapolate it into an unknown future).
And this approach doesn’t prevent going for value alignment in the long term; it just allows superintelligent help in getting it right on the first try. See Intent alignment as a stepping-stone to value alignment.
I have yet to see any real counterarguments to this claim (other than that we should pursue value alignment because it’s better for all of humanity and would prevent misuse of AGI/ASI, which I agree with—as soon as it’s even modestly safe).
If this take is mistaken, I want to know.
Is your view closer to:
1. There are two hard steps (instruction following, value alignment), and of the two, instruction following is much more pressing.
2. Instruction following is the only hard step; if you get that, value alignment is almost certain to follow.
The first. Value alignment is much harder. But it will be vastly easier with smarter-than-human help. So there are two difficult steps, and it’s clear which one should be tackled first.
The difficulty with value alignment is both in figuring out what we actually want, and then in figuring out how to make those values stable in a mind that changes as it learns new things.
I will try to give a longer answer tomorrow (it’s 11 pm my time now), but essentially I believe value alignment will be useful for agentic AI with “heuristic”-like policies. I’m a bit uncertain about the validity of instruction-like approaches here, and for various reasons I believe multi-agent coordination will be easier through this method.
I’m sure that heuristic-like policies will be useful for multi-agent coordination.
Those heuristic policies can be, and probably should be, given as instructions.
The problem is that if you set a capable agent in motion following heuristics or any other central motivation, you’ll probably discover that you got your definitions at least a little wrong—and those differences grow as they’re extrapolated further as the agent learns.
This is the classic problem with alignment. Half-steps toward making autonomous agents “do what we want” are very likely to lead toward disaster. Keeping a human in the loop means making instruction-following the central, top-level goal, with everything else we might want as a contingent subgoal.
This is the classic concept of corrigibility. Being receptive to and following new instructions is the top-level goal that enables it.
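To spell out what I mean by “instruction-following as the top-level goal,” here is a minimal toy sketch (my own illustrative pseudo-architecture, not any real agent framework): every plan is a contingent subgoal, and the newest instruction, including “stop,” always preempts it.

```python
# Minimal sketch: instruction-following as the top-level goal, everything else
# (plans, heuristics, values) as contingent subgoals. All names are illustrative.
from queue import Queue, Empty

def run_agent(initial_instruction, instructions: Queue, plan_steps, execute_step):
    """Follow the most recent human instruction; new instructions preempt old plans."""
    current = initial_instruction
    plan = plan_steps(current)               # subgoals derived from the instruction
    while True:
        try:
            new = instructions.get_nowait()  # top-level goal: check for corrections first
            if new == "stop":
                return                       # corrigibility: stopping always wins
            current = new                    # the latest instruction overrides the old one
            plan = plan_steps(current)       # discard the old plan, even mid-task
        except Empty:
            pass
        if not plan:
            return                           # instruction completed; don't improvise new goals
        execute_step(plan.pop(0))
```

The point of the toy is just the ordering: the check for new instructions sits above anything the agent is currently pursuing.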
Okay, so when I’m talking about values here, I’m actually not saying anything about policies as in utility theory or generally defined preference orderings.
I’m rather thinking of values as a class of locally arising heuristics, or “shards” if you like that language, that activate a certain set of belief circuits in the brain, and similarly in an AI.
What do you mean, more specifically, when you say “an instruction” here? What should that instruction encompass? How do we interpret that instruction over time? How can we compare instructions to each other?
I think that instructions will become too complex to interpret well, especially in more complex multi-agent settings. How do we create interpretable multi-agent systems that we can change over time? I don’t believe direct instruction tuning will be enough, because you run into the problem described in, for example, Cooperation and Control in Delegation Games: each AI gets instructions from one person, but that tells us nothing about the multi-agent cooperation abilities of the agents in play.
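To make the worry concrete, here is a toy version (payoffs invented by me, not taken from the paper): each agent can be perfectly faithful to its own principal’s instruction, and the pair of principals still ends up at defect-defect.

```python
# Toy delegation game: each agent faithfully maximizes its own principal's payoff,
# but per-principal faithfulness says nothing about whether the agents cooperate.
# Payoffs are an invented prisoner's-dilemma-style example, not from the paper.
PAYOFFS = {  # (action_A, action_B) -> (payoff to principal A, payoff to principal B)
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 4),
    ("defect",    "cooperate"): (4, 0),
    ("defect",    "defect"):    (1, 1),
}

def best_response(my_index, other_action):
    """A faithful agent picks whatever maximizes its own principal's payoff."""
    actions = ("cooperate", "defect")
    def my_payoff(a):
        pair = (a, other_action) if my_index == 0 else (other_action, a)
        return PAYOFFS[pair][my_index]
    return max(actions, key=my_payoff)

# Each agent best-responds to the other's cooperation with "defect"...
a = best_response(0, "cooperate")
b = best_response(1, "cooperate")
# ...so jointly they land on (1, 1), worse for both principals than (3, 3).
print(a, b, PAYOFFS[(a, b)])  # defect defect (1, 1)
```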
I think this line of reasoning is valid for AI agents acting in a multi-agent setting where they gain more control over the economy through integration with ordinary people.
I completely agree with you that doing “pure value learning” is not the best approach right now, but I think we need work in this direction to retain control over multiple AI agents working at the same time.
I think deontology/virtue ethics makes societies more interpretable and corrigible; does that make sense? I also believe we are more likely to get a sort of “cultural, multi-agent take-off” than a single-agent one.
Curious to hear what you have to say about that!
I think we’re talking about two different things here. You’re talking about how to have agents interact well with each other, and how to make their principles of interaction legible to humans. I’m talking about how to make sure those agents don’t take over the world and kill us all, if/when they become smarter than we are in every important way.
No, I do think we care about the same thing; I just believe this will happen in a multi-polar setting, so new forms of communication and multi-polar dynamics will be important for it.
Interpretability of these things is obviously important for changing those dynamics. ELK and similar approaches are important for the single-agent case; why wouldn’t they be important for the multi-agent case?
There’s also always the possibility that you can elicit these sorts of goals and values from instructions and create an instruction-based language around them that’s relatively interpretable in terms of which values are being prioritised in a multi-agent setting.
You do, however, get into ELK and misgeneralization problems here; IRL is not an easy task in general, but there might be some neurosymbolic approaches that change prompts to follow specific values?
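For what it’s worth, the thinnest version of that I can imagine is instructions carrying explicit value annotations, so a collective of agents can at least be audited for which values it is prioritising. A toy sketch, purely my own illustrative format:

```python
# Toy "value-annotated instruction" format: each instruction carries explicit
# value tags and priorities, so a collective of agents can be audited for which
# values it is actually prioritising. Field names are illustrative only.
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class Instruction:
    text: str
    issued_by: str
    values: dict = field(default_factory=dict)  # value name -> priority weight

def value_profile(instructions):
    """Aggregate which values a set of agents has been instructed to prioritise."""
    totals = Counter()
    for ins in instructions:
        totals.update(ins.values)
    return totals

log = [
    Instruction("negotiate the contract", "alice", {"honesty": 2.0, "loyalty-to-principal": 1.0}),
    Instruction("review the contract",    "bob",   {"honesty": 1.0, "caution": 2.0}),
]
print(value_profile(log))  # Counter({'honesty': 3.0, 'caution': 2.0, 'loyalty-to-principal': 1.0})
```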
I’m not sure if this is gibberish or not for you, but my main frame for the next 5 years is “how do we steer collectives of AI agents in productive directions for humanity”.
Now I’ve actually skimmed the article. I was not wrong in my initial guess.
The central claim of the article, that what you really want is something like a co-founder with integrity, is comically poorly chosen.
Actual individuals I know have frequently thought they’d like a co-founder with integrity, only to change their mind as that co-founder’s integrity causes them to fight for control of the shared project.
This is an excellent metaphor for the alignment problem, and the problem with value alignment as a goal (see my other comment). If you get it just a little bit wrong, you’ll regret it. A lot. Even if your “co-founder with integrity” merely seizes control of your shared project peacefully, rather than doing what a superintelligence would, and doing it by any means necessary.
To put it another way: this proposal shows the problem with thinking about “aligning” language models and limited language model agents. If you don’t think about the next step, creating a fully competent, autonomous entity, your alignment efforts are dangerous, not useful.
What people really want is an assistant/cofounder that is highly competent and as autonomous as you tell it to be, until you tell it to stop. Integrity means pursuing your values even when your friends beg you to stop. That is exactly what we do not want from AGI.
In my opinion, theoretically, the key to having “safe” humans and “safe” models is to “do no harm” under any circumstances, even when they have power. This is roughly what law is about, and what moral values should be about (in my opinion).