Model Integrity: MAI on Value Alignment

Link post

EVERYONE, CALM DOWN!

The Meaning Alignment Institute just dropped their first post in basically a year, and it seems like they’ve been up to some cool stuff.

Their perspective on value alignment really grabbed my attention because it reframes our usual technical alignment conversations about rules and reward functions around something more fundamental: what actually makes humans reliably good and cooperative?

I really like their frame of a moral graph, with locally maximally good values to follow, as another way of imagining alignment; it is much closer to what happened during cultural evolution, as explored in, for example, The Secret of Our Success. It seems like they’re taking results from evolutionary psychology, morality research, and group selection and applying them to how we align models, and I’m all for it.
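
To make the moral-graph idea a bit more concrete, here’s a toy sketch (my own illustration under simplified assumptions, not MAI’s actual formalism, data, or code): treat values as nodes, draw a directed edge when one value is judged wiser than another in some context, and call a value “locally maximal” when nothing surveyed was judged wiser than it. The example values and edges below are made up for illustration.

```python
from collections import defaultdict

# Toy moral graph: a directed edge (A, B) means "B was judged wiser than A"
# in some context. Purely illustrative; not MAI's real graph or survey data.
wiser_than = [
    ("follow the rules",         "protect people from harm"),
    ("protect people from harm", "help people build their own judgment"),
    ("maximize engagement",      "support users' long-term goals"),
]

outgoing = defaultdict(list)
nodes = set()
for worse, better in wiser_than:
    outgoing[worse].append(better)
    nodes.update((worse, better))

# A value is "locally maximal" if no surveyed value was judged wiser than it.
locally_maximal = [v for v in nodes if not outgoing[v]]
print(sorted(locally_maximal))
# -> ['help people build their own judgment', "support users' long-term goals"]
```

The appeal, to me, is that “follow the locally wisest values people actually converge on” looks much more like how norms got selected culturally than like handing a model a fixed rulebook or reward function.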

It could be especially relevant for thorny problems like multi-agent coordination—just as humans with shared values can cooperate effectively even without explicit rules, AI systems might achieve more robust coordination through genuine internalization of values rather than pure game theory or rule-following.

This is part of my take nowadays: we need more work on approaches that hold up in grayer, multi-agent scenarios, since we’re likely heading into a multipolar future with some degree of slower takeoff.