Can you say in slightly more detail how you think the preference synthesizer thing is supposed to work?
zhukeepa
Well, yeah. An idealized version would be like a magic box that’s able to take in a bunch of people with conflicting preferences about how they ought to coordinate (for example, how they should govern their society), figure out a synthesis of their preferences, and communicate this synthesis to each person in a way that’s agreeable to them.
...
Ben Pace
Okay. So, you want a preference synthesizer, or like a policy-outputter that everyone’s down for?
zhukeepa
Yes, with a few caveats, one being that I think preference synthesis is going to be a process that unfolds over time, just like truth-seeking dialogue that bridges different worldviews.
…
zhukeepa
Yeah. I think there's a potentially very relevant detail in my conception of the preference synthesis process, which is that to the extent that individual people in it have deep blind spots that lead them to pursue things at odds with the common good, this process would reveal those blind spots, while also offering forgiveness if you're willing to accept it and change.
I may be totally off, but whenever I read you (zhukeepa) elaborating on the preference synthesizer idea I kept thinking of democratic fine-tuning (paper: What are human values, and how do we align AI to them?), which felt like it had the same vibe. It's late at night here, so rather than butcher their idea by trying to explain it myself, I'll just dump a long quote and a bunch of pics and hope you find it at least tangentially relevant:
We report on the first run of “Democratic Fine-Tuning” (DFT), funded by OpenAI. DFT is a democratic process that surfaces the “wisest” moral intuitions of a large population, compiled into a structure we call the “moral graph”, which can be used for LLM alignment.
We show bridging effects of our new democratic process. 500 participants were sampled to represent the US population. We focused on divisive topics, like how and if an LLM chatbot should respond in situations like when a user requests abortion advice. We found that Republicans and Democrats come to agreement on values it should use to respond, despite having different views about abortion itself.
We present the first moral graph, generated by this sample of Americans, capturing agreement on LLM values despite diverse backgrounds.
We present good news about their experience: 71% of participants said the process clarified their thinking, and 75% gained substantial respect for those across the political divide.
Finally, we’ll say why moral graphs are better targets for alignment than constitutions or simple rules like HHH. We’ll suggest advantages of moral graphs in safety, scalability, oversight, interpretability, moral depth, and robustness to conflict and manipulation.
Our goal with DFT is to make one fine-tuned model that works for Republicans, for Democrats, and in general across ideological groups and across cultures; one model that people all around the world can all consider “wise”, because it’s tuned by values we have broad consensus on. We hope this can help avoid a proliferation of models with different tunings and without morality, fighting to race to the bottom in marketing, politics, etc. For more on these motivations, read our introduction post.
To achieve this goal, we use two novel techniques: First, we align towards values rather than preferences, by using a chatbot to elicit what values the model should use when it responds, gathering these values from a large, diverse population. Second, we then combine these values into a “moral graph” to find which values are most broadly considered wise.
Example moral graph [image], which “charts out how much agreement there is that any one value is wiser than another”.
Also, “people endorse the generated cards as representing their values—in fact, as representing what they care about even more than their prior responses. We paid for a representative sample of the US (age, sex, political affiliation) to go through the process, using Prolific. In this sample, we see a lot of convergence. As we report further down, people overwhelmingly felt well-represented with the cards, and say the process helped them clarify their thinking”, which is why I paid attention to DFT at all.
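To make the moral graph idea a bit more concrete, here's a toy sketch of the kind of structure I imagine: a directed graph where an edge from value A to value B counts how many participants judged A the wiser value in some context, plus a crude score for how broadly each value is considered wise. The value names, the vote format, and the scoring rule here are my own guesses for illustration, not anything from the actual DFT pipeline:

```python
# Toy sketch of a "moral graph" (illustrative only; names and scoring are guesses).
from collections import defaultdict

# Each vote says: in this context, the first value seemed wiser than the second.
votes = [
    ("care for the person", "stick to the rules", "user asks for abortion advice"),
    ("care for the person", "avoid the topic", "user asks for abortion advice"),
    ("inform fully", "avoid the topic", "user asks for abortion advice"),
    ("care for the person", "stick to the rules", "user asks for abortion advice"),
]

# Weighted directed graph: edge (a -> b) counts participants who judged a wiser than b.
edges = defaultdict(int)
for wiser, less_wise, _context in votes:
    edges[(wiser, less_wise)] += 1

# Crude "broadly considered wise" score: net wins across all pairwise judgments.
scores = defaultdict(int)
for (wiser, less_wise), count in edges.items():
    scores[wiser] += count
    scores[less_wise] -= count

for value, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{value}: {score:+d}")
```

Their real process is of course much richer (value cards elicited by a chatbot, per-context graphs, etc.), but this "which value is wiser than which" aggregation is the part that struck me as rhyming with preference synthesis.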
Yeah, I also see broad similarities between my vision and that of the Meaning Alignment people. I'm not super familiar with the work they're doing, but I'm pretty positive on the little bits of it I've encountered. I'd say that our main difference is that I'm focusing on ungameable preference synthesis, which I think will be needed to robustly beat Moloch. I'm glad they're doing what they're doing, though, and I wouldn't be shocked if we ended up collaborating at some point.