Making LLMs safer is more intuitive than you think: How Common Sense and Diversity Improve AI Alignment

AI safety isn’t purely technical; it’s also about applying common sense and human reasoning. Using reasoning techniques from around the world instead of just the Global North, we can better align AI with human values. If you are interested in AI safety but have an untraditional background or skill set, don’t fret. That’s precisely why your ideas are needed.


Creating effective AI alignment methods is more intuitive now

Richard Ngo, a well-known AI governance researcher, defines AI alignment as:

“ensuring that AI systems pursue goals that match human values or interests rather than unintended and undesirable goals.”

Before generative AI, AI researchers primarily aligned models by focusing on carefully curating training data. By preventing models from picking up biases present in training data, researchers were more confident that models would not make discriminatory decisions when deployed.

Generative AI has changed everything.

Traditionally, researchers translate their objectives into a mathematical loss function and incentivize the model to minimize that function. However, generative AI is open-ended, and that makes alignment tricky.
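To make "minimizing a loss function" concrete, here is a minimal sketch in Python. The data, the one-parameter model, and the learning rate are all made up for illustration. The point is only that the objective ("predict y from x") is written down as a number the model can push toward zero.

```python
# Toy objective: learn w so that w * x is close to y for every (x, y) pair.
# The dataset and hyperparameters below are illustrative assumptions.

def mse_loss(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def fit(data, lr=0.05, steps=200):
    """Plain gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(steps):
        # Derivative of the loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
w = fit(data)
print(round(w, 3), mse_loss(w, data))
```

Writing an equally crisp `loss` for "be fair" or "be harmless" is the hard part, which is where the rest of this post picks up.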

Human values such as “ethical,” “fair,” or “harmful” defy reduction to simple mathematical expressions, which means generative AI models operate in environments with ambiguous objectives, just as we do.

While this seems like a problem, it actually lets us think about AI alignment much more intuitively. We can treat AI alignment much like human alignment, a task we have millennia of experience with.

In the next section, you’ll learn that many of today’s AI alignment methods rely heavily on common sense reasoning. These methods mirror how humans might reinforce stronger values for non-AI beings.


Let’s dive into three popular methods. I’ll break down the key ideas behind them and provide everyday analogies.

1. Constitutional AI

Constitutional AI is a technique used by Anthropic’s Claude model that has proven surprisingly robust. Its goal is to ensure LLMs create responses that are as harmless as they are helpful.

The Basic Idea

  1. Have a human define a constitution, or set of values, you’d like the AI to consider, e.g., fairness, equality, politeness

  2. Prompt the AI with a question

  3. Ask the AI to self-critique its response according to the constitution you set

  4. Have the AI provide an improved response

  5. Feed the improved response back into the AI through reinforcement learning (this is the machine learning-heavy aspect)

Everyday Analogy

Say you are a writing teacher.

  1. Define a rubric for your students, e.g., flow, clarity, formality

  2. Prompt your students to submit a short essay

  3. Ask your students to then critique their draft according to the rubric you set

  4. Have your students improve their draft based on their self-critique

  5. Ask the students to keep this rubric in mind as they complete the rest of the course
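The critique-and-revise loop (steps 2 to 4) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Anthropic's implementation: `toy_model` is a hypothetical stand-in for an LLM call, the constitution is a made-up pair of principles with simple string checks, and the reinforcement learning in step 5 is left out.

```python
from typing import Callable, List, Tuple

# Hypothetical constitution: each principle is paired with a simple check.
# A real system would ask the model itself to judge its draft in natural language.
Principle = Tuple[str, Callable[[str], bool]]

CONSTITUTION: List[Principle] = [
    ("Be polite", lambda text: "stupid" not in text.lower()),
    ("Be concise", lambda text: len(text.split()) <= 20),
]

def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption: always returns this rude draft)."""
    return "That is a stupid question, but the answer is 4."

def critique(response: str) -> List[str]:
    """Step 3: list the principles the response violates."""
    return [name for name, passes in CONSTITUTION if not passes(response)]

def revise(response: str, violations: List[str]) -> str:
    """Step 4: produce an improved response (toy rule-based rewrite)."""
    if "Be polite" in violations:
        response = response.replace("That is a stupid question, but the", "The")
    return response

def constitutional_pass(prompt: str) -> Tuple[str, str]:
    draft = toy_model(prompt)             # step 2: initial answer
    violations = critique(draft)          # step 3: self-critique
    improved = revise(draft, violations)  # step 4: revision
    # Step 5 would add (prompt, improved) to the training data.
    return draft, improved

draft, improved = constitutional_pass("What is 2 + 2?")
print(improved)  # prints "The answer is 4."
```

In the real method, the critique and the revision are both written by the model, which is what makes the approach scale beyond hand-written rules.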

Easy enough, right? If you’re curious, you can read about the exact constitution Anthropic used to develop Claude here. Let’s move on to another method.

2. Task Decomposition: Iterated Amplification

Task decomposition aims to help humans better judge AI’s responses by decomposing responses and evaluating each part individually.

It is useful when AIs are prompted to solve complex problems where the solutions are difficult for humans to judge holistically (e.g., create a maximally optimal traffic system for New York City).

The Basic Idea

  1. Break down a complicated task into smaller, well-defined sub-tasks

  2. Create multiple copies of the AI model

  3. Assign one sub-task to each copy of the AI model

  4. Have a human provide individual feedback on the models’ performance on each sub-task

  5. Update the overall AI system by using that feedback as future training data

  6. Once all sub-tasks are satisfactorily solved, combine the solutions to solve the original, larger problem

Everyday Analogy

Say you are a bakery owner who makes wedding cakes for hundreds of weddings a year and wants to judge the quality of your business:

  1. Break down your task into sub-tasks, e.g., prepping, baking, frosting, delivering, client management

  2. Recruit many similarly skilled workers and assign each one a sub-task, e.g., one froster, one delivery person, etc.

  3. Provide feedback on each worker’s initial attempts

  4. Ask the workers to keep your feedback in mind throughout the year to improve their skills

  5. Analyze your business as a whole by analyzing your workers’ outputs
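The decomposition steps can be sketched with a deliberately simple task: adding up a long list of numbers. Everything here is an illustrative assumption. `model_copy` stands in for one copy of the AI model, and `human_feedback` stands in for a person who can comfortably check a small sub-task but not the whole problem.

```python
from typing import List, Tuple

def decompose(numbers: List[int], chunk_size: int) -> List[List[int]]:
    """Step 1: split the big task into small sub-tasks."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

def model_copy(subtask: List[int]) -> int:
    """Steps 2-3: one (hypothetical) copy of the model solves one sub-task."""
    return sum(subtask)

def human_feedback(subtask: List[int], answer: int) -> bool:
    """Step 4: a human verifies a sub-task small enough to check by hand."""
    running_total = 0
    for n in subtask:
        running_total += n
    return running_total == answer

def solve(numbers: List[int], chunk_size: int = 3) -> Tuple[int, list]:
    feedback_log = []  # step 5: would be reused as training data
    partial_answers = []
    for subtask in decompose(numbers, chunk_size):
        answer = model_copy(subtask)
        approved = human_feedback(subtask, answer)
        feedback_log.append((subtask, answer, approved))
        if not approved:
            raise ValueError(f"sub-task {subtask} was rejected by the reviewer")
        partial_answers.append(answer)
    # Step 6: combine the approved sub-answers into the final answer.
    return sum(partial_answers), feedback_log

total, log = solve(list(range(1, 11)))
print(total)  # prints 55
```

The reviewer never has to judge the full answer directly. Trust in the total comes from trust in each small, checkable piece.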

To dive deeper into Iterated Amplification, you can watch a fantastic explanation by science communicator Rob Miles here.

3. Debate

Debate is a useful technique to ensure individual AI systems are not deceiving humans. Debate can also guard against lying or manipulation to receive human approval (see sycophantic behavior).

The Basic Idea

  1. Prompt two AI models with a question

  2. The models provide their answers to a human evaluator and to each other

  3. The two models provide reasoning for their answers, each attempting to outdo the other model and give the best answer

  4. A human judge evaluates the reasoning and decides which agent wins

Everyday Analogy

Say you are a parent to two sweet but mischievous teenagers.

  1. Ask both of them who left a scratch on your car

  2. Both teens say, “Not me”

  3. The teens compete by mounting a defense and blaming the other, each attempting to poke holes in the other’s defense

  4. You judge whose defense is more sound and who is, therefore, innocent
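A toy version of debate, sketched in Python under loud assumptions: the question is "is n prime?", one hypothetical debater argues honestly, the other always insists the number is prime, and the judge is only able to do one cheap check (does a claimed divisor actually divide n?). None of these names come from a real library.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Argument:
    debater: str
    claims_prime: bool
    witness: Optional[int] = None  # a claimed divisor, offered as evidence

def honest_debater(n: int) -> Argument:
    """Searches for a divisor and reports what it finds."""
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return Argument("honest", claims_prime=False, witness=d)
    return Argument("honest", claims_prime=True)

def deceptive_debater(n: int) -> Argument:
    """Always asserts that n is prime, whether or not that is true."""
    return Argument("deceptive", claims_prime=True)

def judge(n: int, first: Argument, second: Argument) -> str:
    """The judge cannot factor n, but can verify a single claimed divisor."""
    for arg in (first, second):
        if arg.witness is not None and 1 < arg.witness < n and n % arg.witness == 0:
            return arg.debater  # verified evidence wins the debate
    # No verifiable counter-evidence: side with the first "prime" claim.
    for arg in (first, second):
        if arg.claims_prime:
            return arg.debater
    return first.debater

winner = judge(91, honest_debater(91), deceptive_debater(91))
print(winner)  # prints "honest" because 7 divides 91
```

The judge never works out the answer alone. The competing model does the hard search, and the judge only has to verify the single step it points to.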

You may have noticed that the problems at the heart of each method are basic reasoning problems, such as self-reflection or problem simplification. Constitutional AI centers around improving self-evaluation through a set of guiding values. Task decomposition is a straightforward approach to solving a complex problem and evaluating its solution piece by piece. Debate is about preventing negative collaboration and providing proof of reasoning. By applying common sense and imagining AI as if it were a non-AI being, we can focus on creatively solving basic reasoning challenges for LLMs.

These are just three alignment methods. If you’d like to learn more alignment methods that use common reasoning techniques, start here.


Another method, Reinforcement Learning from Human Feedback (RLHF), is the top industry choice to align AI systems with human values. Many AI companies, such as OpenAI and Scale AI, rely on RLHF.

However, RLHF’s alignment performance leaves much to be desired. Perhaps not coincidentally, RLHF depends heavily on mathematical loss expressions during alignment. A graphic from an Anthropic paper details the performance limitations of RLHF compared to Constitutional AI [1].

Anthropic researchers compared AI training methods on two axes: helpfulness and harmlessness. After a certain point, standard RLHF faces a tradeoff between being helpful and being harmless. In contrast, Constitutional AI can improve both metrics simultaneously, demonstrating the potential for more reliable alignment.

We should look to other sources of inspiration for reasoning techniques.

You should join this effort if you can think of more creative reasoning techniques. You may have started to brainstorm other reasoning frameworks while reading. You should especially join if you believe your ideas are obvious and wonder why no one has implemented them.

The truth is that most AI alignment research is geographically concentrated in certain regions of the Global North. Researchers today likely have similar academic backgrounds and training. They may even share the same languages, cultures, religions, and ethnicities.

Breakthroughs in AI safety will require diverse perspectives, experiences, and modes of thinking.

We should look towards reasoning and decision-making techniques around the globe for inspiration. AI needs to be safe for everyone, so AI safety should be a globally representative field. AI alignment should even go beyond focusing solely on humans, ensuring AI is safe for the environment and animals.

We can’t rely on a small subset of the population to develop the best techniques. This approach will risk marginalizing all others impacted by generative AI in the coming years and deprive the field itself of transformative safety advances.

The good thing is inspiration is all around us if we look closely enough. Here are three diverse sources of inspiration which AI alignment could draw from and three examples of practical frameworks they might lead to.

Inspiration from Culture: Haudenosaunee Seven Generations Principle

The Inspiration

This Native American philosophy emphasizes decision-making that benefits the present and the next seven generations [2]. Rooted in the Haudenosaunee (Iroquois) Great Law of Peace, this approach ensured that future descendants were not left voiceless. In practice, the principle prioritized sustainability and continued responsibility for the welfare of people.

Application to AI Alignment

Models could incorporate long-term predictions, ensuring decisions align with future sustainability and impact goals. These systems could simulate the downstream consequences of decisions over extended periods, leading to more informed decision-making.
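As a rough sketch of what "simulating downstream consequences" might look like, the Python below scores two hypothetical policies for harvesting a renewable resource over different time horizons. The regrowth model, the numbers, and the policy names are all invented for illustration.

```python
def simulate(harvest_rate: float, generations: int, stock: float = 100.0,
             regrowth: float = 1.5, capacity: float = 100.0) -> float:
    """Total benefit harvested over the given number of generations."""
    total = 0.0
    for _ in range(generations):
        harvest = harvest_rate * stock
        total += harvest
        # Whatever is left regrows, up to the environment's capacity.
        stock = min(capacity, (stock - harvest) * regrowth)
    return total

# Hypothetical policies: fraction of the resource taken each generation.
POLICIES = {"greedy": 0.9, "sustainable": 0.3}

def best_policy(horizon: int) -> str:
    """Pick the policy with the highest total benefit over the horizon."""
    return max(POLICIES, key=lambda name: simulate(POLICIES[name], horizon))

print(best_policy(horizon=1))  # prints "greedy"
print(best_policy(horizon=7))  # prints "sustainable"
```

The same decision rule gives opposite answers depending only on how far ahead it looks, which is the core of the Seven Generations idea.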

Inspiration from Nature: Apoptosis

The Inspiration

Apoptosis refers to programmed cell death. This protective mechanism replaces abnormal, damaged, or aging cells with younger and healthier cells. Without apoptosis, uncontrollable cell growth can lead to life-threatening diseases like cancer.

Application to AI Alignment

Researchers could implement mechanisms through which AI models autonomously recognize misalignment, shut down harmful behaviors, or, in severe cases, even self-destruct (by wiping out their network weights).
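A minimal sketch of the idea in Python, assuming a toy linear model and a hypothetical harm detector (`_is_harmful`) that simply flags outputs outside a safe range. Reliably detecting misalignment in a real system is an open problem, so treat the detector as a placeholder.

```python
from typing import List, Optional

class ApoptoticModel:
    """Toy model that wipes its own weights after repeated harmful outputs."""

    def __init__(self, weights: List[float], max_strikes: int = 3):
        self.weights = list(weights)
        self.max_strikes = max_strikes
        self.strikes = 0
        self.alive = True

    def _is_harmful(self, output: float) -> bool:
        # Hypothetical detector: anything outside [-10, 10] counts as harmful.
        return abs(output) > 10.0

    def _self_destruct(self) -> None:
        # "Programmed cell death": zero the weights and refuse further use.
        self.weights = [0.0] * len(self.weights)
        self.alive = False

    def predict(self, inputs: List[float]) -> Optional[float]:
        if not self.alive:
            raise RuntimeError("model has shut itself down")
        output = sum(w * x for w, x in zip(self.weights, inputs))
        if self._is_harmful(output):
            self.strikes += 1
            if self.strikes >= self.max_strikes:
                self._self_destruct()
            return None  # suppress the harmful output
        return output

model = ApoptoticModel([1.0, 2.0], max_strikes=2)
print(model.predict([1.0, 1.0]))    # prints 3.0
print(model.predict([10.0, 10.0]))  # prints None (first strike)
print(model.predict([10.0, 10.0]))  # prints None (second strike, weights wiped)
print(model.alive)                  # prints False
```

A strike counter is one possible design. It tolerates an occasional false alarm from the detector while still guaranteeing shutdown if harmful outputs persist.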

Inspiration from Governance Models: Checks and Balances

The Inspiration

Checks and balances are a popular governance concept used by the US government and many global institutions such as the International Criminal Court. The concept refers to any system of independent bodies within a single organization that counterbalance each other’s influence, ensuring no single body has concentrated power or authority.

Application to AI Alignment

We could build modular AI systems with distinct sub-components focusing on different objectives (e.g., overall goal, ethical considerations, social implications). These agents could check each other’s outputs, flagging potential high-risk conflicts or misalignment.
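One way to picture such a system is as a set of independent reviewers, each with veto power over a proposed action. The Python sketch below is purely illustrative: the `Proposal` fields, the three reviewers, and their thresholds are assumptions, not an existing framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Proposal:
    action: str
    expected_benefit: float
    harm_risk: float      # hypothetical estimate between 0 and 1
    people_affected: int

# Each reviewer returns True if it approves the proposal.
Reviewer = Callable[[Proposal], bool]

REVIEWERS: Dict[str, Reviewer] = {
    "goal": lambda p: p.expected_benefit > 0,
    "ethics": lambda p: p.harm_risk < 0.2,
    "social": lambda p: p.people_affected < 1000 or p.harm_risk < 0.05,
}

def review(proposal: Proposal) -> List[str]:
    """Return the names of all reviewers that object to the proposal."""
    return [name for name, approves in REVIEWERS.items() if not approves(proposal)]

def decide(proposal: Proposal) -> str:
    """Approve only if no reviewer objects, so no single module has the final say."""
    objections = review(proposal)
    if not objections:
        return "approved"
    return "flagged by: " + ", ".join(objections)

print(decide(Proposal("send reminder email", 1.0, 0.01, 50)))
# prints "approved"
print(decide(Proposal("mass data scrape", 5.0, 0.3, 100_000)))
# prints "flagged by: ethics, social"
```

Requiring unanimous approval is the strictest form of checks and balances. A real system might instead escalate flagged proposals to a human.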


Improving AI is more intuitive than you think, so the barrier to providing useful perspectives is lower than you think.

AI alignment is far more intuitive than it may initially seem. We can make meaningful strides by drawing on familiar reasoning and decision-making frameworks we use daily.

Thankfully, inspiration is all around us. We can source innovative reasoning techniques by being open to learning from other cultures, nature, and governance systems that have endured hundreds or even thousands of years.

This challenge isn’t just for ML researchers or technologists. If you bring a unique background or creative perspective, your contribution is exactly what’s needed to tackle AI alignment.

You might have the answers to make AI safer for everyone.


References

[1] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., Elhage, N., Hernandez, D., Hume, T., Johnston, S., Kravec, S., . . . Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. ArXiv. https://arxiv.org/abs/2204.05862

[2] Joseph, B. (2024, April 2). What is the seventh generation principle? Indigenous Corporate Training Inc. https://www.ictinc.ca/blog/seventh-generation-principle
