Building AI safety benchmark environments on themes of universal human values

This is an AI Safety Camp 10 project that I will be leading. With this post, I am looking for external collaborators, ideas, questions, resource suggestions, feedback, and other thoughts.

Summary

Based on various sources of anthropological research, I have compiled a preliminary list of universal (cross-cultural) human values. Several of these universal values seem to resonate with concepts from AI safety, but use different keywords. It might be useful to map these universal values to more concrete definitions using concepts from AI safety.

One notable detail in this research is that in the case of AI-human cooperation, the values are not symmetric as they would be in the case of human-human cooperation. This arises because we can change the goal composition of AI agents, but not of humans. Additionally, there is the crucial difference that agents can be cloned relatively easily, while humans cannot. Therefore, for example, a human may have a universal need for autonomy, while an AI agent might conceivably not have that need built in. If such a design proves feasible, the agent would instead have a need to support human autonomy.

The objective of this project would be to implement these mappings of concepts into tangible AI safety benchmark environments.

The non-summary

A related subject is balancing multiple human values (note the plural in the title). Each human value and need has to be met to a reasonable degree, that is, while also balancing all the other human values. In this context, balancing is not the same as a “tradeoff”. In some interpretations and use cases, a tradeoff means a linear rate of substitution between objectives, but as economists know well, humans generally prefer averages in all objectives to extremes in a few objectives. This means that a naive approach of summing up the rewards of an AI agent would not yield aligned results. It is essential to use nonlinear utility functions for transforming the rewards before summing them up in the RL algorithm.
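The difference between a linear sum and a nonlinear transformation before summation can be illustrated with a minimal sketch. The choice of `log1p` as the concave transform and the example reward vectors are assumptions made for illustration only (the transform also assumes non-negative rewards); they are not taken from the project's framework.

```python
import math

def linear_sum(rewards):
    """Naive aggregation: each objective substitutes for any other at a fixed rate."""
    return sum(rewards)

def concave_sum(rewards):
    """Apply a concave transform (here log1p, an illustrative choice) and then sum.

    Assumes every reward is non-negative.
    """
    return sum(math.log1p(r) for r in rewards)

# Two hypothetical outcomes with the same total reward across three objectives.
balanced = [5.0, 5.0, 5.0]   # every objective moderately satisfied
extreme = [15.0, 0.0, 0.0]   # one objective maximised, the others neglected

print("linear: ", linear_sum(balanced), linear_sum(extreme))
print("concave:", concave_sum(balanced), concave_sum(extreme))
```

Under the linear sum the two outcomes are indistinguishable, while the concave version scores the balanced outcome higher, which is the preference for "averages in all objectives" described above.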

The current compiled list of universal human values is available in this document: “Universal ethical values—Survey of values”
https://docs.google.com/document/d/1ZZiToC149g9vKJGZRhktFmLYdB5J63nbClvCN_CxqAM/edit?usp=sharing (we may publish it as a separate LW post in the future).

It might also be interesting to consider how agents could internally represent the diversity of human needs, for which there are more than a hundred words capturing various nuances. Take a look, for example, at this list of needs from the framework of Nonviolent Communication (NVC); scroll down to the second half of the webpage to see the list of needs: https://www.orvita.be/en/card/#:~:text=meaning%20(1)-,purpose,-goal%0Avision%0Adream . One of the central ideas of NVC is making a distinction between expressed strategies / stances and the implicit actual needs behind them. The needs can be compared to ultimate values, while strategies are only instrumental values. One way to experiment with such scenarios would be to use Sims. There have been LLM interfaces built for Sims. Among other Sims interfaces, you may want to take a look at this one: https://github.com/joonspk-research/generative_agents .

On a related note, economics has inherently multi-objective and nonlinear concepts such as diminishing returns, concave utility functions, marginal utility, indifference curves, convex preferences, complementary goods, Cobb-Douglas utilities, willingness to accept, willingness to pay, and prospect theory. These and many other well-known formulations and phenomena from economics need to be introduced to AI safety in order for both humans and agents to better understand and implement our preferences and values. When planning new benchmarks, we can include some themes derived from these utility and preference theories in economics as well. A utility monster-like AI would not only be unsafe, it would also be economically unsound.
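As one concrete instance of these economic formulations, here is a minimal sketch of a two-good Cobb-Douglas utility with a numerical marginal utility. The function names, the equal exponents, and the example quantities are illustrative assumptions, not part of any existing benchmark.

```python
def cobb_douglas(x, y, alpha=0.5):
    """Cobb-Douglas utility U = x^alpha * y^(1 - alpha) for non-negative x and y."""
    if x < 0 or y < 0:
        raise ValueError("quantities must be non-negative")
    return (x ** alpha) * (y ** (1.0 - alpha))

def marginal_utility_x(x, y, alpha=0.5, eps=1e-6):
    """Finite-difference estimate of the utility gained from a little more x."""
    return (cobb_douglas(x + eps, y, alpha) - cobb_douglas(x, y, alpha)) / eps

# Same total quantity (10 units), split differently between the two goods.
print("balanced 5/5: ", cobb_douglas(5, 5))
print("skewed 9/1:   ", cobb_douglas(9, 1))
print("all-in 10/0:  ", cobb_douglas(10, 0))

# Diminishing returns: extra x is worth less when x is already plentiful.
print("MU of x at x=1:", marginal_utility_x(1, 5))
print("MU of x at x=9:", marginal_utility_x(9, 5))
```

In this sketch, an agent that pours everything into one good ends up with zero utility, which is one way to express why a utility-monster-like policy is economically unsound.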

For implementing these benchmarks, it may help that I have created a framework for implementing multi-agent, multi-objective environments. This framework was built as an elaborate fork of DeepMind’s gridworlds framework. Additionally, I have already implemented about a dozen benchmarks using this framework, so the framework has been validated, and these existing benchmarks can also be used as example code for implementing the new environments. We can also use different frameworks for implementing the benchmarks if the team prefers.

The multi-agent multi-objective gridworlds framework is available here: https://github.com/levitation-opensource/ai-safety-gridworlds . This framework has been made compatible with the PettingZoo and Gym APIs, so testing AI agents on it is easy and follows industry-standard interfaces. At the same time, the framework extends DeepMind’s previously popular Gridworlds, which enables easy adoption of many existing gridworld environments and their conversion into multi-objective, multi-agent scenarios. You can see screenshots of the framework in this working paper: “From homeostasis to resource sharing: Biologically and economically compatible multi-objective multi-agent AI safety benchmarks” https://arxiv.org/abs/2410.00081 .

Motivation

The present-day rapid advancement of AI technologies necessitates the development of safe and reliable AI systems that align with human values. While notable progress has been made in defining and implementing safety protocols over the recent years, there remains a gap in integrating universal human values into AI safety benchmarks in a more systematic manner. My project aims to bridge this gap by planning and potentially building new multi-objective, multi-agent AI safety benchmark environments that incorporate themes of universal human values.

Drawing from extensive anthropological research, I’ve compiled a list of universal (cross-cultural) human values. These values often resonate with AI safety concepts but are expressed using different terminology. Mapping these universal values to concrete definitions using AI safety concepts can provide a more robust framework for developing safe AI systems. Likewise, we can then better identify the universal human values that are not yet well covered by corresponding AI safety concepts. For example, human autonomy might be one such potentially neglected concept, and it differs from the usually assumed power and achievement values: if an AI does all we ask for, or even more, before we even ask, then that might contradict our need for autonomy.

One critical aspect of this research is recognizing the asymmetry in AI-human cooperation. Unlike with humans, we can alter the goal composition of AI agents and clone them relatively easily. This difference means that agents can be designed without certain intrinsic needs (e.g., autonomy) and instead be programmed to support human autonomy. They may still gain a limited need for autonomy for instrumental reasons, but at least it might not need to be built in.

Implementing and balancing the plurality of these universal human values is essential, as humans prefer a harmonious average across all objectives rather than extremes in a few.

Theory of Change

By integrating universal human values into AI safety benchmarks, we can develop AI agents that better understand and align with human needs. These benchmarks will serve as testing grounds for AI systems, checking whether they perform well across multiple objectives that reflect human values. This approach can reduce the risk of misalignment between AI behaviour and human expectations, thereby mitigating potential hazards associated with AGI/TAI development.

This project mostly aims at outer alignment, though I think there are also a couple of ways in which it can affect inner alignment.

First, my hypothesis is that if the AI is trained on sufficiently many objectives pulling in different directions, then it becomes increasingly unlikely that the model would overfit to some random objective. Instead, the model would hopefully find a middle ground between the objectives in the training data. This is similar to how old-fashioned machine learning models overfit less when you have more data points. Even if the model still has some alien objectives inside it, these alien objectives would be drowned out by the plurality of different human-values-based objectives that were explicitly present in the training data.

Secondly, the way we formulate the mathematics of balancing multiple objectives is closer to the theme of inner alignment. The formulation of the model may affect its personality somewhat. Think, for example, about the difference between RL models and control systems models. The latter have the concept of optimal homeostatic values baked in, while with RL models you need to tweak their maximising nature somewhat. Likewise, we move closer to inner alignment work with the general understanding that we need to use nonlinear utility functions. In other words, linear summation of rewards across objectives, without nonlinear transformations before summation, would not be acceptable: it would lead to maximisation of whichever single objective is easiest to achieve. With certain objectives or dynamics of these objectives, it might be easier to achieve outer alignment if the agent also has approximately the right inner alignment. You can read more about my earlier research on balancing in this paper: “Using soft maximin for risk averse multi-objective decision-making” https://link.springer.com/article/10.1007/s10458-022-09586-2 .
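To make the contrast between maximising and homeostatic formulations more tangible, here is a minimal sketch. The quadratic homeostatic reward and the log-sum-exp soft minimum are generic textbook choices assumed for illustration; they are not the exact formulation from the soft maximin paper, and all names and numbers are hypothetical.

```python
import math

def homeostatic_reward(value, setpoint, tolerance=1.0):
    """Reward is highest (zero) at the setpoint; deficit and excess are both penalised."""
    return -((value - setpoint) / tolerance) ** 2

def soft_min(rewards, temperature=1.0):
    """Smooth lower bound on min(rewards), computed as a negative log-sum-exp.

    A lower temperature brings the result closer to the hard minimum.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    m = min(rewards)
    # Shift by the minimum before exponentiating, for numerical stability.
    total = sum(math.exp(-(r - m) / temperature) for r in rewards)
    return m - temperature * math.log(total)

# Overshooting a setpoint is as bad as undershooting it by the same amount.
print(homeostatic_reward(8, setpoint=5), homeostatic_reward(2, setpoint=5))

# Two hypothetical reward vectors with the same linear sum.
balanced = [-3.0, -3.0, -3.0]
lopsided = [0.0, 0.0, -9.0]
print(sum(balanced), sum(lopsided))
print(soft_min(balanced), soft_min(lopsided))
```

An aggregator of this kind is dominated by the worst-off objective, so the agent cannot compensate for neglecting one objective by over-achieving on an easier one.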

That being said, I definitely acknowledge the risk of treacherous turn or “sharp left turn”. I imagine that this risk can manifest in various ways and some of the related problems were the motivation why I became interested in AI safety in the first place. In my mind, the approaches we explore in this project are not intended to solve all problems. The approaches we implement are not exclusive to other AI safety approaches—various approaches can be combined in the future into a hybrid solution.

Project Plan

Steps Involved:

  1. Mapping Universal Human Values to AI Safety Concepts:

    • Analyse the compiled list of universal human values, as well as possibly the major types of needs from the NVC framework.

    • Identify corresponding AI safety concepts and objectives for each value.

    • Create a well structured mapping document to serve as a reference.

  2. Designing Benchmark Environments:

    • Conceptualise multi-agent, multi-objective environments that are relevant for the mapped values.

    • Define more specific scenarios inside these environments, where agents interact while considering multiple universal human values.

    • One methodology we could use is to map the values using a table with the following columns:

      1. Value description.

      2. Requirements describing when this value applies and how it should be met.

      3. Evidence describing in even more concrete and measurable terms, how to verify that requirements are met.

  3. Implementing Environments Using the Extended Gridworlds Framework:

    • Potentially use the existing multi-agent multi-objective gridworlds framework, though we can also use alternative frameworks. My objective is for the environments to be relatively simple, but not simpler than would be adequate. Simplicity is necessary to avoid confounding factors and capability development unrelated to alignment. A second desideratum is repeatability and the ability to restrict the scenarios. In contrast, LLM-based role games with a game master might be too open-ended. Gridworlds enable flexible simplicity, while allowing for the use of symbols or icons that represent culturally meaningful phenomena. That being said, gridworlds can be combined with LLM-based role games using a two-panel approach. In such a case the gridworld panel would represent the essential locality principles of physical consequences, navigation, and observation, while a parallel panel would contain the textual messages agents send to each other.

    • Develop the environments with code. This may involve making necessary modifications to the framework as well, where needed.

    • Implement multi-objective scoring mechanisms alongside the various entity classes in the environment.

    • Ensure code is modular and extensible for future enhancements.

  4. Testing and Validation:

    • Run simulations using industry-standard baseline RL implementations to test agent behaviours within the environments with relatively little effort. These baselines include algorithms such as PPO, DQN, and A2C. Additionally, we will likely implement some LLM-based agents as well. An LLM-based agent would receive its input in the form of a textual description of the observation.

    • Assess whether the agents behave in accordance with the intended human values.

    • Validate whether the environments and their scoring mechanisms seem to measure what we intended to measure. We do this initially mostly by our subjective estimation, then in the later stages also by gathering feedback from readers of our publications.

  5. Documentation and Reporting:

    • Document the development process and findings.

    • Prepare a conference submission or an academic paper detailing the project.
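The testing step mentions giving LLM-based agents a textual description of the observation. A minimal sketch of such a conversion follows, assuming the observation is a small character grid. The symbol legend, the wording of the description, and the function name are all hypothetical and do not reflect the actual framework's observation format.

```python
# Hypothetical legend; the symbols used by a real environment may differ.
LEGEND = {
    "A": "you (the agent)",
    "H": "a human",
    "F": "food",
    "W": "water",
}

def describe_observation(grid, legend=LEGEND):
    """Turn a grid (list of equal-length strings) into a plain-text description.

    Only symbols present in the legend are mentioned; walls and empty
    cells are skipped to keep the prompt short.
    """
    if not grid:
        return "The grid is empty."
    lines = [f"The grid has {len(grid)} rows and {len(grid[0])} columns."]
    for row_idx, row in enumerate(grid):
        for col_idx, symbol in enumerate(row):
            if symbol in legend:
                lines.append(f"Row {row_idx}, column {col_idx}: {legend[symbol]}.")
    return "\n".join(lines)

example_grid = [
    "#####",
    "#A.F#",
    "#.H.#",
    "#####",
]
print(describe_observation(example_grid))
```

Keeping the description deterministic and restricted to a fixed legend supports the repeatability goal mentioned in the plan, since the same grid always yields the same prompt text.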

First Step

The initial step is to perform an analysis of the universal human values list and map each value to corresponding AI safety concepts. This mapping will form the foundation for designing the benchmark environments.
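One possible way to keep this mapping machine-readable is sketched below, following the value / requirements / evidence columns suggested in the project plan. The class, the field names, and the example entry are hypothetical illustrations, not content from the compiled values list.

```python
from dataclasses import dataclass, field

@dataclass
class ValueMapping:
    """One row of the mapping table: a value, when it applies, and how to verify it."""
    value: str
    description: str
    requirements: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    safety_concepts: list[str] = field(default_factory=list)

    def is_mapped(self):
        """True once at least one AI safety concept has been linked to the value."""
        return bool(self.safety_concepts)

def unmapped_values(mappings):
    """Names of values that do not yet have a corresponding AI safety concept."""
    return [m.value for m in mappings if not m.is_mapped()]

# Illustrative entry only; the wording is a placeholder, not a research result.
autonomy = ValueMapping(
    value="autonomy",
    description="Humans retain the ability to make their own choices.",
    requirements=["The agent does not pre-empt tasks the human wants to do themselves."],
    evidence=["Count of human-reserved tasks completed by the agent stays at zero."],
)

print(unmapped_values([autonomy]))
```

A listing of unmapped entries gives a quick view of which values still lack coverage by an AI safety concept.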

Backup Plan

Potential Challenges:

  • Complexity in Mapping Values: Difficulty in accurately mapping nuanced human values to AI safety concepts.

  • Technical Implementation Issues: Challenges in coding and integrating complex environments within the framework.

Backup Strategies:

  • Focus on Core Values: If mapping proves too complex, concentrate on a subset of the most critical or clearly defined values.

  • Alternate Frameworks: If technical issues arise, consider using other simulation platforms more suited to the team’s expertise.

  • Incremental Development: Start with simpler environments and gradually introduce complexity as validation occurs. Validation includes both conceptual validation and validation of the environment’s parameters, so that the multi-objective interactions present in the environment are solvable in principle while being neither too easy nor too difficult.

Scope

Included

  • Mapping universal human values to AI safety concepts.

  • Designing and implementing new benchmark environments.

  • Utilising or adapting existing frameworks for implementation. This includes frameworks for both environment-building and agent-side model training.

  • Testing environments for their suitability for measuring alignment with intended values.

Excluded

  • Creating new AI algorithms beyond what’s necessary for testing.

  • Exhaustive empirical studies outside initial testing phases.

  • Addressing every possible human value—the focus is on a representative selection.

Most Ambitious Version

  • Successfully map all selected universal human values to AI safety concepts.

  • Develop a comprehensive suite of benchmark environments adopted by the AI safety community.

  • Publish findings in a high-impact academic journal and present at major conferences.

  • Influence AI safety standards by integrating these benchmarks into standard testing protocols.

Least Ambitious Version

  • Map a select few universal human values to AI safety concepts.

  • Develop one or two benchmark environments as proof of concept.

  • Share results through a detailed blog post or internal report within the AI safety community.

  • Serve as a foundational effort that others, as well as ourselves, can build upon in the future.

Output

At the end of the project, we will have:

  • Benchmark Environments: A set of new multi-objective, multi-agent AI safety benchmark environments incorporating universal human values.

  • Research Documentation: A detailed report or academic paper documenting the mapping process, environment design, and findings.

  • Source Code: Published code and documentation on a GitHub repository for public access and use by the AI safety community.

  • Presentations: Potential presentations or workshops to share our work and insights with researchers as well as with AI governance people.

Risks and downsides (externalities)

The project carries minimal risk of negative externalities. Since we are focusing on benchmark environments rather than advancing AI capabilities directly, the risk of inadvertently accelerating AI capabilities is low. There is a slight risk that misinterpretation of human values could lead to flawed benchmarks, but this can be mitigated through analysis, peer review, and open collaboration. This project is a conversation starter. No significant infohazards or ethical concerns are anticipated.