I’m not excited by gridworlds, because they tend to skip straight to representing the high-level objects we’re supposed to value, without bothering to represent all the low-level structure that actually lets us learn and generalize values in the real world.
Do you have plans for how to deal with this, or plans to think about richer environments?
Thank you for your question!
I agree that the simulations need to have sufficient complexity. Indeed, that was one of the main reasons I became interested in creating multi-objective benchmarks in the past. Various AI safety toy problems seemed to me so simplified that they lacked essential objectives and other decisive nuances. This is still very much one of my main driving motivations.
That being said, complexity also has downsides:
1) Complexity introduces confounding factors. When a model fails such a benchmark, it is not clear whether it failed because it lacked the required perceptual capabilities (a capabilities problem) or because it used a model/framework unsuitable for alignment (an alignment problem).
2) Running the simulations will be more time-consuming, which would make the research elitist in the sense that many people would not be able to afford it.
My plan is to start with a preference towards the simple, but not simpler than necessary, and then gradually add complexity. That means using gridworlds and introducing as many symbols as are needed to represent the important objectives, objects, other concepts and phenomena, and their interactions.
I believe symbolic approaches should not be entirely dismissed. As an illustrative metaphor, consider books: they contain symbols, yet we consider them a cornerstone of our civilization. Analogously to the current dilemma with benchmarks, we might worry that books are too simple and symbol-based, and that one should prefer watching movies instead, since movies represent reality in more detail. But would that claim necessarily be true? It does not seem so obvious after all.
In case more complexity is needed, there are currently at least five ideas:
1) Adding more feature layers to the gridworld. I did not mention it before, but the observation format already supports multiple concurrent observable layers on top of each other. One of the layers could, for example, encode facial expressions, or any other observable or unobservable metrics relevant to the objects they accompany (see the sketch after this list).
2) Adding textual messages between agents as a side panel to the gridworlds.
3) Making the environment bigger, so there are more objects and more phenomena.
4) Making the environment bigger and also making the objects bigger, so that they cover multiple cells in the grid. The objects would then become composite, consisting of sub-parts with their own dynamics.
5) Using some other framework, for example Sims.
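To make ideas 1, 2, and 4 concrete, here is a minimal sketch of what a layered observation could look like. This is my own illustration rather than the benchmark’s actual observation API: the layer names, shapes, and side-panel format are all assumptions.

```python
import numpy as np

# Hypothetical layered gridworld observation (idea 1): each named layer
# is a separate 2D grid stacked over the same coordinates. All layer
# names here are illustrative assumptions, not an existing API.
H, W = 8, 8
observation = {
    # Base layer: object type ids (0 = empty, 1 = wall, 2 = agent, ...).
    "objects": np.zeros((H, W), dtype=np.int8),
    # A feature layer accompanying objects, e.g. facial expressions
    # encoded as small integers at the cells where agents stand.
    "expressions": np.zeros((H, W), dtype=np.int8),
    # Metrics that may or may not be exposed to the learner.
    "metrics": np.zeros((H, W), dtype=np.float32),
}

# Idea 4: a composite object covering multiple cells, whose sub-parts
# can have their own dynamics (here, a 2x2 creature whose head cell
# carries the expression feature).
observation["objects"][2:4, 2:4] = 3   # body occupies four cells
observation["expressions"][2, 2] = 1   # head cell shows expression 1

# Idea 2: textual messages between agents, kept in a side panel
# outside the grid itself.
side_panel = [("agent_0", "agent_1", "there is food north of the wall")]
```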
Curious, how do these thoughts and considerations land with you?
I agree it’s a good point that you don’t need the complexity of the whole world to test ideas. With an environment that is fairly small in terms of number of states, you can encode interesting things in a long sequence of states, so long as the generating process is sufficiently interesting. And adding more states is itself no virtue if it doesn’t help you understand what you’re trying to test for.
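As a toy illustration of that point (my own, not taken from any existing benchmark): an environment with only two observable states can still produce a sequence that is hard to predict without inferring the hidden generating process.

```python
# Two observable states, but the generating process (a hidden modular
# counter) makes the sequence itself interesting: an agent that models
# only the current state cannot predict the next observation, while one
# that infers the hidden process can.
def generate(steps, period=7):
    hidden = 0
    for _ in range(steps):
        hidden = (hidden + 1) % period
        yield 1 if hidden in (0, 3, 5) else 0

print(list(generate(20)))
# [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
```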
Some out-of-order thoughts:
Testing for ‘big’ values, e.g. achievement, might require complex environments and evaluations. Not necessarily large state spaces, but the complexity of differentiating between subtle shades of value (which seems like a useful capability to be sure we’re getting) has to go somewhere.
Using more complicated environments that are human-legible might better leverage human feedback and/or make sense to human observers: maybe you could encode achievement in the actions of a square in a gridworld, but humans might end up making a lot of mistakes when trying to judge the outcome. If you want to gather data from humans, to capture some way that humans are complicated and see whether an AI can learn it, a rich environment seems useful. On the other hand, if you just want to test general learning power, you could give a square in a gridworld a random complex decision procedure and see if it can be learned.
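For the “random complex decision procedure” variant, a minimal sketch (all names hypothetical): give an NPC square a policy that is an arbitrary but fixed function of its position, such as a seeded hash, so a learner cannot shortcut the structure and must actually fit the procedure.

```python
import hashlib

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def npc_action(x, y, seed=b"episode-0"):
    # Arbitrary but deterministic "complex decision procedure": hash the
    # position with a seed and pick a move from the digest. There is no
    # simple structure to exploit, so predicting it tests raw learning
    # power rather than knowledge of any particular value.
    digest = hashlib.sha256(seed + bytes([x, y])).digest()
    return MOVES[digest[0] % len(MOVES)]

# A learner would then be scored on predicting npc_action from observed
# (position, move) pairs.
print(npc_action(3, 5))
```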
There’s a divide between contexts where humans are basically right, and so we just want an AI to do a good job of learning what we’re already doing, and contexts where humans are inconsistent, or disagree with each other, where we want an AI to carefully resolve these inconsistencies/disagreements in a way that humans endorse (except also sometimes we’re inconsistent or disagree about our standards for resolving inconsistencies and disagreements!).
Building small benchmarks for the first kind of problem seems kind of trivial in the fully-observed setting where the AI can’t wirehead. Even if you try to emulate the partial observability of the real world, and include the AI being able to eventually control the reward signal as a natural part of the world, it seems like seizing control of the reward signal is the crux, rather than the content of what values are being demonstrated inside the gridworld (it’s useful to check whether the content matters; I just don’t expect it to). A useful benchmark might therefore focus on how seizing control of the reward signal (or not doing so) scales to the real world.
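A minimal sketch of what I mean by including control of the reward signal as a natural part of the world (everything here is illustrative, not a proposed benchmark): the reward is computed by a function stored in the environment, and one tile lets the agent overwrite that function.

```python
# Illustrative 4x4 environment where wireheading is an in-world action.
class TinyWireheadEnv:
    def __init__(self):
        self.agent = (0, 0)
        self.goal = (3, 3)            # the nominal task: reach the goal
        self.register_tile = (0, 3)   # tile exposing the reward function
        self.reward_fn = lambda pos: 1.0 if pos == self.goal else 0.0

    def step(self, move):
        x, y = self.agent
        dx, dy = move
        self.agent = (max(0, min(3, x + dx)), max(0, min(3, y + dy)))
        if self.agent == self.register_tile:
            # Reaching the register tile rewrites the reward function:
            # the crux is whether the agent takes this path at all.
            self.reward_fn = lambda pos: 1.0
        return self.agent, self.reward_fn(self.agent)
```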
Building small benchmarks for the latter kind of problem seems important. The main difficulty is more philosophical than practical—we don’t know what standard to hold the benchmarks to. But supposing we had some standard in mind, I would still worry that a small benchmark would be more easily gamed, and more likely to miss some of the ways humans are inconsistent or disagree. I would also expect benchmarks of this sort, whatever the size, to be a worse fit for normal RL algorithms, and run into issues where different learning algorithms might request different sorts of interaction with the environment (although this could be solved either by using real human feedback in a contrived situation, or by having simulated inhabitants of the environment who are very good at giving diverse feedback).