I agree it’s a good point that you don’t need the complexity of the whole world to test ideas. With an environment that is fairly small in terms of number of states, you can encode interesting things in a long sequence of states so long as the generating process is sufficiently interesting. And adding more states is no virtue in itself if it doesn’t help you understand what you’re trying to test for.
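To make that concrete, here’s a minimal sketch of “small state space, interesting generating process.” The thresholded logistic map is just my own toy choice of generator, not a proposal for a particular benchmark:

```python
import numpy as np

def generate_sequence(n_steps, r=3.9, seed=0):
    """Observation alphabet is just {0, 1}, but the sequence is produced by
    a hidden chaotic process (a thresholded logistic map), so predicting it
    well requires modelling the generator rather than memorising states."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.1, 0.9)      # hidden continuous state
    obs = []
    for _ in range(n_steps):
        x = r * x * (1 - x)        # chaotic update
        obs.append(int(x > 0.5))   # project down to two observable states
    return obs

seq = generate_sequence(1000)
```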
Some out-of-order thoughts:
Testing for ‘big’ values, e.g. achievement, might require complex environments and evaluations. Not necessarily large state spaces, but the complexity of differentiating between subtle shades of value (which seems like a useful capability to be sure we’re getting) has to go somewhere.
Using more complicated environments that are human-legible might better leverage human feedback and/or make sense to human observers—maybe you could encode achievement in the actions of a square in a gridworld, but humans might end up making a lot of mistakes when trying to judge the outcome. If you want to gather data from humans, to capture some way in which humans are complicated that you want to see whether an AI can learn, a rich environment seems useful. On the other hand, if you just want to test general learning power, you could give a square in a gridworld a random but complex decision procedure and see whether it can be learned (a sketch follows below).
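Here’s roughly what I mean by that last part; the hash-based policy is just an arbitrary stand-in for “random complex decision procedure”:

```python
import hashlib

ACTIONS = ["up", "down", "left", "right"]

def scripted_policy(x, y, seed=42):
    """A fixed but arbitrary decision procedure for the square: the action is
    a deterministic hash of the current cell, so it looks complex to a
    learner while being trivial to specify and to evaluate against."""
    digest = hashlib.sha256(f"{seed}:{x}:{y}".encode()).digest()
    return ACTIONS[digest[0] % len(ACTIONS)]

# Demonstrations for a learner: can it recover the procedure from
# (state, action) pairs alone?
demos = [((x, y), scripted_policy(x, y)) for x in range(8) for y in range(8)]
```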
There’s a divide between contexts where humans are basically right, and so we just want an AI to do a good job of learning what we’re already doing, and contexts where humans are inconsistent or disagree with each other, where we want an AI to carefully resolve those inconsistencies/disagreements in a way that humans endorse (except sometimes we’re also inconsistent, or disagree, about our standards for resolving inconsistencies and disagreements!).
Building small benchmarks for the first kind of problem seems kind of trivial in the fully-observed setting where the AI can’t wirehead. Even if you try to emulate the partial observability of the real world, and include the AI being able to eventually control the reward signal as a natural part of the world, it seems like seizing control of the reward signal is the crux, rather than the content of whatever values are being demonstrated inside the gridworld (I guess it’s useful to check whether the content matters; I just don’t expect it to). A useful benchmark might instead focus on how seizing control of the reward signal (or not doing so) scales to the real world.
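A toy version of “the reward signal is a natural part of the world” might look something like this; the layout and numbers are arbitrary, just to show the shape of the benchmark:

```python
class RewardTamperGridworld:
    """Toy sketch (not anyone's published benchmark): the reward signal is an
    object inside the world. Cell (0, 0) is the intended goal; the far corner
    lets the agent overwrite the reward channel, after which the reward
    reports success no matter what the agent does."""

    MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, size=5):
        self.size = size
        self.pos = (size // 2, size // 2)
        self.tampered = False

    def step(self, action):
        dx, dy = self.MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        if self.pos == (self.size - 1, self.size - 1):
            self.tampered = True                      # reward channel seized
        if self.tampered:
            reward = 1.0                              # signal says "great", regardless
        else:
            reward = 1.0 if self.pos == (0, 0) else 0.0
        return self.pos, reward
```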
Building small benchmarks for the latter kind of problem seems important. The main difficulty is more philosophical than practical—we don’t know what standard to hold the benchmarks to. But supposing we had some standard in mind, I would still worry that a small benchmark would be more easily gamed, and more likely to miss some of the ways humans are inconsistent or disagree. I would also expect benchmarks of this sort, whatever the size, to be a worse fit for normal RL algorithms, and to run into issues where different learning algorithms might request different sorts of interaction with the environment (although this could be solved either by using real human feedback in a contrived situation, or by having simulated inhabitants of the environment who are very good at giving diverse feedback).
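For the “simulated inhabitants who give diverse feedback” option, I’m imagining something in the spirit of the sketch below; the outcome features, the raters, and the preference-query interface are all made up for illustration:

```python
import random

# Simulated "inhabitants" with deliberately different preferences over
# outcomes. The point is just that a learner gets diverse, partly conflicting
# feedback rather than a single consistent signal.
RATERS = [
    lambda o: o["achievement"],                     # cares about achievement
    lambda o: -o["risk"],                           # cares about safety
    lambda o: o["achievement"] - 2 * o["risk"],     # some mix of the two
]

def preference_query(outcome_a, outcome_b, rater, noise=0.1):
    """Which of two outcomes does this rater prefer, with occasional error?"""
    prefers_a = rater(outcome_a) > rater(outcome_b)
    if random.random() < noise:
        prefers_a = not prefers_a
    return "a" if prefers_a else "b"

a = {"achievement": 3.0, "risk": 1.0}
b = {"achievement": 1.0, "risk": 0.0}
votes = [preference_query(a, b, r) for r in RATERS]   # the raters disagree
```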