I agree that in the limit of an extremely structured optimizer, it will work in practice, and it will wind up following strategies that you can guess to some extent a priori.
I also agree that in the limit of an extremely unstructured optimizer, it will not work in practice, but if it did, it would find out-of-the-box strategies that are difficult to guess a priori.
But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.
On the contrary, I think it’s possible to design an optimizer which is structured enough to work well in practice, while simultaneously being unstructured enough that it will find out-of-the-box solutions very different from anything the programmers were imagining.
Examples include:
MuZero: you can’t predict a priori what chess strategies a trained MuZero will wind up using by looking at the source code. The best you can do is say “MuZero is likely to use strategies that lead to its winning the game”. (The sketch after these examples makes this concrete.)
“A civilization of humans” is another good example: I don’t think you can look at the human brain’s neural architecture, loss functions, etc., and figure out a priori that a civilization of humans will wind up inventing nuclear weapons. Right?
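To make the MuZero point a bit more concrete, here is a minimal, hypothetical sketch in Python (a tabular self-play learner for tic-tac-toe, nothing like MuZero’s actual networks or search). The only game-specific content in the source is the rules and the “+1 if X wins” signal; whatever tactics the trained table ends up embodying are emergent rather than written down, which is exactly why reading the code doesn’t tell you the strategies.

```python
import random
from collections import defaultdict

# The ONLY domain knowledge: board geometry and the win condition.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == "."]

values = defaultdict(float)  # state -> estimated outcome for X, learned purely from play
counts = defaultdict(int)

def choose_move(board, player, epsilon=0.2):
    """Mostly-greedy move selection over learned successor-state values."""
    moves = legal_moves(board)
    if random.random() < epsilon:
        return random.choice(moves)
    def score(m):
        nxt = board[:m] + player + board[m + 1:]
        return values[nxt] if player == "X" else -values[nxt]
    return max(moves, key=score)

def self_play_episode():
    board, player, history = "." * 9, "X", []
    while True:
        move = choose_move(board, player)
        board = board[:move] + player + board[move + 1:]
        history.append(board)
        w = winner(board)
        if w or not legal_moves(board):
            # The reward is the only "strategic" content: +1 X win, -1 O win, 0 draw.
            return history, (1.0 if w == "X" else -1.0 if w == "O" else 0.0)
        player = "O" if player == "X" else "X"

for _ in range(20000):
    history, outcome = self_play_episode()
    for state in history:  # Monte Carlo update of each visited state toward the outcome
        counts[state] += 1
        values[state] += (outcome - values[state]) / counts[state]
```

Reading this source tells you the objective, not the tactics; the same is true, at vastly greater scale, of MuZero.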
> But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.
I don’t disagree. For clarity, I would make these claims, and I do not think they are in tension:
Whether something is called “RL” is not, by itself, the relevant question for risk; what matters is how much space the optimizer has to roam.
MuZero-like approaches are free to explore more of strategy space than current applications of RLHF. Improved versions of these systems, working in more general environments, have the capacity to do surprising things and will tend to be less ‘bound’ in expectation than RLHF is. Because of that extra space, these approaches are more concerning in a fully general, open-ended environment.
MuZero-like approaches nonetheless remain very distant from a brute-forced policy search, and that difference matters a lot in practice (the toy count after this list gives a sense of the gap).
Regardless of the category of the technique, safe use requires understanding the scope of its optimization. This is not the same as knowing what specific strategies it will use. For example, despite finding unforeseen strategies, you can reasonably claim that MuZero (in its original form and application) will not be deceptively aligned to its task.
Not all applications of tractable RL-like algorithms are safe or wise.
There do exist safe applications of RL-like algorithms.
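On the third claim above, a rough, illustrative count (Python again, with tic-tac-toe standing in for any small game; the numbers illustrate the shape of the gap, not any particular system’s budget) shows how different the brute-force regime is. A literal brute-forced policy search would have to rank every joint deterministic policy, i.e. a fixed move choice at every reachable decision point, and even in this tiny game that set dwarfs the set of positions that exist at all:

```python
from math import log10

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def expand(board, player, seen):
    """Enumerate every position reachable from the empty board."""
    if board in seen:
        return
    seen.add(board)
    if winner(board) or "." not in board:
        return  # terminal position: no move to choose here
    nxt = "O" if player == "X" else "X"
    for i, cell in enumerate(board):
        if cell == ".":
            expand(board[:i] + player + board[i + 1:], nxt, seen)

seen = set()
expand("." * 9, "X", seen)
decision_states = [b for b in seen if not winner(b) and "." in b]

# One independent move choice per decision state => product of legal-move counts.
log_policies = sum(log10(b.count(".")) for b in decision_states)
print(f"reachable positions:          {len(seen)}")
print(f"joint deterministic policies: ~10^{log_policies:.0f}")
```

Guided approaches never confront that second number directly; they only evaluate positions that their own improving play and search actually reach, which is a large part of why they are tractable and why the distance from brute force in the claim above is doing real work.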