“Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer.
What’s more, even calling it a problem statement is misleading, because it’s (almost) the most general problem statement possible for any arbitrary task. If you try to formalize a concept like “doing a task well,” or even “being an entity that acts freely and wants things,” in the most generic terms with no constraints whatsoever, you end up writing down “reinforcement learning.”
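To spell that out: the generic statement is, in my own bare-bones notation, something like “pick a policy π that maximizes expected cumulative reward”:

$$\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\ \middle|\ a_t \sim \pi(\cdot \mid s_t)\right]$$

where the environment dynamics, the reward function r, the horizon T, and the discount γ are all left unspecified. Nearly any notion of “doing the task well” can be poured into that template.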
@nostalgebraist bites that bullet here, and so does Russell & Norvig 3rd edition:

“Reinforcement learning might be considered to encompass all of AI.”

…That’s not my perspective though.
For my part, there’s a stereotypical core of things-I-call-RL which entails:
(A) There’s a notion of the AI’s outputs being better or worse
(B) But we don’t have ground truth (even after-the-fact) about what any particular output should ideally have been
(C) Therefore the system needs to do some kind of explore-exploit (see the toy sketch after this list).
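Here’s that toy sketch, a minimal ε-greedy bandit of my own devising (not drawn from any particular codebase), just to make (A)-(C) concrete: the agent only ever gets a noisy “that scored X” signal, never “the ideal choice was Y,” so it has to trade off exploring against exploiting.

```python
import random

# Toy illustration of (A)-(C), a k-armed bandit (my own example):
#   (A) each pull returns a scalar "better or worse" signal (a noisy reward);
#   (B) the agent is never told which arm was actually best, even afterwards;
#   (C) so it mixes exploring (random arm) with exploiting (best arm so far).

def epsilon_greedy_bandit(true_means, steps=10_000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k        # number of pulls per arm
    estimates = [0.0] * k   # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda i: estimates[i])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)   # feedback, but no ground truth
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates

if __name__ == "__main__":
    print(epsilon_greedy_bandit([0.2, 0.5, 0.9]))
```

With some exploration the value estimates converge toward the true means; with ε = 0 the agent can easily get stuck forever on whichever arm happened to pay off first, which is the point of (C).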
By this (loose) definition, both model-based and model-free RL are central examples of “reinforcement learning”, whereas LLM self-supervised base models are not reinforcement learning (cf. (B), (C)), nor are ConvNet classifiers trained on ImageNet (ditto), nor are clustering algorithms (cf. (A), (C)), nor is A* or any other exhaustive search within a simple deterministic domain (cf. (C)), nor are VAEs (cf. (C)), etc.
(A,B,C) is generally the situation you face if you want an AI to win at videogames or board games, control bodies while adapting to unpredictable injuries or terrain, write down math proofs, design chips, found companies, and so on.
This (loose) definition of RL connects to AGI safety because (B-C) makes it harder to predict the outputs of an RL system. E.g. we can plausibly guess that an LLM base model, given internet-text-like prompts, will continue in an internet-text-typical way. Granted, given OOD prompts, it’s harder to say things a priori about the output. But that’s nothing compared to e.g. AlphaZero or AlphaStar, where we’re almost completely in the dark about what the trained model will do in any nontrivial game-state whatsoever. (…Then extrapolate the latter to human-level AGIs acting in the real world!)
(That’s not an argument that “we’re doomed if AGI is based on RL”, but I do think that a very RL-centric AGI would need tailored approaches to thinking about safety and alignment that wouldn’t apply to LLMs; and I likewise think that a massive increase in the scope and centrality of LLM-related RL (beyond the RLHF status quo) would raise new (and concerning) alignment issues, different from the ones we’re used to with LLMs today.)