This seems like it requires solving a very non-trivial problem of operationalizing values the right way. Developmental interpretability seems like it’s very far from being there, and as stated doesn’t seem to be addressing that problem directly.
Alternatively, of course, you could think harder about how to actually measure what you want to measure. I know this is your strategy when it comes to value detection. And I don’t plan on doing zero of that. But I think there’s useful work to be done without those insights, and would like my theories to be guided more by experiment (and vice versa).
RLHF can be seen as optimizing for achieving goals in the world, not just in the sense described in the next paragraph? You're training against a reward model that could be measuring performance on some real-world task.
I mostly agree, though I don't think it changes too much. I still think the dominant effect here is on the process by which the LLM solves the task, and in my view there are many other considerations with just as large an influence on general-purpose goal-solving, such as human biases, misconceptions, and conversation styles.
If you mean that we will watch what happens as the LLM acts in the world and then reward or punish it based on how much we like what it does, then this seems like a very slow reward signal to me, and in that case I expect most human ratings to be offloaded to other AIs (self-play), or for there to be advances in RL methods before this happens. Currently my understanding is that this is not how RLHF is done at the big labs; instead they use MTurk interactions + expert data curation (+ also self-play via RLAIF/constitutional AI).
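To make the "reward model as a proxy for a real-world task" point concrete, here is a toy numpy sketch. Everything in it is made up for illustration (five fake responses, synthetic ratings, a REINFORCE-style update); it is not how any lab actually implements RLHF. The point is just that the policy only ever sees the learned proxy, even though the proxy was fit to ratings of real-world outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 candidate responses, each with some true real-world value.
true_task_value = np.array([0.1, 0.3, 0.9, 0.2, 0.5])

# 1. Collect noisy outcome ratings (a stand-in for humans judging real-world results).
ratings = true_task_value + 0.1 * rng.standard_normal((100, 5))

# 2. "Reward model": here just the mean rating per response.
reward_model = ratings.mean(axis=0)

# 3. REINFORCE-style updates against the reward model, never against the world directly.
logits = np.zeros(5)
for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(5, p=probs)
    advantage = reward_model[a] - probs @ reward_model  # baseline = expected proxy reward
    grad = -probs
    grad[a] += 1.0                                      # gradient of log pi(a) w.r.t. logits
    logits += 0.1 * advantage * grad

print("final policy:", np.round(np.exp(logits) / np.exp(logits).sum(), 3))
# The policy concentrates on the response the reward model scores highest, i.e. the
# one the proxy says does best on the real-world task.
```

Whether you call that "optimizing for goals in the world" or "optimizing a proxy fit to slow, sparse human judgments" is exactly the framing question at issue here.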
Out of curiosity, are you also lumping in things like "get more data by having some good curation mechanism over lots of AI outputs, without necessarily doing self-play, and that just works" (say, having one model curate outputs from another, or even light human oversight on outputs)? Not super relevant to the content; I'm just curious whether you would count that under the RL banner and subject to similar dynamics, since it's my main guess for how the data wall gets overcome.
This sounds like a generalization of decision transformers to me (i.e. condition on the best of the best outputs, then train on those), and I also include those as prototypical examples in my thinking, so yes.
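For concreteness, here is a toy version of that "generate, curate the best of the best, retrain" loop. Everything in it (the generator, the curator, the numbers) is a made-up stand-in rather than anyone's actual pipeline; the point is just to show why repeated curation plus fine-tuning behaves like an RL-flavored update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator: a categorical "policy" over 8 possible outputs.
logits = np.zeros(8)
# Hypothetical curator: a fixed scoring model standing in for a rater model or light human oversight.
curator_score = np.array([0.1, 0.2, 0.15, 0.9, 0.3, 0.85, 0.05, 0.4])

for _ in range(10):
    probs = np.exp(logits) / np.exp(logits).sum()
    # 1. Sample a batch of outputs from the current model.
    samples = rng.choice(8, size=512, p=probs)
    # 2. Curate: keep only the top-scoring quartile (the "best of the best").
    scores = curator_score[samples]
    kept = samples[scores >= np.quantile(scores, 0.75)]
    # 3. "Train" on the curated set: move logits toward its empirical distribution
    #    (a stand-in for supervised fine-tuning on the kept outputs).
    counts = np.bincount(kept, minlength=8)
    target = counts / counts.sum()
    logits += 1.0 * (target - probs)

print("final sampling distribution:", np.round(np.exp(logits) / np.exp(logits).sum(), 2))
# Probability mass concentrates on the outputs the curator rates highest, which is
# the sense in which curation + retraining acts like an (offline) RL update.
```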
I think we can gain useful information about the development of values even without a full and complete understanding of what values are. For example, by studying lookahead, selection criteria between different lookahead nodes, contextually activated heuristics / independently activating motivational heuristics, policy coherence, agents-and-devices-style utility-fitting (noting the criticisms), your own AI objective detection (& derivatives thereof), and so on.
The solution to not knowing what you're measuring isn't to give up hope, it's to measure lots of things!
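As one concrete example of the kind of measurement I have in mind, here is a toy agents-and-devices-flavored utility-fitting probe. The setup is entirely synthetic (made-up "choices" rather than a real model's behavior): fit a Boltzmann-rational utility to observed choices and check how much better it explains the behavior than a no-coherent-preferences baseline does.

```python
import numpy as np

rng = np.random.default_rng(0)

n_options = 6
true_utility = rng.standard_normal(n_options)

def boltzmann(u):
    z = np.exp(u - u.max())
    return z / z.sum()

# Synthetic "observed behavior": choices sampled from a Boltzmann policy over the options.
choices = rng.choice(n_options, size=300, p=boltzmann(2.0 * true_utility))
counts = np.bincount(choices, minlength=n_options)

# For a plain categorical Boltzmann model, the maximum-likelihood utility is just the
# log of the (lightly smoothed) empirical choice frequencies, up to an additive constant.
freqs = (counts + 0.5) / (counts + 0.5).sum()
fitted_utility = np.log(freqs)

# One coherence-flavored diagnostic: how much better the fitted utility explains the
# behavior than a uniform ("no coherent preferences") baseline, per choice.
fit_ll = (counts * np.log(boltzmann(fitted_utility))).sum()
uniform_ll = (counts * np.log(1.0 / n_options)).sum()
print("fitted utility (mean-centered):", np.round(fitted_utility - fitted_utility.mean(), 2))
print("log-likelihood gain over uniform, per choice:", round((fit_ll - uniform_ll) / 300, 3))
```

None of this requires first settling what values "really" are; it is just one of many cheap probes you can run, and disagreement between probes is itself informative.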