Whether you plan on ensuring your AIs always follow instructions, are in some sense corrigible, have at least some measure of pro-sociality, or are entirely value aligned, you are going to need to know what the values of your AI system are, how you can influence them, and how to ensure they're preserved (or changed only in beneficial ways) when you train them (whether during pretraining, post-training, or continuously during deployment).
This seems like it requires solving a very non-trivial problem of operationalizing values the right way. Developmental interpretability seems like it’s very far from being there, and as stated doesn’t seem to be addressing that problem directly.
I think we can gain useful information about the development of values even without a full and complete understanding of what values are: for example, by studying lookahead, selection criteria between different lookahead nodes, contextually activated heuristics / independently activating motivational heuristics, policy coherence, agents-and-devices-style utility fitting (noting the criticisms), your own AI objective detection (and derivatives thereof), and so on.

The solution to not knowing what you're measuring isn't to give up hope, it's to measure lots of things!
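To make "measure lots of things" slightly more concrete, here is a toy sketch of one such measurement, agents-and-devices-style utility fitting: fit a utility function to a model's observed choices and use how well it predicts them as a crude policy-coherence score. The simulated data, the features, and the Boltzmann choice model are all illustrative assumptions on my part, not a claim about how any of these measurements are actually run.

```python
# A toy version of "measure lots of things": fit a utility function to an agent's
# observed choices (agents-and-devices style) and report a crude policy-coherence
# score. The features, the simulated data, and the Boltzmann choice model are all
# illustrative assumptions, not a claim about how any real evaluation works.
import numpy as np

rng = np.random.default_rng(0)

# Each observation: the feature vectors of the options offered, and which one was picked.
# Features are whatever proxies you can compute (helpfulness score, resource use, ...).
def simulate_choices(true_w, n_obs=200, n_options=4, n_features=3):
    observations = []
    for _ in range(n_obs):
        options = rng.normal(size=(n_options, n_features))
        # The "agent" picks Boltzmann-rationally under its true (hidden) utility.
        logits = options @ true_w
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        choice = rng.choice(n_options, p=probs)
        observations.append((options, choice))
    return observations

def negative_log_likelihood(w, observations):
    nll = 0.0
    for options, choice in observations:
        logits = options @ w
        log_z = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
        nll -= logits[choice] - log_z
    return nll / len(observations)

def fit_utility(observations, n_features=3, lr=0.5, steps=300):
    # Plain finite-difference gradient descent, to keep the sketch dependency-free.
    w = np.zeros(n_features)
    eps = 1e-4
    for _ in range(steps):
        base = negative_log_likelihood(w, observations)
        grad = np.zeros_like(w)
        for i in range(len(w)):
            w_eps = w.copy()
            w_eps[i] += eps
            grad[i] = (negative_log_likelihood(w_eps, observations) - base) / eps
        w -= lr * grad
    return w

true_w = np.array([1.5, -0.5, 0.0])
obs = simulate_choices(true_w)
fitted_w = fit_utility(obs)
coherence = np.exp(-negative_log_likelihood(fitted_w, obs))  # average per-choice likelihood
print("fitted utility weights:", np.round(fitted_w, 2))
print("policy-coherence proxy (1.0 = perfectly predictable):", round(float(coherence), 3))
```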
Alternatively, of course, you could think harder about how to actually measure what you want to measure. I know this is your strategy when it comes to value detection. And I don’t plan on doing zero of that. But I think there’s useful work to be done without those insights, and would like my theories to be guided more by experiment (and vice versa).
RLHF does not optimize for achieving goals

Can't RLHF be seen as optimizing for achieving goals in the world, not just in the sense described in the next paragraph? You're training against a reward model that could be measuring performance on some real-world task.
I mostly agree, though I don’t think it changes too much. I still think the dominant effect here is on the process by which the LLM solves the task, and in my view there are many other considerations which have just as large an influence on general purpose goal solving, such as human biases, misconceptions, and conversation styles.
If you mean that we will watch what happens as the LLM acts in the world and then reward or punish it based on how much we like what it does, that seems like a very slow reward signal to me, and in that circumstance I expect most human ratings to be offloaded to other AIs (self-play), or for there to be advances in RL methods before this happens. Currently my understanding is that this is not how RLHF is done at the big labs; instead they use MTurk interactions + expert data curation (+ self-play via RLAIF / constitutional AI).
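For illustration, here is a cartoon of the "offload ratings to other AIs" loop: a judge model labels preferences between pairs of policy outputs against a constitution, and those labels are what a reward model would then be trained on. The function names and the coin-flip judge are placeholders I made up, not any lab's actual RLAIF pipeline.

```python
# A cartoon of the "offload ratings to another AI" loop (RLAIF-flavored): a judge
# model labels preferences between pairs of policy outputs, and those labels are
# what a reward model would be trained on, with no human in the inner loop.
# `policy_generate` and `judge_prefers` are stand-ins, not any real lab's API.
import random

random.seed(0)

PROMPTS = ["Summarize this contract clause.", "Explain RLHF to a new hire."]
CONSTITUTION = ["be accurate", "be concise", "refuse harmful requests"]

def policy_generate(prompt: str) -> str:
    # Stand-in for sampling from the policy being trained.
    return f"draft answer to '{prompt}' #{random.randint(0, 999)}"

def judge_prefers(prompt: str, a: str, b: str, constitution: list[str]) -> int:
    # Stand-in for asking a judge model which response better satisfies the constitution.
    # Here it is a coin flip; in practice this is another LLM call.
    return random.choice([0, 1])

def collect_preference_data(n_pairs_per_prompt: int = 3):
    dataset = []
    for prompt in PROMPTS:
        for _ in range(n_pairs_per_prompt):
            a, b = policy_generate(prompt), policy_generate(prompt)
            winner = judge_prefers(prompt, a, b, CONSTITUTION)
            chosen, rejected = (a, b) if winner == 0 else (b, a)
            dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset

# These (prompt, chosen, rejected) triples are what reward-model training would consume;
# the human only shows up upstream, in writing the constitution and spot-checking the judge.
prefs = collect_preference_data()
print(f"collected {len(prefs)} AI-labeled preference pairs")
print(prefs[0])
```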
The “data wall” means active learning or self-training will get more important over time

Out of curiosity, are you also lumping under this things like “get more data by having some kind of good curation mechanism for lots of AI outputs, without necessarily doing self-play, and that just works” (say, having one model curate outputs from another, or even light human oversight on outputs)? Not super relevant to the content; I'm just curious whether you would count that under an RL banner and subject to similar dynamics, since that's my main guess for overcoming the data wall.
This sounds like a generalization of decision transformers to me (i.e. condition on the best of the best outputs, then train on those), and I also include those as prototypical examples in my thinking, so yes.
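As a rough sketch of what that curation-as-generalized-decision-transformer picture could look like: sample many outputs, have a curator score them, keep only the top slice, and tag the kept examples with the reward level you want to condition on later. The generator, curator, and reward tag below are made-up placeholders, not a description of anyone's actual training setup.

```python
# A minimal sketch of the curation idea: sample lots of outputs, have a curator
# score them, keep only the top slice, and (decision-transformer style) build a
# fine-tuning set of best-of-the-best examples conditioned on a high-reward tag.
# The generator, the curator, and the "training set" format are all placeholders.
import random

random.seed(0)

def generator_sample(prompt: str) -> str:
    # Stand-in for sampling from the base model.
    return f"candidate {random.randint(0, 9999)} for '{prompt}'"

def curator_score(prompt: str, output: str) -> float:
    # Stand-in for a second model (or light human oversight) scoring the output.
    return random.random()

def curate(prompt: str, n_samples: int = 64, keep_fraction: float = 0.1):
    candidates = [generator_sample(prompt) for _ in range(n_samples)]
    scored = sorted(((curator_score(prompt, o), o) for o in candidates), reverse=True)
    kept = scored[: max(1, int(n_samples * keep_fraction))]
    # Decision-transformer-ish framing: tag the kept examples with the reward level
    # we want the model to reproduce when conditioned on that tag at inference time.
    return [{"prompt": f"<reward=high> {prompt}", "target": o, "score": s} for s, o in kept]

training_set = []
for p in ["Prove the lemma.", "Write a unit test for the parser."]:
    training_set.extend(curate(p))

print(f"kept {len(training_set)} of {2 * 64} sampled outputs for fine-tuning")
print(training_set[0])
```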