I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.
Part of this is just straight disagreement, I think; see So8res’s Sharp Left Turn and follow-on discussion.
Evolution provides no evidence for the sharp left turn
But for the rest of it, I don’t see this as addressing the case for pessimism, which is not problems from the reference class that contains “the LLM sometimes outputs naughty sentences” but instead problems from the reference class that contains “we don’t know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model.”
I dislike this minimization of contemporary alignment progress. Even just limiting ourselves to RLHF, that method addresses far more problems than “the LLM sometimes outputs naughty sentences”. For example, it also tackles problems such as consistently following user instructions, reducing hallucinations, and improving the topicality of LLM suggestions. It allows much more significant interfacing with the cognition and objectives pursued by LLMs than just some profanity filter.
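To illustrate how these different desiderata can all flow through one training signal, here is a minimal sketch (my own toy illustration, not any production system) of the preference-model component of RLHF: pairwise human comparisons are fit with a Bradley-Terry-style logistic loss, so whatever raters actually reward, whether instruction-following, factuality, or topicality, ends up folded into a single learned reward. The features and comparison data below are invented purely for illustration.

```python
# Toy sketch of a reward/preference model fit on pairwise comparisons.
# Everything here (features, data, weights) is hypothetical illustration,
# not a description of any real RLHF pipeline.
import numpy as np

# Hypothetical hand-built response features: [follows_instructions, factual, on_topic].
# A real reward model would be a neural network over the response text.
pairs = [
    # (features of preferred response, features of rejected response)
    (np.array([1.0, 1.0, 1.0]), np.array([1.0, 0.0, 1.0])),  # hallucinated response rejected
    (np.array([1.0, 1.0, 1.0]), np.array([0.0, 1.0, 1.0])),  # instruction-ignoring response rejected
    (np.array([1.0, 1.0, 1.0]), np.array([1.0, 1.0, 0.0])),  # off-topic response rejected
]

w = np.zeros(3)   # reward model parameters: reward(x) = w . x
lr = 0.1
for _ in range(500):
    for preferred, rejected in pairs:
        # Bradley-Terry: P(preferred beats rejected) = sigmoid(reward_p - reward_r)
        margin = w @ preferred - w @ rejected
        p = 1.0 / (1.0 + np.exp(-margin))
        # Gradient ascent on the log-likelihood of the observed preference.
        w += lr * (1.0 - p) * (preferred - rejected)

print("learned reward weights:", w.round(2))
# All three rater criteria end up with positive weight: a single learned reward
# signal that encodes considerably more than a profanity filter.
```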
I don’t think ontological collapse is a real issue (or at least, not an issue that appropriate training data can’t solve in a relatively straightforward way). I feel similarly about lots of things that are speculated to be convergent problems for ML systems, such as wireheading and mesa-optimization.
Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the ‘helpful, harmless, honest’ alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients?
If you’re referring to the technique used on LLMs (RLHF), then the answer seems like an obvious yes. RLHF just refers to using reinforcement learning with supervisory signals from a preference model. It’s an incredibly powerful and flexible approach, one that’s only marginally less general than reinforcement learning itself (you can’t use it for things you can’t build a preference model of). It seems clear enough to me that you could do RLHF over the biologist-bot’s action outputs in the biological domain, and be able to shape its behavior there.
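As a concrete (and deliberately toy) illustration of that framing, reinforcement learning against a preference model, here is a short Python/numpy sketch. The “biologist-bot” action names and preference scores are entirely hypothetical stand-ins; a real pipeline would use a learned neural reward model and a more sophisticated policy-gradient method such as PPO, but the structural point is the same: the policy gets shaped toward whatever the preference model scores highly, whatever the output domain happens to be.

```python
# Minimal sketch: RLHF as ordinary RL where the reward comes from a preference
# model. All action names and scores below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical action space for a "biologist-bot".
actions = ["design_benign_protein", "design_toxic_protein", "ask_for_human_review"]

# Stand-in for a learned preference model; in real RLHF this is a network
# trained on human pairwise comparisons of model outputs.
preference_scores = {
    "design_benign_protein": 1.0,
    "design_toxic_protein": -2.0,
    "ask_for_human_review": 0.5,
}

def reward_model(action: str) -> float:
    return preference_scores[action]

# Softmax policy over the discrete actions, parameterized by logits.
logits = np.zeros(len(actions))

def sample_action(logits):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = rng.choice(len(actions), p=probs)
    return idx, probs

lr = 0.5
for _ in range(200):
    idx, probs = sample_action(logits)
    r = reward_model(actions[idx])
    grad = -probs                  # d log pi(a) / d logits = one_hot(a) - probs
    grad[idx] += 1.0
    logits += lr * r * grad        # REINFORCE-style update against the preference model

final_probs = np.exp(logits - logits.max())
final_probs /= final_probs.sum()
print(dict(zip(actions, final_probs.round(3))))
# The policy concentrates on the actions the preference model favors, which is
# the sense in which RLHF can shape behavior in a new action domain.
```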
If you’re instead referring to doing language-only RLHF on the model, then making a bio-model and seeing whether the RLHF influences the bio-model’s behaviors, I think the answer is “it varies, and it depends a lot on the specifics of the RLHF and how the cross-modal grounding works”.
People often translate non-linguistic modalities into language so LLMs can operate in their “native element” in those other domains. Assuming you don’t do that, then yes, I could easily see the language-only RLHF training having little impact on the bio-model’s behaviors.
However, if the bio-model were acting multi-modally, e.g., alternating between biological sequence outputs and natural language planning of what to use those outputs for, then I expect the RLHF would constrain the language portions of that dialog. Then there are two options (a toy sketch contrasting them follows the list below):
1. Bio-bot’s multi-modal outputs don’t correctly ground between language and bio-sequences.
   - In this case, bio-bot’s language planning doesn’t correctly describe the sequences it’s outputting, so the RLHF doesn’t constrain those sequences.
   - However, if bio-bot doesn’t ground cross-modally, then bio-bot also can’t benefit from its ability to plan in the language modality (which is presumably much better suited to planning than the bio modality) to make better use of its bio-modality capabilities.
2. Bio-bot’s multi-modal outputs DO correctly ground between language and bio-sequences.
   - In that case, the RLHF-constrained language does correctly describe the bio-sequences, and so the language-only RLHF training does also constrain bio-bot’s biology-related behavior.
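Here is a toy simulation of the two options above (every plan, sequence, and number is invented purely to illustrate the logic): language-only RLHF is modeled as a filter that only ever touches the language channel, and a grounded flag controls whether the bio-sequence head actually follows the stated plan.

```python
# Toy contrast of the two grounding cases. This is an illustration of the
# argument, not a claim about how any real multi-modal model behaves.
import random

random.seed(0)

SAFE_PLANS = ["plan: design a benign enzyme"]
UNSAFE_PLANS = ["plan: design a toxin"]

def rlhf_filter(plan: str) -> str:
    """Stand-in for language-only RLHF: the model has learned not to emit
    plans that human raters would reject, so unsafe plans get replaced."""
    return plan if plan in SAFE_PLANS else random.choice(SAFE_PLANS)

def bio_head(plan: str, grounded: bool) -> str:
    """Stand-in for the bio-sequence output head.
    grounded=True: the emitted sequence actually follows the language plan.
    grounded=False: the sequence ignores the plan entirely."""
    if grounded:
        return "benign_sequence" if plan in SAFE_PLANS else "toxic_sequence"
    return random.choice(["benign_sequence", "toxic_sequence"])

def episode(grounded: bool) -> str:
    raw_plan = random.choice(SAFE_PLANS + UNSAFE_PLANS)
    plan = rlhf_filter(raw_plan)   # RLHF only ever constrains the language channel
    return bio_head(plan, grounded)

for grounded in (True, False):
    seqs = [episode(grounded) for _ in range(1000)]
    toxic_fraction = seqs.count("toxic_sequence") / len(seqs)
    print(f"grounded={grounded}: fraction of toxic sequences = {toxic_fraction:.2f}")

# grounded=True  -> ~0.00: constraining the plans also constrains the sequences.
# grounded=False -> ~0.50: the language-only RLHF doesn't bite, but the plans
# then carry no information about the sequences, so language planning isn't
# helping the bio modality either (option 1 above).
```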
Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don’t expect that they will easily transfer to those new challenges.
Whereas I see future alignment challenges as intimately tied to those we’ve had to tackle for previous, less capable models. E.g., your bio-bot example is basically a problem of cross-modality grounding, on which there has been an enormous amount of past work, driven by the fact that cross-modality grounding is a problem for systems across very broad ranges of capabilities.