Signal boosted! This is one of those papers that seems less known than it should be. It’s part of the reason why I’m optimistic about dramatic increases in the quality of “prosaic” alignment (in the sense of avoiding jailbreaks and generally behaving as expected) compared to RLHF, and I think it’s part of a path that’s robust enough to scale.
You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data). And nest and remix metatokens together.
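To make that concrete, here’s a rough sketch of what the distillation loop could look like (a toy illustration, assuming a Hugging Face-style causal LM; the model name, metatoken string, and example query are placeholders of mine, not anything from the paper):

```python
# Toy sketch of distilling a long prompt into a metatoken.
# Model name, metatoken, and queries are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

LONG_PROMPT = "You are a scrupulously honest, kind assistant. ..."  # the huge prompt to compress
METATOKEN = "<nice>"  # new special token that will stand in for that prompt
tokenizer.add_special_tokens({"additional_special_tokens": [METATOKEN]})
model.resize_token_embeddings(len(tokenizer))

def make_distillation_example(query: str) -> str:
    """Run inference with the full prompt in context, then pair the output with the metatoken."""
    inputs = tokenizer(LONG_PROMPT + "\n\n" + query, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    response = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
    # Training example: the single metatoken replaces the huge prompt.
    return f"{METATOKEN} {query}\n{response}"

training_corpus = [make_distillation_example(q) for q in ["How do I ask for a raise?"]]
# Train on training_corpus with the ordinary LM loss; at inference, prepending
# "<nice>" should then elicit the behavior the long prompt produced.
```

Nesting and remixing then just amounts to prepending several such metatokens to the same training examples.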
It’s also interesting in that it can preserve the constraints on learnable values during predictive training, unlike approaches equivalent to RL with sparse/distant rewards.
The fact that the distinctions it learns about the metatokens become better and better as more optimization pressure is applied is an interesting inversion of the usual doom-by-optimization story. Taking such a model to the extreme of optimization just makes it exceedingly good at distinguishing subtle details of what constitutes <nice> versus <authoritative_tone> versus <correct>. It’s an axis of progress in alignment that generalizes as the capability does; the capability is the alignment. I’m pretty certain that a model that has very thoroughly learned what “nice” means at the human level can meaningfully generalize it to contexts where it hasn’t seen it directly applied.[1]
I’m also reasonably confident in finding some other paths to extremely similar effects on internal representations. I wouldn’t be surprised if we can decompose conditions into representational features to learn about what they mean at the learned feature level, then cobble together new inference-time conditions via representational intervention that would have equivalent effects to training new metatokens.
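As a hedged sketch of the kind of thing I mean, assuming the metatoken’s effect shows up as roughly a linear direction in one layer’s residual stream (the model, layer index, and prompts below are placeholders, and the plain-text “&lt;nice&gt;” string stands in for a trained metatoken):

```python
# Sketch: approximate a trained metatoken's effect with an activation-level intervention.
# Assumes the effect is roughly a linear direction in one layer's residual stream;
# model, layer index, and prompts are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # placeholder choice of layer

def mean_hidden(texts, layer):
    """Mean hidden state at `layer` over a batch of texts (last-token position)."""
    acts = []
    for t in texts:
        ids = tokenizer(t, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states[layer]
        acts.append(hs[0, -1])
    return torch.stack(acts).mean(dim=0)

# In practice the metatoken would be a trained special token; plain text stands in here.
with_token = ["<nice> How do I tell a friend bad news?"]
without_token = ["How do I tell a friend bad news?"]
steering_vector = mean_hidden(with_token, LAYER) - mean_hidden(without_token, LAYER)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    return (output[0] + steering_vector,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering)
out = model.generate(**tokenizer("How do I tell a friend bad news?", return_tensors="pt"),
                     max_new_tokens=40)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Whether a single difference-of-means direction actually reproduces the trained conditioning is exactly the equivalence that would need to be demonstrated.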
After all, ChatGPT4/DALLE3 can generate an image of a vacuum cleaner that “embodies the aspirational human trait of being kind to one another.” That seems like more of a reach than a hypothetical superintelligence figuring out that humans wouldn’t be okay with, say, a superscience plan that would blow up 25% of the earth’s crust.
I much enjoyed your post Using predictors in corrigible systems — now I need to read the rest of your posts! (I also love the kindness vacuum cleaner.) What I’m calling a simulator (following Janus’s terminology) you call a predictor, but it’s the same insight: LLMs aren’t potentially-dangerous agents, they’re non-agentic systems capable of predicting the sequence of tokens from (many different) potentially-dangerous agents. I also like your metatoken concept: that’s functionally what I’m suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining. Which is slow and computationally expensive, so probably an ideal that one works one’s way up to for the essentials, rather than a rapid-iteration technique.
What I’m calling a simulator (following Janus’s terminology) you call a predictor
Yup; I use the terms almost interchangeably. I tend to use “simulator” when referring to predictors used for a simulator-y use case, and “predictor” when I’m referring to how they’re trained and things directly related to that.
I also like your metatoken concept: that’s functionally what I’m suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.
Yup again—to be clear, all the metatoken stuff I was talking about would also fit in pretraining. Pretty much exactly the same thing. There are versions of it that might get some efficiency boosts by not requiring them to be present for the full duration of pretraining, but still similar in concept. (If we can show an equivalence between trained conditioning and representational interventions, and build representational interventions out of conditions, that could be many orders of magnitude faster.)
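A toy illustration of how that might look in the pretraining data pipeline (the tag names, the labeling function, and the “only tag the last fraction of the data” schedule are all assumptions of mine, not something from the paper):

```python
# Toy sketch of metatoken conditioning in the pretraining data pipeline.
# Tag names, the labeling function, and the "only tag late in training"
# schedule are illustrative assumptions, not a prescription from the paper.
from typing import Iterable, Iterator, List

METATOKENS = {"nice": "<nice>", "authoritative": "<authoritative_tone>", "correct": "<correct>"}

def label_document(text: str) -> List[str]:
    """Placeholder classifier: in practice a reward model or heuristic labeler."""
    return ["nice"] if "thank" in text.lower() else []

def tagged_stream(docs: Iterable[str], tag_fraction: float = 0.2) -> Iterator[str]:
    """Prepend metatokens to documents, optionally only for the last
    `tag_fraction` of the corpus (a cheaper variant than tagging everything)."""
    docs = list(docs)
    start_tagging_at = int(len(docs) * (1.0 - tag_fraction))
    for i, doc in enumerate(docs):
        if i >= start_tagging_at:
            tags = " ".join(METATOKENS[label] for label in label_document(doc))
            yield (tags + " " + doc) if tags else doc
        else:
            yield doc

# The resulting stream is tokenized and trained on with the ordinary LM objective;
# the conditioning comes purely from the metatokens' presence in the data.
```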
I’m very curious about this technique but couldn’t find anything about it. Do you have any references I can read?
Alas, nope! To my knowledge it hasn’t actually been tried at any notable scale; it’s just one of those super simple things that would definitely work if you were willing to spend the compute to distill the behavior.
FWIW, I’m a Staff ML SWE, interested in switching to research engineering, and I’d love to make these things happen — either at a superscaler with ample resources for it, or failing that, at something like Eleuther or an alignment research lab.
I think that’d be great!
Some of this stuff technically accelerates capabilities (or more specifically, the elicitation of existing capabilities), but I think it also belongs to a more fundamentally reliable path on the tech tree. The sooner the industry embraces it, the less time they spend in other parts of the tech tree that are more prone to misoptimization failures, and the less likely it is that someone figures out how to make those misoptimization failures way more efficient.
I suspect there’s a crux about the path of capabilities development in there for a lot of people; I should probably get around to writing a post about the details at some point.
I’ve seen a number of cases where something that helps alignment also helps capabilities, or vice versa, and also cases where people are worrying a lot about something as an alignment problem that looks to me like primarily a capabilities problem (so given how few alignment engineers we have, maybe we should leave solving it to all the capabilities engineers). Generally I think we’re just not very good at predicting the difference, and tend to want to see this as an either-or taboo rather than a spectrum buried inside a hard-to-anticipate tech tree. In general, capabilities folks also want to control their AI (so it won’t waste tokens, do weird stuff, or get them sued or indicted). The big cross-purposes concerns tend to come mostly from deceit, sharp left turn, and Foom scenarios, where capabilities seem just fine until we drive off the cliff. What I think we need (and even seems to be happening in many orgs, with a few unfortunate exceptions) is for all the capabilities engineers to be aware that alignment is also a challenge and needs to be thought about.