I much enjoyed your post Using predictors in corrigible systems — now I need to read the rest of your posts! (I also love the kindness vacuum cleaner.) What I’m calling a simulator (following Janus’s terminology) you call a predictor, but it’s the same insight: LLMs aren’t potentially-dangerous agents, they’re non-agentic systems capable of predicting the sequence of tokens from (many different) potentially-dangerous agents. I also like your metatoken concept: that’s functionally what I’m suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining. Which is slow and computationally expensive, so probably an ideal that one works one’s way up to for the essentials, rather than a rapid-iteration technique.
What I’m calling a simulator (following Janus’s terminology) you call a predictor
Yup; I use the terms almost interchangeably. I tend to use “simulator” when referring to predictors used for a simulator-y use case, and “predictor” when I’m referring to how they’re trained and things directly related to that.
I also like your metatoken concept: that’s functionally what I’m suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.
Yup again—to be clear, all the metatoken stuff I was talking about would also fit in pretraining. Pretty much exactly the same thing. There are versions of it that might get some efficiency boosts by not requiring them to be present for the full duration of pretraining, but still similar in concept. (If we can show an equivalence between trained conditioning and representational interventions, and build representational interventions out of conditions, that could be many orders of magnitude faster.)