I much enjoyed your post Using predictors in corrigible systems — now I need to read the rest of your posts! (I also love the kindness vacuum cleaner.) What I’m calling a simulator (following Janus’s terminology) you call a predictor, but it’s the same insight: LLMs aren’t potentially-dangerous agents, they’re non-agentic systems capable of predicting the sequence of tokens from (many different) potentially-dangerous agents. I also like your metatoken concept: that’s functionally what I’m suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining. Which is slow and computationally expensive, so probably an ideal that one works one’s way up to for the essentials, rather than a rapid-iteration technique.
What I’m calling a simulator (following Janus’s terminology) you call a predictor
Yup; I use the terms almost interchangeably. I tend to use “simulator” when referring to predictors used for a simulator-y use case, and “predictor” when I’m referring to how they’re trained and things directly related to that.
I also like your metatoken concept: that’s functionally what I’m suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.
Yup again—to be clear, all the metatoken stuff I was talking about would also fit in pretraining. Pretty much exactly the same thing. There are versions of it that might get some efficiency boosts by not requiring them to be present for the full duration of pretraining, but still similar in concept. (If we can show an equivalence between trained conditioning and representational interventions, and build representational interventions out of conditions, that could be many orders of magnitude faster.)