That is one form of myopia/non-myopia. Is there some reason you consider that the primary or only definition of myopia?
Personally I think it’s a bit narrow to consider that the only form of myopia. What you’re describing has in previous discussions been called “per-episode myopia”. This is in contrast to “per-step myopia”. I like the explanation of the difference between these in Evan Hubinger’s AI safety via market making, where he also suggests a practical benefit of per-step myopia (bolded):
Before I talk about the importance of per-step myopia, it’s worth noting that debate is fully compatible with per-episode myopia—in fact, it basically requires it. If a debater is not per-episode myopic, then it will try to maximize its reward across all debates, not just the single debate—the single episode—it’s currently in. Such per-episode non-myopic agents can then become deceptively aligned, as they might choose to act deceptively during training in order to defect during deployment. Per-episode myopia, however, rules this out. Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce—once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier. Thus, since AI safety via market making is fully compatible with per-step myopia verification, it could be significantly easier to prevent the development of deceptive alignment.
LLMs like GPT-3 don’t exactly have ‘steps’ but I think next-token myopia as we describe in the “What is myopia?” section would be the analogous thing for these sorts of models. I’m sympathetic to that bolded idea above that per-step/next-token myopia will be easier to verify, which is why I personally am more excited about focusing on that form of myopia. But if someone comes up with a way to confidently verify per-episode myopia, then I would be very interested in that as well.
While we focused on per-step/next-token myopia in the present post, we did discuss per-episode myopia (and/or its foil, cross-episode non-myopia) in a couple places, just so folks know. First in footnote #2:
There is a related notion of a non-myopic language model that could ‘anticipate’ tokens not only in the current generation, but in future generations or “episodes” as well. But for the experiments we present in this post, we are only considering non-myopia as ‘anticipating’ tokens within the current generation.
And then in the Ideas for future experiments section:
Test non-myopia across “multiple episodes”, as suggested by Aidan O’Gara on a draft of this post.[11] We would be really surprised if this turned up positive results for GPT-3 variants, but it could become increasingly interesting to study as models become larger and more sophisticated (depending on the training methods used).
Your definition is:
For a myopic language model, the next token in a prompt completion is generated based on whatever the model has learned in service of minimising loss on the next token and the next token alone.
A non-myopic language model, on the other hand, can ‘compromise’ on the loss of the immediate next token so that the overall loss over multiple tokens is lower—i.e. possible loss on future tokens in the completion may be ‘factored in’ when generating the next immediate token.
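To make the contrast concrete, here is roughly how I read that definition (a sketch in my own notation: $p_\theta$ is the model's predictive distribution, $\ell$ is the per-token loss, and $m$ is how far ahead the model 'looks'). The myopic picture has the model shaped as if it were minimising, at each position $n$, only the immediate term
\[
\mathcal{L}_{\text{myopic}}(n) = \ell\big(p_\theta(\cdot \mid X_{<n}),\, X_n\big),
\]
while the non-myopic picture lets it trade that term off against later tokens in the same completion,
\[
\mathcal{L}_{\text{non-myopic}}(n) = \sum_{i=n}^{n+m} \ell\big(p_\theta(\cdot \mid X_{<i}),\, X_i\big),
\]
accepting a worse value at position $n$ if that lowers the sum.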
Here’s a rough argument for why I don’t think this is a great definition (there may well be holes in it).
If a language model minimises a proper loss for its next-token prediction, then the loss-minimising prediction is $P(X_n \mid X_{<n})$, where $X_n$ is the $n$th token. The proper-loss-minimising prediction for the next tokens $X_{[n,n+m]}$ is $P(X_{[n,n+m]} \mid X_{<n})$, where $P$ is the same probability distribution. Thus with a proper loss there’s no difference between “greedy next-token prediction” and “lookahead prediction”.
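To spell that out a bit more (a sketch, assuming the proper loss is the log score and that we look at the population-level minimiser rather than any particular trained model): the expected log loss is minimised by reporting the true conditional distribution, and by the chain rule
\[
P(X_{[n,n+m]} \mid X_{<n}) = \prod_{i=n}^{n+m} P(X_i \mid X_{<i}),
\]
so predicting one token at a time from the true next-token conditionals induces exactly the same distribution over the block $[n, n+m]$ as predicting the whole block at once. There is no separate 'lookahead' quantity left to optimise for.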
If it’s not minimising a proper loss, on the other hand, there are lots of ways in which it can deviate from predicting with a probability distribution, and I’d be surprised if “it’s non-myopic” was an especially useful way to analyse the situation.
On the other hand, I think cross-episode non-myopia is a fairly clear violation of a design idealisation—that completions are independent conditional on the training data and the prompt.
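In symbols (my phrasing of that idealisation, with $C_1, \dots, C_k$ the completions, $x$ the prompt, and $\mathcal{D}$ the training data): we would expect
\[
P(C_1, \dots, C_k \mid x, \mathcal{D}) = \prod_{j=1}^{k} P(C_j \mid x, \mathcal{D}),
\]
and cross-episode non-myopia is precisely the kind of behaviour that would break this factorisation.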