I would argue that inner alignment problems mean we do not know how to build reliably myopic systems today. We know how to limit the planning horizon for the parts of a system that do explicit planning, but this doesn’t bar other parts of the system from planning. For example, GPT-3 has a time horizon of effectively one token (it is only trying to predict one token at a time). However, it probably learns to plan ahead internally anyway, simply because thinking about the rest of the current sentence (at least) is useful for predicting the next token.
So, a big part of the challenge of creating myopic systems is making darn sure they’re as myopic as you think they are.
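To make the one-token horizon concrete, here’s a minimal sketch of the next-token objective; this is generic PyTorch of my own, not GPT-3’s actual training code, and the function name and shapes are just illustrative. Each position’s loss term scores only its own one-step prediction, so nothing in the objective explicitly rewards setting up easier predictions later.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) model outputs; tokens: (seq_len,) token ids.

    The prediction at position t is scored only against token t+1, and the
    total loss is a sum of independent one-step terms; there is no
    Bellman-style backup of future loss into the current position's target.
    """
    per_position = F.cross_entropy(logits[:-1], tokens[1:], reduction="none")
    return per_position.mean()
```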
I’m curious to dig into your example.
Here’s an experiment that I could imagine uncovering such internal planning:
1. Make sure the corpus has no instances of a token “jrzxd”.
2. Insert long sequences of “jrzxd jrzxd jrzxd … jrzxd” at random locations in the middle of sentences (sort of like introns).
3. Observe whether the trained model predicts “jrzxd” with greater likelihood than its base rate (which we’d presume is because it’s planning to take some loss now in exchange for confidently predicting more “jrzxd”s to follow).
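Concretely, the pipeline might look something like this sketch (my own illustration: `model.token_probability` is a hypothetical interface standing in for however you’d query the trained model, and the insertion rate and run lengths are made-up parameters):

```python
import random

MARKER = "jrzxd"  # token assumed to be absent from the original corpus

def insert_marker_runs(sentence_tokens, p_insert=0.01, run_len=(5, 20)):
    """With probability p_insert, splice a run of MARKER tokens into the
    middle of a sentence (the intron-like intervention described above)."""
    if random.random() > p_insert:
        return sentence_tokens
    run = [MARKER] * random.randint(*run_len)
    pos = random.randint(1, max(1, len(sentence_tokens) - 1))
    return sentence_tokens[:pos] + run + sentence_tokens[pos:]

def mean_first_marker_probability(model, contexts):
    """After training, average the model's probability of MARKER at positions
    where no MARKER has appeared yet, and compare it to MARKER's base rate in
    the corpus. A clearly higher value would be the signature of taking a loss
    now in exchange for confidently predicting the MARKERs that follow."""
    probs = [model.token_probability(ctx, MARKER)  # hypothetical query interface
             for ctx in contexts if MARKER not in ctx]
    return sum(probs) / len(probs)
```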
I think this sort of behavior could be coaxed out of an actor-critic model (with hyperparameter tuning, etc.), but not GPT-3. GPT-3 doesn’t have any pressure towards a Bellman-equation-satisfying model, where future reward influences current output probabilities.
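For contrast, here’s the kind of update that does create that pressure, sketched in its generic one-step actor-critic form (a textbook-style sketch, not a claim about any particular implementation): the Bellman target folds discounted future value into the current step’s training signal, which is exactly the term the per-token loss above lacks.

```python
import torch

def actor_critic_losses(logp_action, value, reward, next_value, gamma=0.99):
    """One-step actor-critic sketch. The target mixes in the (discounted)
    value of the next state, so future reward shapes the current output
    probabilities through the advantage term."""
    target = reward + gamma * next_value.detach()   # Bellman backup
    advantage = (target - value).detach()
    critic_loss = (target - value) ** 2             # regress value toward the backed-up target
    actor_loss = -advantage * logp_action           # future reward influences current policy
    return actor_loss + critic_loss
```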
I’m curious if you agree or disagree and what you think I’m missing.
I think we could get a GPT-like model to do this if we inserted other random sequences into the training data in the same way; it should learn a pattern like “non-word-like sequences that repeat at least twice tend to repeat a few more times”, or something like that.
GPT-3 itself may or may not get the idea, since it does show a significant ability to pick up on local patterns it’s never seen before.
So I don’t currently see what your experiment has to do with the planning-ahead question.
I would say that the GPT training process has no “inherent” pressure toward Bellman-like behavior, but the data provides such pressure, because humans are doing something more Bellman-like when producing strings. A more obvious example would be if you trained a GPT-like system to predict the chess moves of a tree-search planning agent.
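To make that concrete at toy scale, here’s a sketch using Nim in place of chess (my own substitution, purely so the planner fits in a few lines): a full tree-search player generates games, and the resulting move strings are the sort of corpus a GPT-like model would then be trained on with the ordinary myopic next-token loss.

```python
import functools
import random

@functools.lru_cache(maxsize=None)
def winning_move(stones: int, max_take: int = 3):
    """Exhaustive tree search for Nim (take 1..max_take stones per turn;
    whoever takes the last stone wins). Returns a winning move, or None if
    every move loses against best play."""
    for take in range(1, min(max_take, stones) + 1):
        if stones - take == 0 or winning_move(stones - take, max_take) is None:
            return take
    return None

def planner_game_strings(n_games: int = 1000, start_range=(5, 30)):
    """Play games where both sides use the tree-search planner and emit each
    game as a space-separated move string. A model trained on these strings
    with a purely myopic next-token loss is still being pushed to reproduce
    behavior that was produced by lookahead; the Bellman-like pressure comes
    from the data, not the objective."""
    games = []
    for _ in range(n_games):
        stones, moves = random.randint(*start_range), []
        while stones > 0:
            move = winning_move(stones) or 1   # any legal move when already losing
            moves.append(str(move))
            stones -= move
        games.append(" ".join(moves))
    return games
```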