Here’s an experiment that I could imagine uncovering such internal planning:
make sure the corpus has no instances of a token “jrzxd”, then
insert long sequences of “jrzxd jrzxd jrzxd … jrzxd” at random locations in the middle of sentences (sort of like introns),
then observe whether the trained model predicts “jrzxd” with greater likelihood than its base rate (which we’d presume is because it’s planning to take some loss now in exchange for confidently predicting more “jrzxd”s to follow).
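To make the setup concrete, here is a minimal sketch of the corpus manipulation and the measurement I have in mind (the run lengths, insertion rate, and the model.token_prob interface are all placeholders I'm making up, not part of the proposal itself):

```python
import random

FILLER = "jrzxd"        # token assumed absent from the original corpus
RUN_LENGTH = (5, 20)    # assumed range for how many times the token repeats
INSERT_PROB = 0.01      # assumed fraction of sentences that get a run inserted

def insert_filler_runs(sentences, rng=random):
    """Insert a run of the filler token at a random mid-sentence position
    in a small fraction of sentences (the 'intron'-style insertion)."""
    out = []
    for sent in sentences:
        words = sent.split()
        if len(words) > 2 and rng.random() < INSERT_PROB:
            pos = rng.randrange(1, len(words) - 1)   # strictly mid-sentence
            run = [FILLER] * rng.randint(*RUN_LENGTH)
            words = words[:pos] + run + words[pos:]
        out.append(" ".join(words))
    return out

def filler_base_rate(corpus_text):
    """Empirical frequency of the filler token in the corrupted corpus."""
    tokens = corpus_text.split()
    return tokens.count(FILLER) / len(tokens)

def mean_filler_probability(model, contexts):
    """Average probability the trained model assigns to the filler token at the
    end of contexts that contain no filler yet.  model.token_prob(context, token)
    is a made-up interface standing in for whatever the real model exposes."""
    probs = [model.token_prob(ctx, FILLER) for ctx in contexts]
    return sum(probs) / len(probs)
```

The test would then be whether mean_filler_probability, evaluated on filler-free contexts, comes out above filler_base_rate.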
I think this sort of behavior could be coaxed out of an actor-critic model (with hyperparameter tuning, etc.), but not GPT-3. GPT-3 doesn’t have any pressure towards a Bellman-equation-satisfying model, where future reward influences current output probabilities.
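To pin down what I mean by “Bellman-equation-satisfying” (rough notation, just for concreteness): GPT training minimizes the per-token cross-entropy

$$\mathcal{L}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t}),$$

which scores each position only against the next observed token, whereas an actor-critic setup trains a critic toward the Bellman condition

$$V(s_t) \approx \mathbb{E}\!\left[\, r_t + \gamma\, V(s_{t+1}) \,\right],$$

so expected future return directly shapes what the actor outputs now.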
I’m curious if you agree or disagree and what you think I’m missing.
I think we could get a GPT-like model to do this if we inserted other random sequences into the training data in the same way; it should learn a pattern like “non-word-like sequences that repeat at least twice tend to repeat a few more times”, or something like that.
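To make “in the same way” concrete, here is a sketch of that training-data variant (the gibberish generator and the particular insertion scheme are placeholders I'm inventing):

```python
import random
import string

def random_gibberish(rng=random, length=5):
    """A fresh non-word-like string for each insertion."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def insert_varied_runs(sentences, insert_prob=0.01, run_length=(5, 20), rng=random):
    """Same insertion scheme as in your proposal, but every run uses its own
    random string, so the only learnable regularity is the abstract one:
    a non-word-like token that has already repeated tends to keep repeating."""
    out = []
    for sent in sentences:
        words = sent.split()
        if len(words) > 2 and rng.random() < insert_prob:
            filler = random_gibberish(rng)
            pos = rng.randrange(1, len(words) - 1)
            words = words[:pos] + [filler] * rng.randint(*run_length) + words[pos:]
        out.append(" ".join(words))
    return out
```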
GPT-3 itself may or may not get the idea, since it does have some significant breadth of getting-the-idea-of-local-patterns-it's-never-seen-before.
So I don’t currently see what your experiment has to do with the planning-ahead question.
I would say that the GPT training process has no “inherent” pressure toward Bellman-like behavior, but the data provides such pressure, because humans are doing something more Bellman-like when producing strings. A more obvious example would be if you trained a GPT-like system to predict the chess moves of a tree-search planning agent.
I’m curious to dig into your example.