> It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual interaction modalities for LLMs.
I’m saying that the lack of these side-channels implies that GPTs alone will not scale to human-level.
If your system interface is a text channel, and you want the system behind the interface to accept inputs like the prompt above and return correct passwords as an output, then if the system is:
- an auto-regressive GPT directly fed your prompt as input, it will definitely fail
- a human with the ability to act freely in the background before returning an answer, it will probably succeed
- an AutoGPT-style system backed by a current LLM, with the ability to act freely in the background before returning an answer, it will probably fail. (But maybe if your AutoGPT implementation or underlying LLM is a lot stronger, it would work.)
And my point is that the reason the human probably succeeds, and the reason AutoGPT might one day succeed, is precisely that they have more agency than a system that just auto-regressively samples from a language model directly.
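To make the contrast concrete, here is a minimal sketch in Python. Everything in it is hypothetical (`llm_complete` and `run_tool` are stub placeholders, not any particular library's API), but it shows the structural difference between the first and third systems above: the bare GPT has to commit to output tokens immediately, while the AutoGPT-style loop gets to act in the background before answering.

```python
# Toy sketch (hypothetical helper names, not a real API): the same text-channel
# interface backed by two different systems.

def llm_complete(prompt: str) -> str:
    """Stub for a single call to some language model; replace with a real model call."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Stub for executing an action in the outside world (search, run code, try a login, ...)."""
    raise NotImplementedError

def answer_directly(prompt: str) -> str:
    """Auto-regressive GPT fed the prompt as-is: it must commit to output tokens
    right away, with no opportunity to act in the background first."""
    return llm_complete(prompt)

def answer_with_agency(prompt: str, max_steps: int = 10) -> str:
    """AutoGPT-style loop: intermediate actions and observations never appear in the
    text channel; only the final answer does."""
    scratchpad = prompt
    for _ in range(max_steps):
        step = llm_complete(scratchpad + "\nNext action, or FINAL: <answer>")
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        observation = run_tool(step)  # the "free action" a bare GPT never gets
        scratchpad += f"\nAction: {step}\nObservation: {observation}"
    return llm_complete(scratchpad + "\nGive your best final answer:")
```

The second function is just the first call wrapped in permission to act before answering; that extra permission is the agency at issue.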
Or, another way of putting it:
> These are limitations of current LLMs, which are GPTs trained via SGD. But there’s no inherent reason you can’t have a language model which predicts next tokens via shelling out to some more capable and more agentic system (e.g. a human) instead. The result would be a (much slower) system that nevertheless achieves lower loss according to the original loss function.
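For concreteness, here is a toy sketch of that "shelling out" arrangement (hypothetical names, an illustration rather than any real implementation). The predictor below could be backed by anything, including a human free to act in the world before answering, and it is still judged by the original next-token objective:

```python
# Toy illustration (hypothetical names): a "language model" whose next-token prediction
# shells out to a more capable, more agentic oracle, yet is scored with the same
# next-token log loss an ordinary GPT is trained on.
import math
from typing import Callable

def oracle_next_token_probs(context: str) -> dict[str, float]:
    """Stub for the agentic predictor (e.g. a human who may research or experiment
    before answering). Returns a probability for each candidate next token."""
    raise NotImplementedError

def avg_log_loss(tokens: list[str], predict: Callable[[str], dict[str, float]]) -> float:
    """The original LM objective: average negative log-probability of each next token
    given the preceding context. Lower is better, no matter how `predict` works inside."""
    total = 0.0
    for i in range(1, len(tokens)):
        probs = predict(" ".join(tokens[:i]))
        total -= math.log(probs.get(tokens[i], 1e-12))
    return total / (len(tokens) - 1)

# A slow, agentic predictor and a fast, unagentic one are judged by the same number, e.g.:
# avg_log_loss(corpus_tokens, oracle_next_token_probs) vs. avg_log_loss(corpus_tokens, gpt_next_token_probs)
```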