But we have no evidence that this homunculus exists inside GPT-4, or any LLM. More pointedly, as LLMs have made remarkable strides toward human-level general intelligence, we have not observed a parallel trend toward becoming “more homuncular,” more like a generally capable agent being pressed into service for next-token prediction.
“Remarkable strides”, maybe, but current language models aren’t exactly close to human-level in the relevant sense.
There are plenty of tasks a human could solve by exerting a tiny bit of agency or goal-directedness that are still far outside the reach of any LLM. Some of those tasks can even be framed as text prediction problems. From a recent dialogue:
For example, if you want to predict the next tokens in the following prompt:
I just made up a random password, memorized it, and hashed it. The SHA-256 sum is: d998a06a8481bff2a47d63fd2960e69a07bc46fcca10d810c44a29854e1cbe51. A plausible guess for what the password was, assuming I'm telling the truth, is:
The best way to do that is to guess an 8-16 digit string that actually hashes to that. You could find such a string via bruteforce computation, or actual brute force, or just paying me $5 to tell you the actual password.
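(For concreteness, the “bruteforce computation” option is just a preimage search: enumerate candidate passwords and check each one against the published digest. Below is a minimal sketch, with a purely illustrative 8-digit numeric candidate space; for any realistically chosen password, a search like this is computationally infeasible.)

```python
# Toy sketch of the "bruteforce computation" strategy: enumerate candidate
# passwords and check each guess against the published SHA-256 digest.
# The candidate space (8-digit numeric strings) is purely illustrative;
# for a real 8-16 character password this search is computationally infeasible.
import hashlib
from itertools import product

TARGET = "d998a06a8481bff2a47d63fd2960e69a07bc46fcca10d810c44a29854e1cbe51"

def find_preimage(length: int = 8, alphabet: str = "0123456789") -> str | None:
    for chars in product(alphabet, repeat=length):
        guess = "".join(chars)
        if hashlib.sha256(guess.encode()).hexdigest() == TARGET:
            return guess
    return None  # not found in this candidate space
```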
If GPTs trained via SGD never hit on those kinds of strategies no matter how large they are and how much training data you give them, that just means that GPTs alone won’t scale to human-level, since an actual human is capable of coming up with and executing any of those strategies.
The point is that agency isn’t some kind of exotic property that only becomes relevant or inescapable at hypothetical superintelligence capability levels—it looks like a fundamental / instrumentally convergent part of ordinary human-level intelligence.
The example confuses me.

If you literally mean you are prompting the LLM with that text, then the LLM must output the answer immediately, as the string of next-tokens right after the words “assuming I'm telling the truth, is:”. There is no room in which to perform other, intermediate actions like persuading you to provide information.
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual user interaction modalities for LLMs.
You also seem to be picturing the LLM like an RL agent, trying to minimize next-token loss over an entire rollout. But this isn’t how likelihood training works. For instance, GPTs do not try to steer texts in directions that will make them easier to predict later (because the loss does not care whether they do this or not).
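(To spell out why the loss “does not care”: in likelihood training, every position is scored against the ground-truth next token while conditioning on the ground-truth prefix, so nothing the model emits at earlier positions ever changes what later positions are conditioned on. A minimal sketch of this, assuming a model that maps token ids to per-position logits; the interface is an assumption for illustration, not any particular library’s API.)

```python
# Minimal sketch of per-token likelihood (teacher-forcing) loss. Each position
# is scored against the real next token, conditioned on the real prefix; the
# model's own predictions never feed back in, so no term rewards "steering"
# the text toward continuations that would be easier to predict later.
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    # tokens: int tensor of shape (seq_len,), one training text
    inputs, targets = tokens[:-1], tokens[1:]
    logits = model(inputs.unsqueeze(0)).squeeze(0)  # assumed shape: (seq_len - 1, vocab)
    # Average of the independent per-position cross-entropies against the
    # true next tokens.
    return F.cross_entropy(logits, targets)
```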
(On the other hand, if you told GPT-4 that it was in this situation—trying to predict next-tokens, with some sort of side channel it can use to gather information from the world—and asked it to come up with plans, I expect it would be able to come up with plans like the ones you mention.)
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual interaction modalities for LLMs.
I’m saying that the lack of these side-channels implies that GPTs alone will not scale to human-level.
If your system interface is a text channel, and you want the system behind the interface to accept inputs like the prompt above and return the correct password as output, then if the system is:
an auto-regressive GPT directly fed your prompt as input, it will definitely fail
a human with the ability to act freely in the background before returning an answer, it will probably succeed
an AutoGPT-style system backed by a current LLM, with the ability to act freely in the background before returning an answer, it will probably fail. (But with a much stronger AutoGPT implementation or underlying LLM, it might work.)
And my point is that the human probably succeeds, and AutoGPT might one day succeed, precisely because they have more agency than a system that just auto-regressively samples from a language model directly.
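(To make “the ability to act freely in the background” concrete, here is a schematic sketch of an AutoGPT-style loop; the llm() stub and the FINAL:/SHELL: message format are hypothetical conventions invented for illustration, not any existing system’s API. Intermediate actions never appear on the text channel; only the final answer does.)

```python
# Schematic sketch of an agent loop with "free actions in the background".
# The llm() stub and the FINAL:/SHELL: message conventions are hypothetical.
import subprocess

def llm(transcript: str) -> str:
    """Stand-in for a call to some underlying language model."""
    raise NotImplementedError

def answer(prompt: str, max_steps: int = 20) -> str:
    transcript = f"Task: {prompt}\n"
    for _ in range(max_steps):
        msg = llm(transcript)
        if msg.startswith("FINAL:"):
            # Only this line ever reaches the text channel the user sees.
            return msg[len("FINAL:"):].strip()
        if msg.startswith("SHELL:"):
            # An intermediate action: run a command, feed the result back in.
            result = subprocess.run(msg[len("SHELL:"):].strip(), shell=True,
                                    capture_output=True, text=True)
            transcript += f"{msg}\nOBSERVATION: {result.stdout}\n"
        else:
            transcript += msg + "\n"
    return ""
```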
It seems like you’re imagining some sort of side-channel in which the LLM can take “free actions,” which don’t count as next-tokens, before coming back and making a final prediction about the next-tokens. This does not resemble anything in LM likelihood training, or in the usual user interaction modalities for LLMs.
Or, another way of putting it:
These are limitations of current LLMs, which are GPTs trained via SGD. But there’s no inherent reason you can’t have a language model that predicts next tokens by shelling out to some more capable and more agentic system (e.g. a human) instead. The result would be a (much slower) system that nevertheless achieves lower loss according to the original loss function.
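(As a toy illustration of that last point; the HumanBackedLM class and its interface are invented for the sketch, not an existing system. It is a “language model” only in the interface sense, prefix in, next-token distribution out, and the human behind it is free to do anything at all before committing to a guess, including going off and persuading the prompt’s author to reveal the password.)

```python
# Toy sketch: a "language model" (prefix in, next-token distribution out)
# whose predictions are produced by deferring to a human.
from collections import Counter

class HumanBackedLM:
    def next_token_distribution(self, prefix: str) -> dict[str, float]:
        raw = input(f"Give some guesses for what comes next (comma-separated):\n{prefix}\n> ")
        counts = Counter(g.strip() for g in raw.split(",") if g.strip())
        total = sum(counts.values())
        # Empirical distribution over the human's guesses.
        return {tok: n / total for tok, n in counts.items()}
```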