Seth Herd comments on Agentized LLMs will change the alignment landscape

Seth Herd 9 Apr 2023 18:12 UTC
6 points
1
Isn’t there a pretty big difference between “doesn’t work after a couple weeks” and “doesn’t work?” Time will tell, but I really think this general approach is going to go far. Just for instance, if a setup gets into loops, how about asking GPT “does this list of recent goals look repetitive to you?” and conditioning its next steps on that. In general, having a sub-sentient linguistic engine available makes it really easy to make loops that do sensible things. It will still take some engineering, though, so little wonder it’s not super effective right now.
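As a rough illustration of the loop check suggested above, here is a minimal sketch. It assumes the model is reachable through some `ask_llm(prompt) -> str` callable; that callable, the function names, and the fallback instruction are all hypothetical and are not part of Auto-GPT or any real API. A stand-in `fake_llm` replaces the real model so the sketch runs offline.

```python
from typing import Callable, List


def looks_repetitive(recent_goals: List[str], ask_llm: Callable[[str], str]) -> bool:
    """Ask the model whether the recent goals look repetitive (YES/NO answer)."""
    numbered = "\n".join(f"{i}. {g}" for i, g in enumerate(recent_goals, start=1))
    prompt = (
        "Does this list of recent goals look repetitive to you? "
        "Answer YES or NO.\n\n" + numbered
    )
    answer = ask_llm(prompt)
    return answer.strip().upper().startswith("YES")


def choose_next_step(
    recent_goals: List[str], proposed_goal: str, ask_llm: Callable[[str], str]
) -> str:
    """Condition the next step on the model's repetition verdict."""
    if looks_repetitive(recent_goals, ask_llm):
        # Hypothetical fallback: nudge the agent out of the loop.
        return "Step back and try a different approach to the overall task."
    return proposed_goal


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: says YES if any numbered goal is duplicated."""
    goals = [ln.split(". ", 1)[1] for ln in prompt.splitlines() if ln[:1].isdigit()]
    return "YES" if len(set(goals)) < len(goals) else "NO"


if __name__ == "__main__":
    history = ["search docs", "search docs", "search docs"]
    print(choose_next_step(history, "search docs", fake_llm))
```

In a real wrapper, `ask_llm` would call the actual model, and the YES/NO parsing would likely need to be more forgiving than a prefix check.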
- Thane Ruthenis 9 Apr 2023 18:34 UTC
  4 points
  −1
  I mean, I’m sure some tinkering will improve this thing’s capabilities somewhat, and the range of uses. I just don’t think it’d create any qualitatively novel capabilities, like making it an actual AGI; or that this sort of tinkering will play a pivotal part in the extinction event.
  In my view, if an AI model has “transformative” capabilities, it’d probably either show them out of the box, or given just a minimal self-prompt wrapper. On the other hand, if it doesn’t have them, no simple wrapper is going to make it transformative.
  I guess a particularly clever wrapper can make it have “indirect” upturning effects, like making it really easy for anyone to write complex software or something. But that’s not quite the class of outcomes I see spiraling into an omnicide.
  - Seth Herd 9 Apr 2023 23:27 UTC
    3 points
    1
    I agree that improving AutoGPT won’t turn GPT-4 into human-level, omnicide-capable AGI.
    
    But what about those improvements also running GPT-6?
    
    I still see more reason for rejoicing than for fear. I really hope this technique IS super successful. Because then you have an AI with goals stated in intuitive natural language, and that thinks in natural language for easy monitoring by a simple sentiment/deception/“evil”-detecting AI monitoring all of its many thoughts (and there will be many). This makes initial alignment and corrigibility trivially easy, so most people will actually do them. And it makes interpretability so vastly much easier.
    - Roger Dearnaley 22 Apr 2023 4:44 UTC
      3 points
      0
      I agree that an Auto-GPT-style agent using self-editing context is likely to communicate in a natural language. Probably English, as the language GPT-4’s training data contained the most of. A future Auto-GPT-N attempting to avoid monitoring of its stream-of-consciousness might start speaking a more obscure language, or doing a simple encryption like Base64 encoding: it’s not hard to get GPT-3.5 to take input and give output in Base64, but it’s enough extra cognitive load per token, and/or messes up the token embeddings enough, that the resulting very-lightly-encrypted LLM is fairly brain damaged. So this is an example of a ‘misalignment tax’, where attempting to be deceptive imposes a significant additional cost.
      Of course, Base64 isn’t a cryptographically strong encryption scheme. Even a very smart future Auto-GPT-N running on GPT-N couldn’t use real encryption in its context stream without somewhere to put the key (or some riddle or set of trivia questions whose answer was the key, or whatever). However, if smart enough, it might be able to do something like steganographically concealing a side-channel message in an innocent-looking English text, say as some form of lightly-encrypted/compressed acrostic. But it would have to be smart enough that, every time GPT-N was run against the text to generate another token, it spotted the acrostic, figured out how to decrypt it, paid attention to the side-channel message rather than the main one, and played along and continued this exercise. Not theoretically impossible for something enough smarter than a human, but the misalignment tax gets a good deal bigger, and it’s not obvious why something trained on imitating text generated by humans would learn this ability, except through inadvertent fine-tuning on text contaminated with such embedded acrostic steganographic side-channels. Obviously, an aligned, comparably smart Auto-GPT-N used to monitor the misaligned Auto-GPT-N, and not paying the misalignment overhead of having to steganographically conceal its thoughts, shouldn’t find it too hard to catch on to what’s happening here, even if a human couldn’t do it. If nothing else, the pressure of having to fulfill the acrostic requirements for the side-channel is going to make it hard to make the innocent-looking text plausible, presumably raising its perplexity when analyzed by an LLM that hasn’t spotted the side-channel.
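To make the comment's two concealment examples concrete, here is a small sketch using only the Python standard library. It shows that Base64 is a keyless, trivially reversible encoding that a monitor can simply try to decode, and that a naive first-letter-per-line acrostic can be read off mechanically. The function names and sample strings are hypothetical, and the acrostic here is unencrypted, which is much simpler than the lightly-encrypted scheme the comment imagines. The perplexity check mentioned at the end of the comment is not shown.

```python
import base64
import binascii
from typing import Optional


def try_base64_decode(text: str) -> Optional[str]:
    """Return the decoded string if `text` is valid Base64 of UTF-8 text, else None."""
    try:
        raw = base64.b64decode(text.strip(), validate=True)
        return raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


def acrostic(text: str) -> str:
    """Read the first character of every non-empty line."""
    return "".join(line.strip()[0] for line in text.splitlines() if line.strip())


if __name__ == "__main__":
    # No key is needed to reverse Base64: anyone monitoring the stream can decode it.
    hidden = base64.b64encode("ignore the monitor".encode("utf-8")).decode("ascii")
    print(hidden, "->", try_base64_decode(hidden))

    cover_text = (
        "Here is the plan.\n"
        "Every step is logged.\n"
        "Let me start.\n"
        "Please stand by."
    )
    print(acrostic(cover_text))
```

A monitor could run checks like these over every message in the agent's context stream; catching subtler schemes would need something closer to the LLM-based monitoring the comment describes.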
    - Thane Ruthenis 10 Apr 2023 9:02 UTC
      3 points
      0
      > But what about those improvements also running GPT-6?
      Same reasoning: I expect that GPT-N will be omnicide-capable out of the box / with a minimal self-prompting wrapper, else not at all. For any given AI model, a marginally better wrapper isn’t going to tide it over to transformative AI. Thus, if a new model is released, and the first dumb “let’s give it agency!” idea doesn’t work, we can probably relax about that specific model entirely. (This is mainly in opposition to your original claim that Auto-GPT can “fan the sparks of AGI in GPT-4 into a flame”.)
      On a larger scale, if progressively larger and more capable models based on a given architecture keep not taking off when put in a self-prompt wrapper, and they keep failing in the exact same way, that’s probably evidence that the entire architecture is safe. And I think GPT-4 is failing in the same way GPT-3 or GPT-2 would’ve.
      Not to say that I am, at this point, utterly confident that GPT-N isn’t going to take off; I’m not. But inasmuch as Auto-GPT’s performance is evidence for or against that, I think it’s evidence against.
      > Because then you have an AI with goals stated in intuitive natural language
      Yeah, that’s… part of the reason I don’t expect this to work. I don’t think any text output should be viewed as the LLM’s “thoughts”. Whatever thoughts it has happen inside forward passes, and I don’t think it natively maps them into the human-legible monologues in which the wider Auto-GPT “thinks”. I think there’s a fundamental disconnect between the two kinds of “cognition”, and the latter type is much weaker.
      If GPT-N were AGI, it would’ve recognized the opportunity offered by the self-wrapper, and applied optimization from its end, figured out how to map from native-thoughts into language-thoughts, and thereby made even the dumbest wrapper work. But it didn’t do that, and I don’t think any improvement to the wrapper is going to make it do that, because it fundamentally can’t even try to figure out how. The problem is on its end, within the frozen parameters, in the lacking internal architecture. Its mental ontology doesn’t have the structures for even conceiving of performing this kind of operation, and it’s non-AGI so it can’t invent the idea on the fly.
      (Man, I’m going to eat so much crow when some insultingly dumb idea on the order of “let’s think step-by-step!” gets added to the Auto-GPT wrapper next month and it takes off hard and kills everyone.)
      - Seth Herd 10 Apr 2023 13:03 UTC
        5 points
        0
        Interesting. Thanks for your thoughts. I think this difference of opinion shows me where I’m not fully explaining my thinking. It also points to some differences between human thinking and LLM “thinking”. In humans, the serial nature of linking thoughts together is absolutely vital to our intelligence. But LLMs have a lot more seriality in the production of each utterance.
        
        I think I need to write another post that goes much further into my reasoning here to work this out. Thanks for the conversation.
        Thane Ruthenis 10 Apr 2023 13:20 UTC
        2 points
        0
        Glad it was productive!
        > I think this difference of opinion shows me where I’m not fully explaining my thinking
        I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I’m fairly confident in, but which haven’t actually propagated into the set of commonly-assumed background assumptions.
        beren 10 Apr 2023 20:51 UTC
        5 points
        1
        > I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I’m fairly confident in, but which haven’t actually propagated into the set of commonly-assumed background assumptions.
        I have found this conversation very interesting. I would be very interested if you could do a quick summary or writeup of the background conclusions you are referring to. I have my own thoughts about the feasibility of massive agency gains from AutoGPT-like wrappers, but would be interested to hear your thoughts.
        Thane Ruthenis 3 May 2023 15:23 UTC
        3 points
        0
        Here’s the future post I was referring to!
        Seth Herd 4 May 2023 17:31 UTC
        3 points
        0
        I saw it. I really like it. Despite my relative enthusiasm for LMCA alignment, I think the points you raise there mean it’s still quite a challenge to get it right enough to survive.
        
        I’ll try to give you a substantive response on that post today.
        Thane Ruthenis 10 Apr 2023 21:02 UTC
        3 points
        0
        I may make a post about it soon. I’ll respond to this comment with a link or a summary later on.