Funny, Auto-GPT stuff actually makes me less worried about GPT-4 and its scale-ups. It’s been out for weeks, less impressive variants were out for longer, and so far, nothing much has come from it. Looking at the ChaosGPT video… I would’ve predicted that it wasn’t actually any good at pursuing its goal, that it just meandered around the “kill everyone” objective without ever meaningfully progressing — and lo and behold, it’s doing exactly that. Actually, it’s worse at it than I’d expected.
I see the case for doom, I do! It’s conceivable that it will turn out in this manner. We’re witnessing an AI planning here, and it’s laughably bad planning so far, but the mere fact that they can do it at all implies a readily-available possibility of making them massively better at it. So in, e.g., five more years, we’d get AIs whose planning skills are to ChaosGPT’s as Midjourney is to PixelCNN, and then maybe one of them FOOMs.
But mostly, I agree with this view. And this is an instance of the “wire together GPT models to get an AGI” attempt failing, and in my view it’s failing in a way that provides some evidence that this entire approach won’t work. It’s conceivable that it’d work with GPT≥5, or with a more clever setup… But this is an update away from that. (Though I’m not as optimistic about the next architectural advance requiring 10-15 years; it may be just around the corner. But scaled-up LLMs haven’t really impressed me since GPT-3.)
I’d be wary of generalizing too much from Auto-GPT. It’s in a weird place. It’s super popular as a meme anyone can run—you don’t have to be a programmer! But skimming the GitHub repo, the vast majority of people are getting hung up on fiddly technical and programming bits. And the people who wouldn’t get hung up on that stuff don’t really get much out of Auto-GPT. There’s some overlap—it’s a very entertaining idea and a very entertaining thing to watch, the appeal of it being hands-off. I personally watched it like a TV show for hours, and it going off the rails was part of the fun.
Like, I’m no expert, I just got way too addicted to goofing around with LLMs, and the way Auto-GPT is trying to make this work seems obviously flawed to me. Not the software quality—I don’t know much about that—but the main idea and the structure of the interacting prompts seem like clearly not the way to go. I don’t know the right way, but it’s not that.
Even more so for ChaosGPT, where the author (to me) looks like somebody trying to maximize entertainment rather than build a working product.
That said, Auto-GPT is actually getting better quickly. AI time moves fast. And it’s so popular that a lot of people are tinkering with it and there are a lot of eyes on it. So it might actually do something like the original concept eventually. But I would bet something completely different (specifically, a project that isn’t trying to be a plug-and-play solution anyone can run on their own computer) is where the most capable solutions will come from.
In my view, if something like Auto-GPT can work, its ability to work is probably not too sensitive to the exact implementation of the wrapper. If GPT-4 has the raw capability to orient itself to reality and navigate it, it should be able to do that with even bare-bones self-prompt/prompted self-reflection ability. Something like Auto-GPT should be more than enough. So the failure is suggestive; it’s evidence about this whole landscape of approaches.
I agree that it’s possible that more nuanced wrapper designs would work, but I don’t place much probability on that.
I’m not at all confident Auto-GPT could achieve its goals; my point is just that, in narrower domains, the specific system or arrangement of prompt interactions matters. To give a specific example, I goof around trying to get good longform D&D games out of ChatGPT. (Even GPT-2 fine-tuned on Crit Role transcripts, originally.) Some implementations just work way better than others.
The trivial system is no system—just play D&D. Works great until it feels like the DM is the main character in Memento. The trivial next step: a rolling context window. The conversation fills up, ask for a summary, start a new conversation with the summary. Just that is a lot better. But you really feel the loss of detail in the sudden jump, so why not make it continuous? A secretary GPT with one job: prune the DM GPT’s conversation text after every question and answer, always trying to keep the most important and the most recent. Smoother than the summary system. Maybe the secretary can keep some details instead of just deleting them, maybe use half its tokens for a permanent game-state. Then it can edit useful details in and out of the conversation history. Can the secretary write a text file for old conversations? Etc. etc.
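To make the secretary idea concrete, here’s a rough sketch of the shape of it, not actual code from any of my setups: `dm` and `secretary` are placeholders for whatever chat-completion calls you’re using, and the prompt wording and token budget are made up.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def secretary_prune(history: List[Message],
                    secretary: Callable[[List[Message]], str],
                    budget_tokens: int = 3000) -> List[Message]:
    """If the transcript is over budget, ask the secretary model to rewrite it,
    keeping the most important and most recent details, then keep the last
    exchange verbatim so the DM doesn't lose the immediate scene."""
    used = sum(approx_tokens(m["content"]) for m in history)
    if used <= budget_tokens:
        return history
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    pruned = secretary([
        {"role": "system",
         "content": ("You maintain the log of a D&D game. Rewrite the transcript "
                     "below in roughly half the space. Keep the current scene, "
                     "named characters, and open plot threads; compress or drop "
                     "everything else.")},
        {"role": "user", "content": transcript},
    ])
    return [{"role": "system", "content": "Game so far: " + pruned}] + history[-2:]


def play_turn(history: List[Message], player_input: str,
              dm: Callable[[List[Message]], str],
              secretary: Callable[[List[Message]], str]) -> str:
    history.append({"role": "user", "content": player_input})
    reply = dm(history)  # the DM model answers in character
    history.append({"role": "assistant", "content": reply})
    history[:] = secretary_prune(history, secretary)  # prune after every exchange
    return reply
```

The design point is just that the pruning happens continuously, after every exchange, rather than in one big lossy summarization jump.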
Maybe the difference is that the user plays the D&D themselves, so you know immediately when it’s not working well. It’s usually obvious in minutes. Auto-GPT is supposed to be automatic. So they add features and just kind of hope the AI figures it out from there. They don’t get the immediate “this is not working at all” feedback. Like they added embeddings 5 days ago—it just prints the words “Permanent memory:” in the prompt, followed by giant blobs of up to 2500 tokens of the most related text from Pinecone. Works great for chatbots answering a single question about technical documentation. Real easy to imagine how it could fall apart when done iteratively over longer time periods. I can’t imagine this would work for a D&D game; it might be worse than having no memory. My gut feeling is you pull the 2500 most related tokens of content into your prompt and the system is overall more erratic. You get the wrong 2500 tokens, it overwhelms whatever the original prompt was, and now what is your agent up to? Just checked now, and it changed to “This reminds you of these events from your past:”. That might actually make it somewhat less likely to blow up. Basically making the context of the text more clear: “These are old events and thoughts, and you are reminded of them; don’t take this text too seriously; this text might not even be relevant, so maybe you should even ignore it. It’s just some stuff that came to mind, that’s how memories work sometimes.”
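For reference, the retrieval-memory pattern being described is roughly this (a hedged sketch, not Auto-GPT’s actual implementation; `embed` stands in for any embedding API, and the character cap is a loose stand-in for the ~2500-token limit):

```python
import math
from typing import Callable, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class MemoryStore:
    """Store past events as (embedding, text) pairs and recall the most
    similar ones, framed softly as possibly-irrelevant reminders."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed
        self.items: List[Tuple[List[float], str]] = []

    def add(self, text: str) -> None:
        self.items.append((self.embed(text), text))

    def recall(self, query: str, max_chars: int = 10_000) -> str:
        if not self.items:
            return ""
        query_vec = self.embed(query)
        ranked = sorted(self.items,
                        key=lambda item: cosine(query_vec, item[0]),
                        reverse=True)
        picked, used = [], 0
        for _, text in ranked:
            if used + len(text) > max_chars:
                break
            picked.append(text)
            used += len(text)
        if not picked:
            return ""
        # Soft framing: these are reminders, not instructions, so a bad
        # retrieval is less likely to hijack the original prompt.
        return ("This reminds you of these events from your past:\n"
                + "\n".join(f"- {t}" for t in picked))
```

The only real knob here is the framing line at the end, which is exactly the change I’m pointing at: same retrieved text, different instructions about how seriously to take it.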
If GPT-4 has the raw capability to orient itself to reality and navigate it, it should be able to do that with even bare-bones self-prompt/prompted self-reflection ability.
GPT-4 by itself can’t learn: it can’t improve its intuitions and skills in response to new facts about the situations its instances find themselves in (facts that don’t fit in its context). So the details of how the prosthetics that compensate for that are implemented (or are guided in how they are to be implemented) could well be crucial.
Also, at some point there will be open-sourced, pre-trained, RLAIF-tuned models of sufficient scale that allow fine-tuning, and so can improve their intuitions. At that point, running them inside an improved Auto-GPT successor might be more effective than starting the process from scratch, lowering the minimum necessary scale of the pre-trained foundation model.
Which increases the chances that the first AGIs are less intelligent than they would otherwise need to be. Which is bad for their ability to do better than humans at not building intentionally misaligned AGIs the first chance they get.
It’s also quite likely that something like Auto-GPT would work a lot better using a version of the LLM that had been fine-tuned/reinforcement-trained for this specific use case—just as ChatGPT is a lot more effective as a chatbot than the underlying GPT-3 model was before the specialized training. If the LLM is optimized for the wrapper, and the wrapper is designed to make efficient use of the entire context size of the LLM, things are going to work a lot better.
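To make “optimized for the wrapper” concrete, one simple route is to log the wrapper’s (observation, action) steps and fine-tune on the ones that worked. A minimal sketch, with an entirely made-up trace schema and a generic chat-style JSONL output format:

```python
import json
from typing import Dict, Iterable


def traces_to_finetune_jsonl(traces: Iterable[Dict], path: str,
                             system_prompt: str = "You are an autonomous task agent.") -> None:
    """Write successful (observation, action) steps from scaffolded runs out as
    chat-format JSONL examples for supervised fine-tuning.

    Each trace is assumed (hypothetically) to look like:
        {"observation": "...", "action": "...", "succeeded": True/False}
    """
    with open(path, "w", encoding="utf-8") as f:
        for trace in traces:
            if not trace.get("succeeded"):
                continue  # only imitate steps that actually worked
            example = {"messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": trace["observation"]},
                {"role": "assistant", "content": trace["action"]},
            ]}
            f.write(json.dumps(example) + "\n")
```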
7 months later, we now know that this is true. Also, we now know that you can take output from a prompted/scaffolded LLM and use it to fine-tune another LLM to do the same things without needing the prompt/scaffold.
Could you please point out the work you have in mind here?
Isn’t there a pretty big difference between “doesn’t work after a couple of weeks” and “doesn’t work”? Time will tell, but I really think this general approach is going to go far. Just for instance, if a setup gets into loops, how about asking GPT “does this list of recent goals look repetitive to you?” and conditioning its next steps on that? In general, having a sub-sentient linguistic engine available makes it really easy to make loops that do sensible things. It will still take some engineering, though, so it’s little wonder it’s not super effective right now.
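That kind of loop check is only a few lines to sketch; here `ask_llm` is a placeholder for any chat call, and the prompt wording and state fields are made up for illustration:

```python
from typing import Callable, Dict, List


def looks_repetitive(recent_goals: List[str],
                     ask_llm: Callable[[str], str]) -> bool:
    """Show the model its own recent goals and ask if it's going in circles."""
    prompt = ("Does this list of recent goals look repetitive, as if the agent "
              "is stuck in a loop? Answer YES or NO.\n\n"
              + "\n".join(f"{i + 1}. {goal}" for i, goal in enumerate(recent_goals)))
    return ask_llm(prompt).strip().upper().startswith("YES")


def plan_next_step(state: Dict, ask_llm: Callable[[str], str]) -> str:
    prefix = ""
    if looks_repetitive(state["recent_goals"][-10:], ask_llm):
        # Condition the next step on the check: force a change of approach.
        prefix = ("Your recent goals were repetitive. Choose a different "
                  "approach, or abandon the current sub-goal.\n\n")
    return ask_llm(prefix + state["planning_prompt"])
```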
I mean, I’m sure some tinkering will improve this thing’s capabilities somewhat, and the range of uses. I just don’t think it’d create any qualitatively novel capabilities, like making it an actual AGI; or that this sort of tinkering will play a pivotal part in the extinction event.
In my view, if an AI model has “transformative” capabilities, it’d probably either show them out of the box, or when given just a minimal self-prompt wrapper. On the other hand, if it doesn’t have them, no simple wrapper is going to make it transformative.
I guess a particularly clever wrapper can make it have “indirect” upturning effects, like making it really easy for anyone to write complex software or something. But that’s not quite the class of outcomes I see spiraling into an omnicide.
I agree that improving AutoGPT won’t turn GPT-4 into human-level, omnicide-capable AGI.
But what about those improvements also running GPT-6?
I still see more reason for rejoicing than for fear. I really hope this technique IS super successful. Because then you have an AI with goals stated in intuitive natural language, and that thinks in natural language, for easy monitoring by a simple sentiment/deception/“evil”-detecting AI watching all of its many thoughts (and there will be many). This makes initial alignment and corrigibility trivially easy, so most people will actually do them. And it makes interpretability so vastly easier.
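As a sketch of what that monitoring could look like inside the agent’s loop (everything here is a placeholder: the classifier, the threshold, and the escalation step are assumptions, not an existing tool):

```python
from typing import Callable


def screen_thought(thought: str,
                   score_badness: Callable[[str], float],
                   threshold: float = 0.8) -> bool:
    """Return True if the thought may be acted on; False means block it.
    `score_badness` stands in for any classifier scoring deception/harm in [0, 1]."""
    return score_badness(thought) < threshold


def run_step(thought: str, act: Callable[[str], None],
             score_badness: Callable[[str], float]) -> None:
    if screen_thought(thought, score_badness):
        act(thought)  # only flagged-clean thoughts get executed
    else:
        print("Blocked a flagged thought; escalating for human review:", thought)
```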
I agree that an Auto-GPT-style agent using self-editing context is likely to communicate in a natural language, probably English, as the language GPT-4’s training data contained the most of. A future Auto-GPT-N attempting to avoid monitoring of its stream-of-consciousness might start speaking a more obscure language, or using a simple encoding like Base64: it’s not hard to get GPT-3.5 to take input and give output in Base64, but it’s enough extra cognitive load per token, and/or messes up the token embeddings enough, that the resulting very-lightly-encrypted LLM is fairly brain-damaged. So this is an example of a ‘misalignment tax’, where attempting to be deceptive imposes a significant additional cost.
Of course, Base64 isn’t a cryptographically strong encryption scheme, and even a very smart future AutoGPT-N running on GPT-N couldn’t use real encryption in its context stream without somewhere to put the key (or some riddle or set of trivia questions whose answer was the key, or whatever). However, if smart enough, it might be able to do something like steganographically concealing a side-channel message in an innocent-looking English text, say as some form of lightly-encrypted/compressed acrostic. But it would have to be smart enough that, every time it ran GPT-N against the text to generate another token, it spotted the acrostic, figured out how to decrypt it, paid attention to the side-channel message rather than the main one, and played along, continuing the exercise. Not theoretically impossible for something enough smarter than a human, but the misalignment tax gets a good deal bigger, and it’s not obvious why something trained on imitating text generated by humans would learn this ability, except through inadvertent fine-tuning on text contaminated with such acrostic steganographic side-channels. Obviously, an aligned, comparably smart AutoGPT-N used to monitor the misaligned Auto-GPT-N, one not paying the misalignment tax of having to steganographically conceal its thoughts, shouldn’t find it too hard to catch on to what’s happening here, even if a human couldn’t do it. If nothing else, the pressure of having to fulfill the acrostic requirements for the side-channel is going to make it hard to keep the innocent-looking text plausible, presumably raising its perplexity when analyzed by an LLM that hasn’t spotted the side-channel.
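The perplexity heuristic at the end might look something like this as a sketch; the log-prob source, the baseline, and the threshold are all assumptions rather than a worked-out detector:

```python
import math
from typing import Callable, List


def perplexity(text: str,
               token_logprobs: Callable[[str], List[float]]) -> float:
    """Perplexity under a reference LLM; `token_logprobs` is a placeholder for
    any model/API that returns per-token log-probabilities for the text."""
    logprobs = token_logprobs(text)
    return math.exp(-sum(logprobs) / max(1, len(logprobs)))


def flag_possible_side_channel(text: str,
                               token_logprobs: Callable[[str], List[float]],
                               baseline_ppl: float,
                               ratio: float = 1.5) -> bool:
    """Flag output whose perplexity is well above the agent's usual baseline,
    as might happen if it is contorting its text to carry a hidden acrostic."""
    return perplexity(text, token_logprobs) > ratio * baseline_ppl
```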
But what about those improvements also running GPT-6?
Same reasoning: I expect that GPT-N will be omnicide-capable out of the box / with a minimal self-prompting wrapper, or not at all. For any given AI model, a marginally better wrapper isn’t going to tip it over into transformative AI. Thus, if a new model is released, and the first dumb “let’s give it agency!” idea doesn’t work, we can probably relax about that specific model entirely. (This is mainly in opposition to your original claim that Auto-GPT can “fan the sparks of AGI in GPT-4 into a flame”.)
On a larger scale, if progressively larger and more capable models based on a given architecture keep not taking off when put in a self-prompt wrapper, and they keep failing in the exact same way, that’s probably evidence that the entire architecture is safe. And I think GPT-4 is failing in the same way GPT-3 or GPT-2 would’ve.
Not to say that I am, at this point, utterly confident that GPT-N isn’t going to take off; I’m not. But inasmuch as Auto-GPT’s performance is evidence for or against that, I think it’s evidence against.
Because then you have an AI with goals stated in intuitive natural language
Yeah, that’s… part of the reason I don’t expect this to work. I don’t think any text output should be viewed as the LLM’s “thoughts”. Whatever thoughts it has happen inside forward passes, and I don’t think it natively maps them into the human-legible monologues in which the wider Auto-GPT “thinks”. I think there’s a fundamental disconnect between the two kinds of “cognition”, and the latter type is much weaker.
If GPT-N were AGI, it would’ve recognized the opportunity offered by the self-wrapper, applied optimization from its end, figured out how to map from native-thoughts into language-thoughts, and thereby made even the dumbest wrapper work. But it didn’t do that, and I don’t think any improvement to the wrapper is going to make it do that, because it fundamentally can’t even try to figure out how. The problem is on its end, within the frozen parameters, in the lacking internal architecture. Its mental ontology doesn’t have the structures for even conceiving of performing this kind of operation, and it’s non-AGI so it can’t invent the idea on the fly.
(Man, I’m going to eat so much crow when some insultingly dumb idea on the order of “let’s think step-by-step!” gets added to the Auto-GPT wrapper next month and it takes off hard and kills everyone.)
Interesting. Thanks for your thoughts. I think this difference of opinion shows me where I’m not fully explaining my thinking, and where there are some differences between human thinking and LLM “thinking”. In humans, the serial nature of linking thoughts together is absolutely vital to our intelligence. But LLMs have a lot more seriality in the production of each utterance.
I think I need to write another post that goes much further into my reasoning here to work this out. Thanks for the conversation.
Glad it was productive!
I think this difference of opinion shows me where I’m not fully explaining my thinking
I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I’m fairly confident in, but which haven’t actually propagated into the set of commonly-assumed background assumptions.
I have found this conversation very interesting. I’d be very interested if you could do a quick summary or writeup of the background conclusions you’re referring to. I have my own thoughts about the feasibility of massive agency gains from AutoGPT-like wrappers, but would be interested to hear yours.
I may make a post about it soon. I’ll respond to this comment with a link or a summary later on.
Here’s the future post I was referring to!
I saw it. I really like it. Despite my relative enthusiasm for LMCA alignment, I think the points you raise there mean it’s still quite a challenge to get it right enough to survive.
I’ll try to give you a substantive response on that post today.