Quick meta comment to express that I’m uncertain whether posting things as lists of 10 is a good direction. The advantages might be real: easy to post, quick feedback, easy interaction, etc.
But the main disadvantage is that this comparatively drowns out other, better posts (with more thought and value in them). I’m unsure whether the content of the post was also importantly missing from the conversation (for many readers) and that’s why this got upvoted so fast, or whether it’s largely the format… Even if this post isn’t bad (and I’d argue it is, for the suggestions it promotes), this is an early warning of a possible trend where people with less thought-out takes quickly post highly accessible content, get comparatively more upvotes than they should, and it becomes harder to find good content.
(Additional disclosure: some of my bad taste for this post comes from the fact that its call to break all encryption is being cited on Twitter as representative of the alignment community. I’d have liked to answer that it’s obviously not, but the post got many upvotes! This makes my meta point also seem to be motivated by PR/optics, which is why it felt necessary to disclose, but let’s mostly focus on consequences inside the community.)
Well, I thought this was a really new and important set of ideas that I’d thought out plenty well enough to start a conversation. Like it says in the intro. I thought that’s why it was getting upvotes. Thanks for correcting me.
As for the encryption-breaking suggestion showing up on Twitter and making the alignment community look bad: that I regret. I hadn’t seen this since I avoid Twitter like the plague, but I realize that it’s a cauldron for forming public opinion. The PR battle is real, and it could be critical. In the future I’ll keep in mind that little pieces will be taken out of context and amplified to ill effect, and just leave the most controversial ideas out when they’re not critical to the main point, as that one wasn’t.
To avoid being misinterpreted: I didn’t say I’m sure it’s more the format than the content that’s causing the upvotes (open question), nor that this post doesn’t meet the absolute quality bar that normally warrants 100+ upvotes (to each reader their opinion).
If you’re open to discussing this at the object level, I can point to concrete disagreements with the content. Most importantly, this should not be seen as a paradigm shift, because it does not invalidate any of the previous threat models; it would only be one if it made it impossible to build AGI any other way. I also don’t think this should “change the alignment landscape”, because it’s just another part of it, one which was known and has been worked on for years (Anthropic and OpenAI have been “aligning” LLMs, and I’d bet 10:1 that they anticipated these would be used to build agents, like most people I know in alignment did).
To clarify, I do think it’s really important and great that people work on this, and that chronologically this will be the first x-risk stuff we see. But we could solve the GPT-agent problem and still die to unaligned AGI three months afterwards. The fact that the world trajectory we’re on is throwing additional problems into the mix (keeping the world safe from short-term misuse and unaligned GPT-agents) doesn’t make the existing ones simpler. There is still pressure to build autonomous AGI, there might still be mesa-optimizers, there might still be deception, etc. We need the manpower to work on all of these, and not “shift the alignment landscape” to just focus on the short-term risks.
I’d recommend not worrying much about PR risk; just ask the direct question: even if this post is only ever read by LW folk, does the “break all encryption” suggestion add to the conversation? Causing people to take time to debunk certain suggestions isn’t productive even without the context of PR risk.
Overall I’d like some feedback on my tone, whether it’s too direct/aggressive to you or it’s fine. I can adapt.
I think your concern with the list of ten format is totally reasonable. I do feel that the tone was a little too bombastic. I personally felt that was mostly justified, but hey, I’m highly biased. I haven’t noticed a flood of top ten lists getting lots of upvotes, though. And I did find that format to be incredibly helpful in getting a post out in reasonable time. I usually struggle and obsess.
I would love to engage on the object level on this particular one. I’m going to resist doing that here because I want to produce a better version of this post, and engage there. I said that this doesn’t solve all of the problems with alignment, including by citing my post on the alignment stability problem. The reason I thought this was so important is that it’s a huge benefit if, at least in the short term, the easiest way to make more capable AI is also the easiest to align.
Thank you for noticing my defensive and snarky response, and asking for feedback on your tone. I did take your comment to be harsh. I had the thought “oh no, when you make a post that people actually comment on, some of them are going to be mean. This community isn’t as nice as I’d hoped”.
I don’t think I misinterpreted your comment very badly. You said:
Even if this post isn’t bad (and I’d argue it is for the suggestions it promotes),
But that suggestion was not in this post! I suggested it in a comment, and it was in the alignment stability problem post. Maybe that’s why that post got only ten upvotes on LessWrong, and actually had slightly more on the Alignment Forum where I posted it. And my comment here was roundly criticized and downvoted. You can tell that to the mob of Twits if you want.
You also firmly implied that it was being upvoted for the format and not the ideas. You made that comment while misunderstanding why I was arguing it was important (maybe you read it quickly since it was irritating you, or maybe I didn’t write it clearly enough).
You said in your comment that the post irritated you because of its external consequences, but it seems you didn’t adequately compensate for that bias before writing and posting it. Which is fine, we all have feelings and none of us are even close to perfect rationalists.
Including me. I’m getting snarky again, and I shouldn’t since I really don’t think your tone was that bad. But I do wish it was more pleasant. I’m not the tone police, but I do really really value this community’s general spirit of harsh disagreement coupled with pleasant tone.
And tone matters. I’m not going to mention the encryption thing again unnecessarily (despite it playing an absolutely critical role in my overall hope for our survival), because we are now in a PR role. My post on preventing polarization in the AI safety debate makes that case. I don’t like playing the PR game, but it’s time for us to learn it and do it, because polarization will prevent policy action, like it has on climate change.
Once again, I truly and deeply appreciate you mentioning the tone. It totally restored my enthusiasm for being part of this community as an imperfect but much better corner of the internet. I hope I haven’t been too harsh in response, because I am not overly pleasant by nature, but I have been trying to turn myself into a more pleasant person, while still saying important things when they’re useful.
I hope to have your input on future posts. Now, back to writing this one.
Curiously, it’s fallen off the front page quickly. I guess I should’ve written something in between the two in length and tone.
I certainly didn’t mean we should stop working on everything else, just that we should start thinking about this, and working on it if it shows the progress I think it will.
You mention that you’re unsure if the questions in this post were importantly missing from the conversation. If you know of prior discussion on these points, please let me know.
My impression is that the ease of agentizing LLMs was well recognized, but the potential large upsides and changes for alignment projects were not previously recognized in this community. It seems like if there were even a perceived decent chance of things going in that direction, we should’ve been hearing about it.
I’m going to do a more careful post and submit it to Alignment Forum. That should turn up prior work on this.
Did you know about “by default, GPTs think in plain sight”? It doesn’t explicitly talk about agentized GPTs, but it discusses the impact this has on using GPTs for AGI, how it affects the risks, and what we should do about it (e.g. maybe RLHF is dangerous).
Thank you. I think it is relevant. I just found it yesterday following up on this. The comment there by Gwern is a really interesting example of how we could accidentally introduce pressure for them to use steganography so their thoughts aren’t in English.
What I’m excited about is that agentizing them, while dangerous, could mean they not only think in plain sight, but are actually what gets used. That would cross from only being able to say how to get alignment, to making it something the world would actually do.
Here’s that more complete writeup, going into more depth on all of your object-level points above:
Capabilities and alignment of LLM cognitive architectures