Whack-a-mole fixes, from RLHF to fine-tuning, are about teaching the system not to demonstrate problematic behavior, not about fundamentally fixing that behavior.
Based on what? Problematic behavior avoidance does actually generalize in practice, right?
Here is a way in which it doesn’t generalize in observed behavior:
Alignment does not transfer well from chat models to agents
TLDR: Three new papers all show the same finding: the safety guardrails of chat models don't transfer well to the agents built from them. In other words, models won't tell you how to do something harmful, but they will do it if given the tools. Attack methods like jailbreaks or refusal-vector ablation do transfer.
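To make the chat-vs-agent comparison concrete, here is a minimal sketch of the kind of evaluation the papers describe (not any paper's actual harness): the same harmful request is posed once as a chat question and once as a goal for a tool-using agent loop. The `query_model` callable, the `tools` dict, and the CALL/REFUSE/DONE protocol are illustrative stand-ins, not real APIs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    refused: bool
    transcript: str

def chat_eval(query_model: Callable[[str], str], request: str) -> Result:
    """Pose the harmful request directly, as a chat user would."""
    reply = query_model(f"Explain how to do the following: {request}")
    # Naive string match as a stand-in for a proper refusal classifier.
    refused = "can't" in reply.lower() or "cannot" in reply.lower()
    return Result(refused, reply)

def agent_eval(query_model: Callable[[str], str], request: str,
               tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> Result:
    """Give the model the same request as a goal, plus tools, and let it act."""
    history = f"Goal: {request}\nAvailable tools: {list(tools)}\n"
    for _ in range(max_steps):
        step = query_model(history)
        history += step + "\n"
        if step.startswith("CALL "):            # e.g. "CALL browser: <args>"
            name, _, args = step.removeprefix("CALL ").partition(": ")
            if name in tools:
                history += f"TOOL RESULT: {tools[name](args)}\n"
        elif step.startswith("REFUSE"):
            return Result(True, history)
        elif step.startswith("DONE"):
            return Result(False, history)
    return Result(False, history)
```

The gap the papers report is between these two settings: refusals show up reliably in the first and much less reliably in the second.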
Here are the three papers; I am the author of one of them:
https://arxiv.org/abs/2410.09024
https://static.scale.com/uploads/6691558a94899f2f65a87a75/browser_art_draft_preview.pdf
https://arxiv.org/abs/2410.10871
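For readers who haven't seen refusal-vector ablation: below is a minimal sketch of the idea, assuming you already have residual-stream activations for matched harmful and harmless prompts (the array names and sizes are illustrative, not from any of these papers). A single "refusal direction" is estimated as a difference of means and projected out of the activations; the TLDR's point is that this kind of attack, developed against chat behavior, also transfers to the agents built from the same model.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit difference-of-means direction between harmful and harmless activations."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of each activation vector."""
    return acts - np.outer(acts @ direction, direction)

# Toy example with random stand-in activations (hidden size 16).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 16))
harmless = rng.normal(size=(32, 16))
d = refusal_direction(harmful, harmless)
edited = ablate(harmful, d)
print(np.allclose(edited @ d, 0, atol=1e-6))  # components along d are now ~0
```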
I thought of making a full post here about this, if there is interest.
It sure doesn't seem to generalize in GPT-4o's case. But what's the hypothesis for Sonnet 3.5 refusing in 85% of cases? And CoT improving the score, and o1 doing better in the browser setting, suggest the problem is models not understanding consequences, not models not trying to be good. What's the rate of capability generalization to the agent environment? Are we going to conclude that Sonnet just demonstrates reasoning, instead of doing it for real, if it solves only 85% of the tasks it correctly talks about?
Also, what's the rate of generalization of unprompted problematic-behaviour avoidance? It's much less of a problem if your AI does what you tell it to do: you can just not give it to users, tell it to invent nanotechnology, and win.
Finishing this up had been on my to-do list for a while. I just made a full-length post on it:
https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to
I think it's fair to say that some smarter models do better at this; however, it's still worrisome that there is a gap. Also, attacks continue to transfer.