I just finished reading all of those links. I was familiar with Roger Dearnaley’s proposal of using synthetic data for alignment, but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, though I haven’t fully thought it through. It does seem to have a higher tax/effort than the other methods, so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
I got caught up reading the top comment thread above the Turntrout/Wentworth exchange you linked. I’d somehow missed that by being off-grid when the excellent All the Shoggoths Merely Players came out. It’s my nomination for SOTA of the current alignment difficulty discussion.
I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI, but then quite possibly, tragically, drive ourselves extinct through standard human combativeness/paranoia or foolishness (If we solve alignment, do we die anyway?).
I think you’ve read that and we’ve had a discussion there, but I’m leaving that link here as the next step in this discussion now that we’ve reached approximate convergence.
I agree it has a higher tax rate than RLHF, but here’s the case that the tax is lower than people think: synthetic data will likely be a huge part of what turns AGI into ASI, since models require a lot of data. Synthetic data is a potentially huge industry in futures where AI progress is very fast, because human data is both far too limited for future AIs and probably doesn’t demonstrate the superhuman behavior we want from LLMs/RL.
Thus huge amounts of synthetic data will be heavily used as part of capabilities progress, which means we can incentivize labs to also put alignment data into that synthetic data.
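To make that concrete, here’s a purely illustrative sketch of what mixing alignment data into a synthetic-data pipeline could look like. The generator functions and the 5% alignment fraction are hypothetical placeholders, not anyone’s actual pipeline:

```python
import random

# Hypothetical stand-ins for whatever synthetic-data generators a lab actually runs.
def generate_capability_example() -> str:
    return "synthetic reasoning/coding example"

def generate_alignment_example() -> str:
    return "synthetic example demonstrating honest, corrigible, non-harmful behavior"

def build_training_mix(n_examples: int, alignment_fraction: float = 0.05) -> list[str]:
    """Interleave alignment-targeted examples into a capability-focused synthetic corpus."""
    mix = []
    for _ in range(n_examples):
        if random.random() < alignment_fraction:
            mix.append(generate_alignment_example())
        else:
            mix.append(generate_capability_example())
    return mix

corpus = build_training_mix(100_000, alignment_fraction=0.05)
```

The point is just that once labs are generating most of their training data anyway, adding an alignment-targeted slice looks more like a config change than a separate, expensive effort.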
That risk of misuse is why I think we will need targeted removal of capabilities with methods like LEACE, combined with synthetic data that leaves out infohazardous knowledge, and with not open-weighting/open-sourcing models as AIs get more capable, allowing only controlled API use.
Here’s the LEACE paper and code:
https://github.com/EleutherAI/concept-erasure/pull/2
https://github.com/EleutherAI/concept-erasure
https://github.com/EleutherAI/concept-erasure/releases/tag/v0.2.0
https://arxiv.org/abs/2306.03819
https://blog.eleuther.ai/oracle-leace
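For reference, here’s roughly what applying it looks like with the concept-erasure library. This is a minimal sketch adapted from the repo’s README; the toy data and shapes are mine, so check the repo for the current API:

```python
import torch
from sklearn.datasets import make_classification
from concept_erasure import LeaceEraser  # pip install concept-erasure

# Toy stand-in data: 128-dim "hidden states" X with a binary concept label Z to erase.
X_np, Z_np = make_classification(n_samples=2048, n_features=128, n_informative=8, random_state=0)
X = torch.from_numpy(X_np).float()
Z = torch.from_numpy(Z_np)

# Fit a least-squares concept eraser on (representation, concept) pairs...
eraser = LeaceEraser.fit(X, Z)

# ...and apply it, so that no linear probe can recover Z from the edited representations.
X_erased = eraser(X)
```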
I’ll reread that post again.
Agreed on the capabilities advantages of synthetic data, so it might not be much of a tax at all to mix in some alignment data.
I don’t think removing infohazardous knowledge will work all the way into dangerous AGI, but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
Keeping a superintelligence ignorant of certain concepts sounds impossible. Even a “real AGI” of the type I expect soon will be able to reason and learn, letting it rapidly rediscover any concepts you’ve carefully left out of the training set. Leaving out that relatively easy capability (reasoning and learning online) would hurt capabilities, so you’d face a huge uphill battle keeping it out of deployed AGI. At least one current project has already accomplished limited (but impressive) forms of it as part of a strategy to create useful LM agents, so I don’t think it’s getting rolled back or left out.
I agree with you that there are probably better methods to handle the misuse risk, and note that I also pointed these out as options, not guarantees.
And yeah, I agree with this specifically: “but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.”
Thanks for mentioning that.
Now that I think about it, I agree that it’s only a stopgap for misuse. And yeah, if there is even limited generalization ability, LLMs will be able to rediscover dangerous knowledge, so we will need to make LLMs that won’t, for example, let users go all the way to making bioweapons.