Noosphere89 comments on Conflating value alignment and intent alignment is causing confusion

Noosphere89 Sep 10, 2024, 12:28 AM
4 points
0
I remain unclear on one thing: how do you expect to have a densely defined reinforcement signal?
Basically, via lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it’s done early in the training run, before it can try to deceive or manipulate us.
More generally, the abuse of synthetic data means we have complete control over the inputs to the AI model, which means we can very easily detect stuff like deception and takeover risk.
For example, we can feed RL and LLM agents information about interpretability techniques not working, despite them actually working, or feed them exploits that are both easy and large for misaligned AI to do that seem to work, but doesn’t actually work.
More here:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1
It’s best to make large synthetic datasets now, so that we can apply it continuously throughout AGI/ASI training, and in particular do it before it is capable of learning deceptiveness/training games.
I think mostly about AGI that is fully autonomous and therefore becomes coherent around its goals, for better or worse. I wonder if that might be another important difference of perspective. You said
it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I don’t understand why they wouldn’t intrinsically be agents after that RL training?
If that’s the type of scenario you’re addressing, I think that’s plausible for many AGI projects. But I think the same argument I make for LLMs and other “oracle” AGI: someone will turn it into a full real agent very soon; it will have more economic value, but even if it doesn’t, people will do it just for the hell of it, because it’s interesting.
With LLMs it’s as simple as repeating the prompt “keep working on that problem, pursuing goal X, using tools Y”. With another architecture, it might be a little different- but turning adequate intelligence into a true agent is almost trivial. Some monkey will pull that lever almost as soon as it’s available.
You’ve probably heard that argument somewhere before, so I may well be misunderstanding your scenario still.
I was just referring to this post on how RL policies aren’t automatically agents, without other assumptions. I agree that they will likely be agentized by someone if RL doesn’t agentize them, and I agree with your assumptions on why they will be agentic RL/LLM AIs.
https://www.lesswrong.com/posts/rmfjo4Wmtgq8qa2B7/think-carefully-before-calling-rl-policies-agents
Also, the argument against synthetic data working because raters make large amounts of compactly describable errors has evidence against it, at least in the data-constrained case.
Some relevant links are these:
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/#74DdsQ7wtDnx4ChDX
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/#R9Bfu6tzmuWRCT6DB
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/?commentId=AoxYQR9jLSLtjvLno#AoxYQR9jLSLtjvLno
At a broader level, my point is that even conditional on you being correct that fully autonomous AI that is coherent across goals will be trained by somebody soon, the path to being coherent and autonomous is both important and influenceable to be more aligned by us.

Thanks for the dialogue here, this is useful for my work on my draft post “how we’ll try to align AGI”.
And thank you for being willing to read so much. I will ask you to read more posts and comments here, so that I can finally explicate what exactly is the plan to align AGI via RL or LLMs, which is large synthetic datasets.
What links here?
- Seth Herd Sep 10, 2024, 7:16 PM
  4 points
  0
  Parent
  I just finished reading all of those links. I was familiar with Roger Dearnaleys’ proposal of synthetic data for alignment but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, but not fully thought through. It does seem to have a higher tax/effort than the methods so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
  I got caught up reading the top comment thread above the Turntrout/Wentworth exchange you linked. I’d somehow missed that by being off-grid when the excellent All the Shoggoths Merely Players came out. It’s my nomination for SOTA of the current alignment difficulty discussion.
  I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI- but then quite possibly tragically extinct ourselves with standard human combativeness/paranoia or foolishness—If we solve alignment, do we die anyway?
  I think you’ve read that and we’ve had a discussion there, but I’m leaving that link here as the next step in this discussion now that we’ve reached approximate convergence.
  - Noosphere89 Sep 10, 2024, 7:37 PM
    2 points
    0
    Parent
    I just finished reading all of those links. I was familiar with Roger Dearnaleys’ proposal of synthetic data for alignment but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, but not fully thought through. It does seem to have a higher tax/effort than the methods so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
    I agree it has a higher tax rate than RLHF, but to make the case for lower tax rates than people think, it’s because synthetic data will likely be a huge part of what makes AGI into ASI, as models require a lot of data, and synthetic data is a potentially huge industry in futures where AI progress is very high, because the amount of human data is both way too limiting for future AIs, and probably doesn’t show superhuman behavior like we want from LLMs/RL.
    Thus huge amounts of synthetic data will be heavily used as part of capabilities progress, meaning we can incentivize them to also put alignment data in the synthetic data.
    I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI- but then quite possibly tragically extinct ourselves with standard human combativeness/paranoia or foolishness—If we solve alignment, do we die anyway?
    This is why I think we will need to use targeted removals of capabilities like LEACE combined with using synthetic data to remove infohazardous knowledge, combined with not open-weighting/open-sourcing models as AIs get more capable and only allowing controlled API use.
    Here’s the LEACE paper and code:
    https://github.com/EleutherAI/concept-erasure/pull/2
    https://github.com/EleutherAI/concept-erasure
    https://github.com/EleutherAI/concept-erasure/releases/tag/v0.2.0
    https://arxiv.org/abs/2306.03819
    https://blog.eleuther.ai/oracle-leace
    I’ll reread that post again.
    - Seth Herd Sep 10, 2024, 10:26 PM
      2 points
      0
      Parent
      Agreed on the capabilities advantages of synthetic data; so it might not be much of a tax at all to mix in some alignment.
      I don’t think removing infohazardous knowledge will work all the way into dangerous AGI, but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
      Keeping a superintelligence ignorant of certain concepts sounds impossible. Even a “real AGI” of the type I expect soon will be able to reason and learn, causing it to rapidly rediscover any concepts you’ve carefully left out of the training set. Leaving out this relatively easy capability (to reason and learn online) will hurt capabilities, so you’d have a huge uphill battle in keeping it out of deployed AGI. At least one current projects have already accomplished limited (but impressive) forms of this as part of their strategy to create useful LM agents. So I don’t think it’s getting rolled back or left out.
      - Noosphere89 Sep 10, 2024, 10:39 PM
        4 points
        0
        Parent
        I agree with you that there are probably better methods to handle the misuse risk, and note I also pointed out them as options, not exactly guarantees.
        
        And yeah, I agree with this specifically:
        
        but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
        
        Thanks for mentioning that.
        
        Now that I think about it, I agree that it’s only a stop gap for misuse, and yeah if there is even limited generalization ability, I agree that LLMs will be able to rediscover dangerous knowledge, so we will need to make LLMs that don’t let users completely make bio-weapons for example.