Seth Herd comments on Conflating value alignment and intent alignment is causing confusion

Seth Herd 10 Sep 2024 22:26 UTC
2 points
0
Agreed on the capabilities advantages of synthetic data; so it might not be much of a tax at all to mix in some alignment.
I don’t think removing infohazardous knowledge will work all the way into dangerous AGI, but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
Keeping a superintelligence ignorant of certain concepts sounds impossible. Even a “real AGI” of the type I expect soon will be able to reason and learn, causing it to rapidly rediscover any concepts you’ve carefully left out of the training set. Leaving out this relatively easy capability (to reason and learn online) will hurt capabilities, so you’d have a huge uphill battle in keeping it out of deployed AGI. At least one current projects have already accomplished limited (but impressive) forms of this as part of their strategy to create useful LM agents. So I don’t think it’s getting rolled back or left out.
- Noosphere89 10 Sep 2024 22:39 UTC
  4 points
  0
  Parent
  I agree with you that there are probably better methods to handle the misuse risk, and note I also pointed out them as options, not exactly guarantees.
  
  And yeah, I agree with this specifically:
  
  but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
  
  Thanks for mentioning that.
  
  Now that I think about it, I agree that it’s only a stop gap for misuse, and yeah if there is even limited generalization ability, I agree that LLMs will be able to rediscover dangerous knowledge, so we will need to make LLMs that don’t let users completely make bio-weapons for example.