I’m kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!
This will be my last comment here. Thank you for trying to explain why you disagree with me!
IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier.
I’m impressed that you passed my ITT. I think analogies to other alignment problems, like the human alignment problem, miss that humans are the most difficult setting; but you don’t need to play on that difficulty, because AI is very different from humans.
AFAICT, PHF doesn’t solve any of the core problems of alignment.
While I definitely overclaimed on how much it solves the alignment problem, I think this is underselling the accomplishments. It’s an incomplete solution, in that it doesn’t do everything on its own, but it does carry a lot of weight.
To talk about deceptive alignment more specifically: deceptive alignment is essentially where the AI isn’t aligned with human goals and tries to hide that fact. One of the key prerequisites for deceptive alignment is that the AI is optimizing a non-myopic goal. It’s the most dangerous form of misalignment, since we have an AI that is aligned only for instrumental, not terminal, reasons.
What Pretraining from Human Feedback did was finally marry a myopic objective with competitive capabilities. Once the myopic objective of conditional training is used, deceptive alignment goes away, since optimizing a non-myopic goal is one of its key prerequisites.
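To make the mechanism concrete, here is a minimal sketch of conditional training’s data preparation as I understand it from the PHF paper: each training segment is prefixed with a control token (here `<|good|>` or `<|bad|>`) depending on whether a reward model scores it above a threshold, and the objective remains ordinary next-token prediction (myopic and offline); at sampling time you condition on the good token. The token names, threshold, and toy reward function below are illustrative stand-ins, not the paper’s exact setup.

```python
# Sketch of conditional training's data annotation step (assumptions noted above).
GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.5

def toy_reward(text: str) -> float:
    # Stand-in for a real reward model (e.g. a toxicity classifier score).
    return 0.0 if "rude" in text else 1.0

def annotate(segments):
    """Prefix each segment with a control token based on its reward score."""
    return [
        (GOOD if toy_reward(s) >= THRESHOLD else BAD) + " " + s
        for s in segments
    ]

corpus = ["a polite reply", "a rude reply"]
print(annotate(corpus))
# The annotated corpus is then used for standard offline LM pretraining,
# and generation is later conditioned on the <|good|> prefix.
```

The key point for the deceptive-alignment argument is that nothing here optimizes for future consequences: the loss is still per-token prediction on a fixed dataset.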
I don’t see how this follows. IIUC, the proposition here is something like
If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can’t hack the humans.
Is that a reasonable representation of what you’re saying? If yes, consider: What if we replace “the AI” with “Anonymous” and “the humans” with “the web server”? Then we get
If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can’t hack the web server
...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn’t rely on hardware effects like rowhammering.
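The quoted point can be made concrete with a toy example: a service with a deliberately simple, well-defined interface (one command, `GET <name>`) that is nonetheless exploitable because of an implementation bug. Everything here is hypothetical and purely illustrative — a classic path-traversal sketch, not any real server.

```python
# Toy "file server": simple, well-understood protocol, still hackable.
import os
import tempfile

class ToyFileServer:
    def __init__(self, public_dir: str):
        self.public_dir = public_dir

    def handle(self, request: str) -> str:
        # The entire interface: "GET <filename>".
        verb, name = request.split(" ", 1)
        if verb != "GET":
            return "ERROR: unknown command"
        # Bug: no check that the resolved path stays inside public_dir.
        path = os.path.join(self.public_dir, name)
        try:
            with open(path) as f:
                return f.read()
        except OSError:
            return "ERROR: not found"

# Set up a public directory next to a "secret" file outside it.
root = tempfile.mkdtemp()
public = os.path.join(root, "public")
os.mkdir(public)
with open(os.path.join(public, "hello.txt"), "w") as f:
    f.write("hello")
with open(os.path.join(root, "secret.txt"), "w") as f:
    f.write("TOP SECRET")

server = ToyFileServer(public)
print(server.handle("GET hello.txt"))      # intended use
print(server.handle("GET ../secret.txt"))  # exploit, via the same simple interface
```

The interface never changed; the vulnerability lives entirely in the implementation behind it, which is the general shape of the objection above.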
This is definitely right, and I did overclaim here, though I do remember that Pretraining from Human Feedback claimed to do this:
Conditional training (as well as other PHF objectives) is purely offline: the LM is not able to affect its own training distribution. This is unlike RLHF, where the LM learns from self-generated data and thus is more likely to lead to risks from auto-induce distribution shift or gradient hacking.
This vindicates a narrower claim about the AI’s inability to affect its own training distribution, though I don’t know how much that supports my thesis about immunity to hacking.
To give another reason why I’m so optimistic about alignment: I think alignment is scalable. Put another way, while Pretraining from Human Feedback is imperfect right now, my view is that even with those imperfections it would avoid X-risk almost entirely, and small, consistent improvements in the vein of empirical work will eventually make it far more aligned than the original Pretraining from Human Feedback work. In the case of more data, they tested this, and it showed increasing alignment with more data.
To edit a quote from Thoth Hermes:
“Yudkowsky was wrong in the tendency to assume that certain abstractions just don’t apply whenever intelligence or capability is scaled way up.”
This essentially explains my issues with the idea that alignment isn’t scalable.