RLHF does not solve the alignment problem because humans can’t provide good-enough feedback fast enough.
Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.
As mentioned in the beginning, I think the intuition goes that neural networks have a personality trait which we call “alignment”, caused by the correspondence between their values and our values. But “their values” only really makes sense after an unrestricted AI vs AI conflict; without such conflicts, AIs are just gonna propagate energy to whichever constraints we point them at. So this whole worldview is wrong.
I mean, if your definition of values doesn’t make sense for real systems, then that’s a problem with your definition. As a hypothesis describing reality, “an alignment trait makes AI not splash harm on humans” is coherent enough. So the question is: how do you know it is unlikely to happen?
This has not led to the destruction of humanity yet because the biggest adversaries have kept their conflicts limited (because too much conflict is too costly), so no entity has pursued an end by any means necessary. But this only works because there’s a sufficiently small number of sufficiently big adversaries (USA, Russia, China, …), and because the opportunity cost of conflict is sufficiently high.
First, “alignment is easy” is compatible with “we need to keep the set of big adversaries small”. But more generally, without numbers this seems like a generalized anti-future-technology argument: what’s stopping human-regulation mechanisms from solving this adversarial problem that didn’t stop them from solving previous adversarial problems?
AI makes conflict more viable for small adversaries against large adversaries.
Not necessarily? It’s not inconceivable for future defense to be more effective than offense (trivially true if “defense” means not giving AI to attackers). It’s kind of required for any future where humans have more power than in the present day?
Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.
But if you just naively take the values that are appropriate outside of a life-and-death conflict and apply them to a life-and-death conflict, you’re gonna lose. In that case, RLHF just makes you an irrelevant player, and if you insist on applying it to military/police technology, it becomes necessary for AI safety to pivot to addressing rogue states or gangsters.
Which again makes RLHF really, really bad, because we shouldn’t have to work with rogue states or gangsters to save the world. Don’t cripple the good guys.
I mean, if your definition of values doesn’t make sense for real systems, then that’s a problem with your definition. As a hypothesis describing reality, “an alignment trait makes AI not splash harm on humans” is coherent enough. So the question is: how do you know it is unlikely to happen?
If you propose a particular latent variable that acts in a particular way, that is a lot of complexity, and you need a strong case to justify it as likely.
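To put a rough number on “a lot of complexity” (a minimal sketch, assuming a description-length, Occam-style prior, which isn’t something this thread otherwise commits to): under such a prior,

$$P(H) \propto 2^{-L(H)},$$

where $L(H)$ is the number of bits needed to specify hypothesis $H$. Every extra bit of specification halves the prior probability, so a hypothesis that needs, say, 20 extra bits of structure (a particular latent trait, acting in a particular way, generalizing in a particular direction) starts out roughly $2^{20} \approx 10^6$ times less likely than a maximally simple alternative, and the evidence has to supply a likelihood ratio of about that size before the hypothesis becomes favored.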
First, “alignment is easy” is compatible with “we need to keep the set of big adversaries small”. But more generally, without numbers this seems like a generalized anti-future-technology argument: what’s stopping human-regulation mechanisms from solving this adversarial problem that didn’t stop them from solving previous adversarial problems?
Human-regulation mechanisms could plausibly solve this problem by banning chip fabs. The issue is that we use chip fabs for all sorts of things, so we don’t want to do that unless we are truly desperate.
Not necessarily? It’s not inconceivable for future defense to be more effective than offense (trivially true if “defense” means not giving AI to attackers). It’s kind of required for any future where humans have more power than in the present day?
Idk. Big entities have a lot of security vulnerabilities that could be attacked by AIs. But I guess one could argue the surviving big entities are red-teaming themselves hard enough to be immune to these. Perhaps most significant are the interactions between multiple independent big entities, since those interactions could be manipulated to harm them.
Small adversaries currently have a hard time exploiting these security vulnerabilities because intelligence is really expensive, but once intelligence becomes too cheap to meter, that is less of a problem.
You could heavily restrict the availability of AI, but this would be an invasive measure that’s far off the current trajectory.