Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.
But if you just naively take the values that are appropriate outside of a life-and-death conflict and apply them to a life-and-death conflict, you're gonna lose. In that case, RLHF just makes you an irrelevant player, and if you insist on applying it to military/police technology, then AI safety has to pivot to addressing rogue states or gangsters.
Which again makes RLHF really really bad because we shouldn’t have to work with rogue states or gangsters to save the world. Don’t cripple the good guys.
I mean, if your definition of values doesn't make sense for real systems, then that's a problem with your definition. As a hypothesis describing reality, "an alignment trait makes AI not splash harm on humans" is coherent enough. So the question is: how do you know it's unlikely to happen?
If you propose a particular latent variable that acts in a particular way, that is a lot of complexity, and you need a strong case to justify it as likely.
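Roughly, under an Occam/Solomonoff-style prior, the prior probability of a hypothesis falls off exponentially with its description length:

$$P(H) \propto 2^{-K(H)}$$

where $K(H)$ is the number of bits needed to specify $H$. So every extra detail you bake in ("this particular latent variable, acting in this particular way") roughly halves the prior, and you need correspondingly strong evidence to earn it back.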
First, "alignment is easy" is compatible with "we need to keep the set of big adversaries small". But more generally, without numbers this seems like a generalized anti-future-technology argument: what's stopping human-regulation mechanisms from solving this adversarial problem that didn't stop them from solving previous adversarial problems?
Human-regulation mechanisms could plausibly solve this problem by banning chip fabs. The issue is we use chip fabs for all sorts of things, so we don't want to do that unless we are truly desperate.
Not necessarily? It's not inconceivable for future defense to be more effective than offense (trivially true if "defense" means not giving AI to attackers). It's kind of required for any future where humans have more power than in the present day?
Idk. Big entities have a lot of security vulnerabilities that could be attacked by AIs. But I guess one could argue the surviving big entities are red-teaming themselves hard enough to be immune to these. Perhaps most significant are the interactions between multiple independent big entities, since those interactions could be manipulated to harm them.
Small adversaries currently have a hard time exploiting these security vulnerabilities because intelligence is really expensive, but once intelligence becomes too cheap to meter, that is less of a problem.
You could heavily restrict the availability of AI, but that would be an invasive measure that's far off the current trajectory.