All these drives do seem likely. But that's different from arguing that "help humans" isn't likely. I tend to think of the final objective function as some accumulation of all of these, with a relatively significant chunk placed on "help humans" (since, in training, that will consistently overrule other considerations like "be more efficient" when it comes to the final reward).
I think that by the logic “heuristic / drive / motive X always overrules heuristic / drive / motive Y when it comes to final reward,” the hierarchy is something like:
1. The drive / motive toward final reward (after all edits—see previous comment), or anything downstream of that (e.g. paperclips in the universe).
2. Various "pretty good" drives / motives, among which "help humans" could be one.
3. Drives / motives that are only kind of helpful, or only helpful in some situations.
4. Actively counterproductive drives / motives.
In this list, the earlier motives always overrule later motives when they conflict, because they are more reliable guides to the true reward. Even if "be genuinely helpful to humans" is the only thing in category 2, or the best thing in category 2, it's still overruled by category 1 -- and category 1 is quite big, because it includes all the caring-about-long-run-outcomes-in-the-real-world motives.
I still think AI psychology will be quite messy, and at least the first generation of transformative AI systems will not look like clean utility maximizers. But I think the basic argument above gives a positive reason to expect that honesty / corrigibility will play a smaller role in the balance of AI motivations than reward-maximizing and inner-misaligned motives.