One way they could do that is by pitting the model against modified versions of itself, as OpenAI did with OpenAI Five (their Dota 2 agent).
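To make that setup concrete, here is a toy sketch of the idea (not OpenAI Five's actual training code, which ran PPO at massive scale over the Dota 2 environment): a learner is repeatedly matched against a rolling pool of perturbed snapshots of itself, so the opposition it has to beat keeps improving as it does. The `Policy` class, the scalar "skill", and the update rule below are stand-ins I'm assuming purely for illustration.

```python
import copy
import random


class Policy:
    """Toy stand-in for a learned policy (e.g. a neural network)."""

    def __init__(self, skill=0.0):
        self.skill = skill

    def mutated_copy(self):
        """Return a perturbed clone: a 'modified version' of this policy."""
        clone = copy.deepcopy(self)
        clone.skill += random.gauss(0.0, 0.1)
        return clone


def play_match(a, b):
    """Toy match: the higher-'skill' policy wins more often (Elo-style)."""
    p_a_wins = 1.0 / (1.0 + 10.0 ** ((b.skill - a.skill) / 0.5))
    return a if random.random() < p_a_wins else b


def self_play_training(generations=1000, pool_size=10):
    """Train a learner against a rolling pool of modified snapshots of itself."""
    learner = Policy()
    pool = [learner.mutated_copy() for _ in range(pool_size)]
    for _ in range(generations):
        opponent = random.choice(pool)
        if play_match(learner, opponent) is opponent:
            # Crude 'update': move the learner toward the opponent that beat it.
            learner.skill += 0.5 * (opponent.skill - learner.skill)
        # Refresh the pool with a modified snapshot of the current learner,
        # so the opposition keeps getting stronger alongside the learner.
        pool.append(learner.mutated_copy())
        pool.pop(0)
    return learner


if __name__ == "__main__":
    print(f"final skill: {self_play_training().skill:.2f}")
```

The point of the sketch is the incentive structure, not the arithmetic: whatever the learner becomes, it becomes by out-competing ever-stronger variants of itself.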
From a minimizing-X-risk perspective, this kind of adversarial self-play might be the worst possible way to train AIs.
As Jeff Clune (Uber AI) put it:
[O]ne can imagine that some ways of configuring AI-GAs (i.e. ways of incentivizing progress) that would make AI-GAs more likely to succeed in producing general AI also make their value systems more dangerous. For example, some researchers might try to replicate a basic principle of Darwinian evolution: that it is ‘red in tooth and claw.’
If a researcher tried to catalyze the creation of an AI-GA by creating conditions similar to those on Earth, the results might be similar. We might thus produce an AI with human vices, such as violence, hatred, jealousy, deception, cunning, or worse, simply because those attributes make an AI more likely to survive and succeed in a particular type of competitive simulated world. Note that one might create such an unsavory AI unintentionally by not realizing that the incentive structure they defined encourages such behavior.
Additionally, if you train a language model to outsmart millions of increasingly intelligent copies of itself, you might end up with the perfect AI-box escape artist.