One would hope that GPT-7 would achieve accurate predictions about what humans do because it is basically a human. Its algorithm is "OK, what would a typical human do?"
However, another possibility is that GPT-7 is actually much smarter than a typical human in some sense. Maybe it has a deep understanding of all the different kinds of humans, rather than just a typical human, and maybe it has some sophisticated judgment for which kind of human to mimic depending on the context. In this case it probably isn't best understood as a set of humans with an algorithm to choose between them, but rather as something alien and smarter than humans that mimics them, in the way that e.g. a human actress might mimic some large set of animals.
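To make the contrast between these two pictures concrete, here's a deliberately toy Python sketch. Everything in it (the persona table, the keyword heuristic, all the function names) is my own illustrative invention, not a claim about how GPT-7 would actually work:

```python
# Picture 1: one fixed policy. Picture 2: infer which kind of human the
# context calls for, then mimic that persona. Purely illustrative stand-ins.

PERSONAS = {
    "typical": lambda ctx: "Okay, let me think about that.",
    "formal":  lambda ctx: "I would be happy to assist with that.",
    "casual":  lambda ctx: "sure, sounds good!",
}

def typical_human_predictor(context: str) -> str:
    # Picture 1: always "what would a typical human say?"
    return PERSONAS["typical"](context)

def infer_persona(context: str) -> str:
    # Picture 2's extra step: judge which kind of human the context calls for.
    # (A crude keyword check stands in for the "sophisticated judgment".)
    return "formal" if context.startswith("Dear") else "casual"

def persona_selecting_predictor(context: str) -> str:
    # Picture 2: less "a typical human", more an actress playing whichever
    # character the scene demands.
    return PERSONAS[infer_persona(context)](context)

print(typical_human_predictor("Dear Sir,"))      # same policy regardless of context
print(persona_selecting_predictor("Dear Sir,"))  # context-dependent mimicry
```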
Using Evan's classification: we don't know how training-competitive GPT-7 is, but it's probably pretty good on that front. It's probably not very performance-competitive, because even if all goes well it just acts like a typical human. It has the standard inner alignment issues (what if it is deceptively aligned? What if it actually does have long-term goals, and pretends not to, since it realizes that's the only way to achieve them? Though perhaps those worries have less force since its training is so short-horizon (myopic, I think, is the term)). And finally, I think the issue pointed to by "The universal prior is malign" (i.e. probable environment hacking) is big enough to worry about here.
In light of all this, I don't know how to ensure its safety. I would guess that some of the techniques Evan talks about might help, but I'd have to go through them again to refamiliarize myself.