Charlie Steiner comments on Self-Supervised Learning and AGI Safety

Charlie Steiner 19 Aug 2019 19:28 UTC
2 points
The search thing is a little subtle. It’s not that search or optimization is automatically dangerous; the danger, I think, is that search can turn up adversarial examples or surprising solutions.
I mentioned that the particular kind of idiot-proofness natural language processing might have is “won’t tell an idiot a plan to blow up the world if they ask for something else.” Well, as soon as the AI is doing a deep search through outcomes to figure out how to make Alzheimer’s go away, I think you lose a lot of that protection, and the AI is back in the category of Oracles that might tell an idiot a plan to blow up the world.
Going beyond human knowledge
You make some good points about even a text-only AI having optimization pressure to surpass humans. But for the example “GPT-3” system, even if it in some sense “understood” the cure for Alzheimer’s, it still wouldn’t tell you that cure in response to a prompt, because it’s trying to find the continuation of the prompt that has the highest probability under the training distribution.
The point isn’t about text vs. video. The point is about the limitations of trying to learn the training distribution.
To the extent that understanding the world will help the AI learn the training distribution, in the limit of super-duper-intelligent AI it will understand more and more about the world. But it will filter that all through the intent to learn the training distribution. For example, if human text isn’t trustworthy on a certain topic, it will learn to not be trustworthy on that topic either.
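As a toy illustration of that last point (not a model of GPT-style systems), the sketch below fits a count-based predictor to a made-up corpus and returns the most probable continuation of a prompt. The corpus, the prompt, and the function names are all hypothetical; the only claim is that a predictor fit to the training distribution returns what the corpus usually says, whether or not it is true.

```python
from collections import Counter, defaultdict


def fit(corpus):
    """Count how often each continuation follows each prompt in the corpus."""
    counts = defaultdict(Counter)
    for prompt, continuation in corpus:
        counts[prompt][continuation] += 1
    return counts


def probability(counts, prompt, continuation):
    """Empirical probability of a continuation given a prompt seen in training."""
    total = sum(counts[prompt].values())
    return counts[prompt][continuation] / total


def most_likely_continuation(counts, prompt):
    """Return the continuation with the highest empirical probability."""
    continuation, _ = counts[prompt].most_common(1)[0]
    return continuation


# Hypothetical corpus: most human text says the cure is unknown,
# and a minority of documents mention some "compound Q".
PROMPT = "the cure for X is"
corpus = [(PROMPT, "still unknown")] * 8 + [(PROMPT, "compound Q")] * 2

counts = fit(corpus)
answer = most_likely_continuation(counts, PROMPT)
minority_probability = probability(counts, PROMPT, "compound Q")
```

Here `answer` is the majority continuation. Even if the minority continuation happened to be correct, a predictor that only aims to match the corpus would not prefer it.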
- Steven Byrnes 21 Aug 2019 14:13 UTC
  2 points
  Thanks, that’s helpful!
  
  The way I’m currently thinking about it, if we have an oracle that gives superintelligent and non-manipulative answers, things are looking pretty good for the future. When you ask it to design a new drug, you also ask some follow-up questions like “How does the drug work?” and “If we deploy this solution, how might this impact the life of a typical person in 20 years’ time?” Maybe it won’t always be able to give great answers, but as long as it’s not trying to be manipulative, it seems like we ought to be able to use such a system safely. (This would, incidentally, entail not letting idiots use the system.)
  
  I agree that extracting information from a self-supervised learner is a hard and open problem, but I don’t see any reason to think it’s impossible. The two general approaches would be:
  1. Manipulate the self-supervised learning environment somehow. Basically, the system is going to know lots of different high-level contexts in which the statistics of low-level predictions are different: think about how GPT-2 can imitate both middle school essays and fan-fiction. We would need to teach it a context in which we expect the text to reflect profound truths about the world, beyond what any human knows. That’s tricky because we don’t have any such texts in our database. But maybe if we put a special token in the 50 clearest and most insightful journal articles ever written, and then stick that same token in our question prompt, then we’ll get better answers. That’s just an example; maybe there are other ways.
  2. Forget about text prediction, and build an entirely separate input-output interface into the world model. The world model (if it’s vaguely brain-like) is “just” a data structure with billions of discrete concepts, and transformations between those concepts (composition, cause-effect, analogy, etc.; probably all of those are built out of the same basic “transformation machinery”). All these concepts are sitting in the top layer of some kind of hierarchy, whose lowest layer consists of probability distributions over short snippets of text (for a language model, or more generally whatever the input is). So that’s the world model data structure. I have no idea how to build a new interface into this data structure, or what that interface would look like. But I can’t see why that should be impossible...
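The special-token idea in approach 1 can be sketched with a toy next-word predictor that conditions on the whole prefix. This is only a minimal illustration of the conditioning mechanism. The marker `<|insightful|>`, the example documents, and the function names are all hypothetical, and the sketch says nothing about the hard part, which is that no training texts go beyond what humans know.

```python
from collections import Counter, defaultdict

QUALITY_TOKEN = "<|insightful|>"  # hypothetical marker, not a real tokenizer token


def tag(text, is_curated):
    """Prepend the marker to curated documents (or to a prompt)."""
    return f"{QUALITY_TOKEN} {text}" if is_curated else text


def train(documents):
    """Count next words, conditioning on the full prefix seen so far."""
    counts = defaultdict(Counter)
    for doc in documents:
        words = doc.split()
        for i in range(1, len(words)):
            counts[tuple(words[:i])][words[i]] += 1
    return counts


def predict_next(counts, prompt):
    """Most frequent next word after a prompt that appeared in training."""
    context = tuple(prompt.split())
    return counts[context].most_common(1)[0][0]


# Hypothetical corpus: many ordinary documents, a few curated ones.
documents = (
    [tag("the result is sloppy", is_curated=False)] * 5
    + [tag("the result is rigorous", is_curated=True)] * 2
)

model = train(documents)
plain = predict_next(model, "the result is")
tagged = predict_next(model, tag("the result is", is_curated=True))
```

With these made-up documents, `plain` follows the majority of the corpus, while `tagged` follows the statistics of the curated subset, because the marker is part of the context the predictor conditions on.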