Charlie Steiner comments on Self-Supervised Learning and AGI Safety

Charlie Steiner 10 Aug 2019 6:41 UTC
7 points
This is definitely an interesting topic, and I’ll eventually write a related post, but here are my thoughts at the moment.
1 - I agree that using natural language prompts with systems trained on natural language makes it much easier to get common-sense answers. This is a particular sort of idiot-proofing, one that prevents the hypothetical idiot from having the AI tell them how to blow up the world. You use the example of “How would we be likely to cure Alzheimer’s?”, but for a well-trained natural language Oracle, you could even ask “How should we cure Alzheimer’s?”
If it were an outcome pump with no particular knowledge of humans, it would give you a plan that would set off our nuclear arsenals. A superintelligent search process with an impact penalty would tell you how to engineer a very unobtrusive virus. A perfect world model with no special knowledge of humans would tell you a series of configurations of quantum fields. These are all bad answers.
What you want the Oracle to tell you is the sort of plan that might practically be carried out, or some other useful information, that leads to an Alzheimer’s cure in the normal way that people mean when talking about diseases and research and curing things. Any model that does a good job predicting human natural language will take this sort of thing for granted in more or less the way you want it to.
2 - But here’s the problem with curing Alzheimer’s: it’s hard. If you train GPT-3 on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer’s, it won’t tell you a cure, it will tell you what humans have said about curing Alzheimer’s.
If you train a simultaneous model (like a neural net or a big transformer or something) of human words, plus sensor data of the surrounding environment (like how an image-captioning AI can be thought of as having a simultaneous model of words and pictures), and figure out how to control the amount of detail of verbal output, you might be able to prompt an AI with text about an Alzheimer’s cure, have it model a physical environment that it expects those words to take place in, and then translate that back into text describing the predicted environment in detail. But it still wouldn’t tell you a cure. It would just tell you a plausible story about a situation related to the prompt about curing Alzheimer’s, based on its training data. Rather than a logical Oracle, this image-captioning-esque scheme would be an intuitive Oracle, telling you things that make sense based on associations already present within the training set.
What am I driving at here, by pointing out that curing Alzheimer’s is hard? It’s that the designs above are missing something, and what they’re missing is search.
I’m not saying that getting a neural net to directly output your cure for Alzheimer’s is impossible. But it seems like it requires there to already be a “cure for Alzheimer’s” dimension in your learned model. The more realistic way to find the cure for Alzheimer’s, if you don’t already know it, is going to involve lots of logical steps one after another, slowly moving through a logical space, narrowing down the possibilities more and more, and eventually finding something that fits the bill. In other words, solving a search problem.
So if your AI can tell you how to cure Alzheimer’s, I think either it’s explicitly doing a search for how to cure Alzheimer’s (or worlds that match your verbal prompt the best, or whatever), or it has some internal state that implicitly performs a search.
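As a minimal sketch of what “explicitly doing a search” could mean, here is a breadth-first search over a made-up graph of logical steps. The graph, its node names, and the `find_path` helper are all hypothetical; a real system would be searching a vastly larger and less tidy space.

```python
from collections import deque

# Hypothetical graph of "logical steps": each state leads to the listed next states.
STEPS = {
    "question": ["hypothesis A", "hypothesis B"],
    "hypothesis A": ["dead end"],
    "hypothesis B": ["experiment", "dead end"],
    "experiment": ["candidate answer"],
}

def find_path(start, goal, steps):
    """Breadth-first search; returns a shortest chain of steps, or None."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in steps.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # goal not reachable from start

path = find_path("question", "candidate answer", STEPS)
print(" -> ".join(path))
```

The point of the sketch is only that the answer is reached by moving step by step and pruning dead ends, rather than by reading it off a single learned association.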
And once you realize you’re imagining an AI that’s doing search, maybe you should feel a little less confident in the idiot-proofness I talked about in section 1. Maybe you should be concerned that this search process might turn up the equivalent of adversarial examples in your representation.
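Here is a deliberately contrived sketch of that worry. The plans, the proxy scores, and the order in which plans get examined are all invented for illustration. The only claim is structural: the harder you optimize a proxy, the more likely you are to land on a candidate where the proxy and your intent come apart.

```python
# Hypothetical plans: (description, proxy_score, acceptable_to_humans).
# proxy_score is how well the plan matches the literal request according to
# some learned representation. All numbers are made up.
PLANS = [
    ("fund more clinical trials", 0.60, True),
    ("develop an early screening test", 0.70, True),
    ("new drug targeting a known pathway", 0.80, True),
    ("redefine the diagnosis so nobody qualifies", 0.99, False),
]

def best_plan(plans, budget):
    """Return the highest-proxy-score plan among the first `budget` examined."""
    examined = plans[:budget]
    return max(examined, key=lambda plan: plan[1])

shallow = best_plan(PLANS, budget=3)       # a limited search
deep = best_plan(PLANS, budget=len(PLANS))  # an exhaustive search

print("shallow search:", shallow[0], "| acceptable:", shallow[2])
print("deep search:   ", deep[0], "| acceptable:", deep[2])
```

In this toy setup the shallow search returns a sensible plan and the deep search returns the one that games the proxy, which is the sense in which more search can erode the idiot-proofing from section 1.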
3 - Whenever I see a proposal for an Oracle, I tend to try to jump to the end—can you use this Oracle to immediately construct a friendly AI? If not, why not?
A perfect Oracle would, of course, immediately give you FAI. You’d just ask it “what’s the code for a friendly AI?”, and it would tell you, and you would run it.
Can you do the same thing with this self-supervised Oracle you’re talking about? Well, there might be some problems.
One problem is the search issue I just talked about: outputting functioning code with a specific purpose is a very search-y sort of thing to do, and not a very big-ol’-neural-net thing to do, even more so than outputting a cure for Alzheimer’s. So maybe you don’t fully trust the output of this search, or maybe there’s no search and your AI is just incapable of doing the task.
But I think this is a bit of a distraction, because the basic question is whether you trust this Oracle with simple questions about morality. If you think the AI is just regurgitating an average answer to trolley problems or whatever, should you trust it when you ask for the FAI’s code?
There’s an interesting case to be made for “yes, actually,” here, but I think most people will be a little wary. And this points to a more general problem with definitions: any time you care about a definition having some particularly nice properties beyond being most predictive of the training data, maybe you can’t trust this AI.
- Steven Byrnes 11 Aug 2019 2:34 UTC
  2 points
  Thanks for this really helpful comment!!
  
  Search: I don’t think search is missing from self-supervised learning at all (though I’m not sure if GPT-2 is that sophisticated). In fact, I think it will be an essential, ubiquitous part of self-supervised learning systems of the future.
  
  So when you say “The proof of this theorem is _____”, and give the system a while to think about it, it uses the time to search through its math concept space, inventing new concepts and building new connections and eventually outputting its guess.
  
  Just because it’s searching doesn’t mean it’s dangerous. I was just writing code to search through a string for a substring...no big deal, right? A world-model is a complicated data structure, and we can search for paths through this data structure just like any other search problem. Then when a solution to the search problem is found, the result is (somehow) printed to the terminal. I would be generically concerned here about things like (1) The search algorithm “decides” to seize more computing power to do a better search, or (2) the result printed to the terminal is manipulative. But (1) seems unlikely here, or if not, just use a known search algorithm you understand! For (2), I don’t see a path by which that would happen, at least under the constraints I mentioned in the post. Or is there something else you had in mind?
  
  Going beyond human knowledge: When you write “it will tell you what humans have said”, I’m not sure what you’re getting at. I don’t think this is true even with text-only data. I see three requirements to get beyond what humans know:
  
  (1) System has optimization pressure to understand the world better than humans do
  
  (2) System is capable of understanding the world better than humans do
  
  (3) The interface to the model allows us to extract information that goes beyond what humans already know.
  
  I’m pretty confident in all three of these. For example, for (1), give the system a journal article that says “We looked at the treated cell in the microscope and it appeared to be ____”. The system is asked to predict the blank. It does a better job at this prediction task by understanding biology better and better, even after it understands biology better than any human. By the same token, for (3), just ask a similar question for an experiment that hasn’t yet been done. For (2), I assume we’ll eventually invent good enough algorithms for that. What’s your take?
  
  (I do agree that videos and images make it easier for the system to exceed human knowledge, but I don’t think it’s required. After all, blind people are able to have new insights.)
  
  Ethics & FAI: I assume that a self-supervised learning system would understand concepts in philosophy and ethics just like it understands everything else. I hope that, with the right interface, we can ask questions about the compatibility of our decisions with our professed principles, arguments for and against particular principles, and so on. I’m not sure we should expect or want an oracle to outright endorse any particular theory of ethics, or any particular vision for FAI. I think we should ask more specific questions than that. Outputting code for FAI is a tricky case because even a superintelligent non-manipulative oracle is not omniscient; it can still screw up. But it could be a big help, especially if we can ask lots of detailed follow-up questions about a proposed design and always get non-manipulative answers.
  
  Let me know if I misunderstood you, or any other thoughts, and thanks again!
  - Charlie Steiner 19 Aug 2019 19:28 UTC
    2 points
    The search thing is a little subtle. It’s not that search or optimization is automatically dangerous; the danger I have in mind is that search can turn up adversarial examples / surprising solutions.
    I mentioned how I think the particular kind of idiot-proofness that natural language processing might have is “won’t tell an idiot a plan to blow up the world if they ask for something else.” Well, I think that as soon as the AI is doing a deep search through outcomes to figure out how to make Alzheimer’s go away, you lose a lot of that protection and I think the AI is back in the category of Oracles that might tell an idiot a plan to blow up the world.
    Going beyond human knowledge
    You make some good points about even a text-only AI having optimization pressure to surpass humans. But for the example “GPT-3” system, even if it in some sense “understood” the cure for Alzheimer’s, it still wouldn’t tell you the cure for Alzheimer’s in response to a prompt, because it’s trying to find the continuation of the prompt with highest probability in the training distribution.
    The point isn’t about text vs. video. The point is about the limitations of trying to learn the training distribution.
    To the extent that understanding the world will help the AI learn the training distribution, in the limit of super-duper-intelligent AI it will understand more and more about the world. But it will filter that all through the intent to learn the training distribution. For example, if human text isn’t trustworthy on a certain topic, it will learn to not be trustworthy on that topic either.
    - Steven Byrnes 21 Aug 2019 14:13 UTC
      2 points
      Thanks, that’s helpful!
      
      The way I’m currently thinking about it, if we have an oracle that gives superintelligent and non-manipulative answers, things are looking pretty good for the future. When you ask it to design a new drug, you also ask some follow-up questions like “How does the drug work?” and “If we deploy this solution, how might this impact the life of a typical person in 20 years’ time?” Maybe it won’t always be able to give great answers, but as long as it’s not trying to be manipulative, it seems like we ought to be able to use such a system safely. (This would, incidentally, entail not letting idiots use the system.)
      
      I agree that extracting information from a self-supervised learner is a hard and open problem. I don’t see any reason to think it’s impossible. The two general approaches would be:
      
      (1) Manipulate the self-supervised learning environment somehow. Basically, the system is going to know lots of different high-level contexts in which the statistics of low-level predictions are different; think about how GPT-2 can imitate both middle school essays and fan-fiction. We would need to teach it a context in which we expect the text to reflect profound truths about the world, beyond what any human knows. That’s tricky because we don’t have any such texts in our database. But maybe if we put a special token in the 50 most clear and insightful journal articles ever written, and then stick that same token in our question prompt, then we’ll get better answers. That’s just an example; maybe there are other ways.
      
      (2) Forget about text prediction, and build an entirely separate input-output interface into the world model. The world model (if it’s vaguely brain-like) is “just” a data structure with billions of discrete concepts, and transformations between those concepts (composition, cause-effect, analogy, etc...probably all of those are built out of the same basic “transformation machinery”). All these concepts are sitting in the top layer of some kind of hierarchy, whose lowest layer consists of probability distributions over short snippets of text (for a language model, or more generally whatever the input is). So that’s the world model data structure. I have no idea how to build a new interface into this data structure, or what that interface would look like. But I can’t see why that should be impossible...