adamShimi comments on Transparency and AGI safety

adamShimi 13 Jan 2021 18:26 UTC
LW: 6 AF: 3
Thanks a lot for all the effort you put into this post! I don’t agree with everything, but reading and commenting on it was very stimulating, and probably useful for my own research.
In this post, I’ll argue that making AI systems more transparent could be very useful from a longtermist or AI safety point of view. I’ll then review recent progress on this centered around the circuits program being pursued by the Clarity team at OpenAI, and finally point out some directions for future work that could be interesting to pursue.
I’m quite curious about why you wrote this post. If it’s for convincing researchers in AI Safety that transparency is useful and important for AI Alignment, my impression is that many researchers do agree, and those who don’t tend to have thought about it for quite some time (Paul Christiano comes to mind, as someone who is less interested in transparency while knowing a decent amount about it). So if the goal was to convince people to care about transparency, I’m not sure this post was necessary.
I’m not saying I don’t find value in this post. As a big fan of the circuit research, I’m glad to have more in-depth discussion about and around it. I am simply trying to understand what you wanted to do with this post, to give you better feedback.
Artificial general intelligence (AGI) is usually more vaguely defined as an AI system that can do anything that humans can do. Here, I’ll operationally take it to mean “AI that can outperform humans at the task of generating qualitative insights into technical AI safety research,” for instance by coming up with new research agendas that turn out to be fruitful.
Nitpicking here, but I assume you mean coming up with a high enough proportion of new research agendas, instead of just coming up with some. That change removes stupid edge cases like programs writing all the permutations of some sentences about AGI, which would probably generate at least some useful ideas among the noise.
To summarize, even if one assumes the UAT in the current deep learning paradigm, the choice of model initialization + dataset that would let us get to AGI may be highly non-generic. To the extent that this is true, it pushes towards (indefinitely) longer timelines than forecasted in analyses based on compute, since practitioners might then have to understand an unknown number of qualitatively new things around setting initial conditions for the search problem.
I agree with the idea, with maybe the caveat that it doesn’t apply to Ems à la Hanson. A similar argument could hold about neuroscience facts we would need to know to scan and simulate brains, though.
Motivation #1: Work on transparency could help to reduce this uncertainty
This leads to a first motivation for transparency research: that getting a better understanding of how today’s AI systems work seems useful to let us make better-educated guesses about how they might scale up.
For example, learning more about how GPT-3 works seems like it could help us to reason better about whether the task of text prediction by itself could ever lead to AGI, which competes with the hard paths hypothesis. (For examples of the type of insights that we might hope to gain from applying transparency tools, see Part 2 of this note below.)
This argument applies to every part of ML that studies how learned models work and why. So by itself, it’s insufficient for privileging transparency over theoretical work on neural nets, for example.
From the machine learning (“lobotomized alien”) point of view, a natural way to partition the alignment problem is as
- an outer alignment (specification) problem of making sure that our systems are designed with utility / loss functions that would make them do what we intend for them to do in theory, and
- an inner alignment (distribution shift) problem of making sure that a system trained on a “theoretically correct” objective with a finite-sized training set would keep on pursuing that objective when deployed in a somewhat different environment than the one that it trained on.
From the more anthropomorphic “alien in a box” point of view on the other hand, one might instead find it natural to slice up the alignment problem into
- a competence (translation) problem of making the AI system learn to understand what we want, and
- an intent alignment problem of making the AI system care to do what we want, assuming that it understands us perfectly well.
I really like the way you present the two points of view and how they partition the alignment problem. It’s going to be quite useful for me. Notably, I almost always take the “lobotomized alien” perspective, but now I can remind myself to make that a choice and see if the “alien in a box” perspective is more appropriate. Thanks!
Motivation #2: Transparency seems necessary to guard against emergent misbehavior
This leads to a second motivation for transparency research, which is that to defend against emergent misbehavior in all situations that an agentic AGI could encounter when deployed, it seems necessary to me that we understand something about the AI’s internal cognition.
I completely agree with this motivation, and it is really well presented.
But we can’t guarantee ahead of time that adversarial training will catch every failure mode, and verification requires that we characterize the space of possible inputs, which seems hard to scale up to future AI systems with arbitrarily large input/output spaces [^9]. So this is in no way a proof but is a failure of my imagination otherwise (and I’d be very excited to hear about other ideas!).
My take on why verification might scale is that we will move towards specification of properties of the program instead of its input/output relation. So verifying whether the code satisfies some formal property that indicates myopia or low goal-directedness. Note that transparency is still really important here, because even with completely formal definitions of things like myopia and goal-directedness, I think transparency will be necessary to translate them into properties of the specific class of models studied (neural networks for example).
A third motivation is that exact transparency would give us a mulligan: a chance to check if something could go catastrophically wrong with a system that we’ve built before we decide to deploy it in the real world. E.g. suppose that just by looking at the weights of a neural network, we could read off all of the knowledge encoded inside the network. Then you could imagine looking into the “mind” of a paperclip-making AI system, seeing that for some reason it had been learning things related to making biological weapons, and deciding against letting it run your paperclip-making factory.
I think this misses a very big part of what makes a paperclip-maximizer dangerous: the fact that it can come up with catastrophic plans after it’s been deployed. So it doesn’t have to be explicitly deceptive and biding its time; it might just be really competent and focused on maximizing paperclips, which requires more than exact transparency to catch. It requires being able to check properties that ensure the catastrophic outcomes won’t happen.
But I still think your motivation makes sense for a part of deceptive alignment. My more general caveat is that I don’t believe in exact transparency, so I am more for a mixed transparency and verification approach (as mentioned above).
A minimal AI system that can write blog posts about AI safety, or otherwise do theoretical science research, doesn’t seem to require a large output space. It plausibly just needs to be able to write text into an offline word processor. This suggests that the first AGI may be close to what people have historically called an “Oracle AI”.
In my opinion, Oracle AIs already seem pretty safe by virtue of being well-boxed, without further qualification. If all they can do is write offline text, they would have to go through humans to cause an existential catastrophe. However, some might argue that a hypothetical Oracle AI that was very proficient at manipulating humans could trick its human handlers into taking dangerous actions on its behalf. So to strengthen the case, we should also appeal to selection pressure.
An AI that does AI Safety research is properly terrifying. I’m really stunned by this choice, as I think this is probably one of the most dangerous cases of oracle AI (and oracle AI is a pretty dangerous class by itself) that I can think of. I see two big problems with it:
- It looks like exactly the kind of task where, if we haven’t solved AI alignment in advance, Goodhart is upon us. What’s the measure? What’s the proxy? Best-case scenario: the AI is clearly optimizing something stupid, and nobody cares. Worst-case scenario, which is more probable because the AI is actually supposed to outperform humans: it pushes for something that looks like it makes sense but doesn’t actually work, and we might use these insights to build more advanced AGIs and be fucked.
- It’s quite simple to imagine a Predict-o-matic type scenario: pushing simpler and easier models that appear to work but don’t, so that its task becomes easier.
To finish this argument, we would need to characterize what’s needed to do AI safety research and argue that there exists a limited curriculum to impart that knowledge that wouldn’t lead to deceptive oracle AI. I don’t have a totally satisfactory argument for this (hence the earlier caveat!), but one bit of intuition in this direction is that the transparency agenda in the rest of this document certainly doesn’t require deep (or any) knowledge of humans. The same seems true of at least a subset of other safety agendas, and we need only argue that AI that could accelerate progress in some parts of technical AI safety or otherwise change how we do intellectual work will come before plausibly dangerous agent AIs to reconsider how much to invest in object-level AI safety work today (since then it might make sense to defer some of the work to future researchers). We don’t need to prove that a safe AGI oracle would solve the entire problem of AGI safety in one go.
I don’t think any of the intuitions given work, for a simple reason: even if the research agenda doesn’t require in itself any real knowledge of humans, the outputs still have to be humanly understandable. I want the AI to write blog posts that I can understand. So it will have to master clear writing, which seems from experience to require a lot of modeling of the other (and as a human, I get a bunch of things for free unconsciously, that an AI wouldn’t have, like a model of emotions).
Another issue with this proposal is that you’re saying on one side that the AI is superhuman at technical AI safety, and on the other hand that it can only do these specific proposals that don’t use anything about humans. That’s like saying that you have an AI that wins at any game, but in fact it only works for chess. Either the AI can do research on everything in AI Safety, and it will probably have to understand humans; or it is specifically for one research proposal, but then I don’t see why not create other AIs for other research proposals. The technology is available, and the incentives would be here (if only to be as productive as the other researchers who have an AI to help them).
Motivation #4: Work on transparency could still be instrumentally valuable in such a world
Even in such a world though, there are some non-AGI-safety reasons that transparency research could be well-motivated today.
I disagree with the previous argument, yet I find this motivation really useful, because if it’s also useful when things go correctly, that’s a good way to have people unconvinced by risks work on it.
Notably, I believe in pushing new entrants who want to do AI (and are not interested or ready to switch to AI Alignment) towards transparency, as this is a really useful subfield for alignment and it’s one of the parts of AI that push capabilities the least.
Summary of the technical approach
Even as someone who has read most of the circuits papers, I found this summary really clear and insightful. Your post might actually be where I redirect people for getting an idea of circuits!
One way to get some insight into these things might be to edit the network. To test the first thing, we could delete the “color detectors” and see how badly that degrades the performance of the black-and-white detector at finding black-and-white images, while to test the second one, we could delete the black-and-white circuit from InceptionV1, and see how badly that degrades the performance of InceptionV1 at transfer learning on the task of black-and-white vs. color image classification [^11]. It might be interesting to develop quantitative standards for checks along these lines.
These look great! I hadn’t thought about the issue you mention regarding modularity, but it seems really important to settle, and your proposals are ingenious ways to study the question.
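To make the first of the two quoted tests concrete, here is a minimal sketch of an ablation check. It is a toy, not the circuits methodology: the "network" scores a single RGB pixel, its weights are hand-set rather than learned, and the names `W_HIDDEN`, `COLOR_UNITS`, `is_black_and_white` and `accuracy` are all hypothetical. The only point is the shape of the experiment: measure performance, delete the suspected upstream units, and measure again.

```python
# Toy "black-and-white detector" on a single RGB pixel, with hand-set weights.
# Hidden units 0-5 are hypothetical "color detectors" (each fires on one
# channel difference); hidden unit 6 is a constant bias unit.
W_HIDDEN = [
    # r,    g,    b,  const
    [1.0, -1.0, 0.0, 0.0],   # relu(r - g)
    [-1.0, 1.0, 0.0, 0.0],   # relu(g - r)
    [0.0, 1.0, -1.0, 0.0],   # relu(g - b)
    [0.0, -1.0, 1.0, 0.0],   # relu(b - g)
    [1.0, 0.0, -1.0, 0.0],   # relu(r - b)
    [-1.0, 0.0, 1.0, 0.0],   # relu(b - r)
    [0.0, 0.0, 0.0, 1.0],    # bias unit, always 1
]
# Any detected color pushes the score down; the bias unit pushes it up.
W_OUT = [-1.0] * 6 + [0.1]
COLOR_UNITS = frozenset(range(6))


def is_black_and_white(pixel, ablate=frozenset()):
    """Return True if the toy network classifies the pixel as gray."""
    x = list(pixel) + [1.0]
    score = 0.0
    for j, (row, w_out) in enumerate(zip(W_HIDDEN, W_OUT)):
        if j in ablate:
            continue  # "delete" the unit: it contributes nothing downstream
        activation = max(0.0, sum(w * xi for w, xi in zip(row, x)))
        score += w_out * activation
    return score > 0.0


def accuracy(dataset, ablate=frozenset()):
    """Fraction of (pixel, label) pairs classified correctly."""
    correct = sum(is_black_and_white(p, ablate) == label for p, label in dataset)
    return correct / len(dataset)


# Five gray pixels (label True) and five clearly colored ones (label False).
DATASET = [((v, v, v), True) for v in (0.0, 0.25, 0.5, 0.75, 1.0)] + [
    ((1.0, 0.0, 0.0), False),
    ((0.0, 1.0, 0.0), False),
    ((0.0, 0.0, 1.0), False),
    ((0.9, 0.5, 0.1), False),
    ((0.2, 0.6, 0.6), False),
]

baseline = accuracy(DATASET)
ablated = accuracy(DATASET, ablate=COLOR_UNITS)
print(f"baseline={baseline:.2f} ablated={ablated:.2f} drop={baseline - ablated:.2f}")
```

In this toy the detector depends entirely on the color units, so ablating them collapses it to always answering "black-and-white". In a real network the interesting quantity would be how large the drop is, which is where the quantitative standards mentioned in the post would come in.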
Compare circuits work to existing work on loss landscapes in deep learning. Another strategy might be to go through the literature of existing results from other perspectives and look for synergies with the circuits approach. As a semi-random example, the linked paper in the previous bullet-point suggests that some directions in the loss landscape are more important than others; it might be interesting to understand if such directions play an interesting role from the circuits POV.
I’m especially interested in this direction, as it seems highly relevant to my own research on gradient hacking.
- jylin04 14 Jan 2021 14:42 UTC
  LW: 5 AF: 2
  Thanks a lot for all the effort you put into this post! I don’t agree with anything, but reading and commenting it was very stimulating, and probably useful for my own research.
  Likewise, thanks for taking the time to write such a long comment! And hoping that’s a typo in the second sentence :)
  I’m quite curious about why you wrote this post. If it’s for convincing researchers in AI Safety that transparency is useful and important for AI Alignment, my impression is that many researchers do agree, and those who don’t tend to have thought about it for quite some time (Paul Christiano comes to mind, as someone who is less interested in transparency while knowing a decent amount about it). So if the goal was to convince people to care about transparency, I’m not sure this post was necessary.
  Fair enough! Since I’m pretty new to thinking about this stuff, my main goal was to convince myself and organize my own thoughts around this topic. I find that writing a review is often a good way to get up to speed on something. Then once I’d written it, it seemed like I might as well post it somewhere.
  Wrt the community though, I’d be especially curious to get more feedback on Motivation #2. Do people not agree that transparency is *necessary* for AI Safety? And if they do agree, then why aren’t more people working on it?
  I agree with the idea, with maybe the caveat that it doesn’t apply to Ems à la Hanson. A similar argument could hold about neuroscience facts we would need to know to scan and simulate brains, though.
  Yeah, I’d add that even if we had a similar hardware-based forecast for mapping the human connectome, there would still be a lot that we don’t know about dynamics there too. I have the impression that basically all ways to forecast things in this space have to make some non-obvious (to me) assumption that business as usual will scale up to strong AI without a need for qualitative breakthroughs.
  My take on why verification might scale is that we will move towards specification of properties of the program instead of its input/output relation. So verifying whether the code satisfies some formal property that indicates myopia or low goal-directedness. Note that transparency is still really important here, because even with completely formal definitions of things like myopia and goal-directedness, I think transparency will be necessary to translate them into properties of the specific class of models studied (neural networks for example).
  I agree, but think that transparency is doing most of the work there (i.e. what you say sounds more to me like an application of transparency than scaling up the way that verification is used in current models.) But this is just semantics.
  I think this misses a very big part of what makes a paperclip-maximizer dangerous: the fact that it can come up with catastrophic plans after it’s been deployed. So it doesn’t have to be explicitly deceptive and biding its time; it might just be really competent and focused on maximizing paperclips, which requires more than exact transparency to catch. It requires being able to check properties that ensure the catastrophic outcomes won’t happen.
  Hm, I want to disagree, but this may just come down to a difference in what we mean by deployment. In the paragraph that you quoted, I was imagining the usual train/deploy split from ML where deployment means that we’ve frozen the weights of our AI and prohibit further learning from taking place. In that case, I’d like to emphasize that there’s a difference between intelligence as a meta-ability to acquire new capabilities and a system’s actual capabilities at a given time. Even if an AI is superintelligent, i.e. able to write new information into its weights extremely efficiently, once those weights are fixed, it can only reason and plan using whatever object-level knowledge was encoded in them up to that point. So if there was nothing about bio weapons in the weights when we froze them, then we wouldn’t expect the paperclip-maximizer to spontaneously make plans involving bio weapons when deployed.
  On the other hand, none of this would apply to the “alien in a box” model that would basically be continuously training by my definition (though in that case, we could still patch the solution by monitoring the AI in real time). So maybe it was a poor choice of words.
  An AI that does AI Safety research is properly terrifying. I’m really stunned by this choice, as I think this is probably one of the most dangerous cases of oracle AI that I can think of. I see two big problems with it:
  It looks like exactly the kind of task where, if we haven’t solved AI alignment in advance, Goodhart is upon us. What’s the measure? What’s the proxy? Best-case scenario: the AI is clearly optimizing something stupid, and nobody cares. Worst-case scenario, which is more probable because the AI is actually supposed to outperform humans: it pushes for something that looks like it makes sense but doesn’t actually work, and we might use these insights to build more advanced AGIs and be fucked.
  It’s quite simple to imagine a Predict-o-matic type scenario: pushing simpler and easier models that appear to work but don’t, so that its task becomes easier.
  I don’t think any of the intuitions given work, for a simple reason: even if the research agenda doesn’t require in itself any real knowledge of humans, the outputs still have to be humanly understandable. I want the AI to write blog posts that I can understand. So it will have to master clear writing, which seems from experience to require a lot of modeling of the other (and as a human, I get a bunch of things for free unconsciously, that an AI wouldn’t have, like a model of emotions).
  These two comments seem related so let me reply to them together. I think what you’re asking here is “how can we be sure that a “research accelerator” AI, trained to help with a self-contained AI safety agenda such as transparency, will produce solutions that we can understand before we implement them [so as to avoid getting tricked into implementing something that turns out to be bad, as in your first quote]?” And I would answer that I’ve made an assumption that knowledge is universal and new ideas are discovered by incrementally building on existing ones. This is why basically any student today knows more about science than the smartest people from a century ago, and on the flip side, I think it would constrain how far beyond us the insights from early AGIs trained on our work could be. Suppose an AI system was trained on a dataset of existing transparency papers to come up with new project ideas in transparency. Then its first outputs would probably use words like neurons and weights instead of some totally incomprehensible concepts, since those would be the very same concepts that would let it efficiently make sense of its training set. And new ideas about neurons and weights would then be things that we could independently reason about even if they’re very clever ideas that we didn’t think of ourselves, just like you and I can have a conversation about circuits even if we didn’t come up with it.
  Another issue with this proposal is that you’re saying on one side that the AI is superhuman at technical AI safety, and on the other hand that it can only do these specific proposals that don’t use anything about humans. That’s like saying that you have an AI that wins at any game, but in fact it only works for chess. Either the AI can do research on everything in AI Safety, and it will probably have to understand humans; or it is specifically for one research proposal, but then I don’t see why not create other AIs for other research proposals. The technology is available, and the incentives would be here (if only to be as productive as the other researchers who have an AI to help them).
  Agree that there’s a (strong!) assumption being made that “research accelerators for narrow agendas” will come before potentially dangerous AI systems. I think this might actually be a weak point of my story. Rohin asked something similar in the second bullet-point of his comment so I’ll try to answer there...
  - adamShimi 17 Jan 2021 19:18 UTC
    LW: 2 AF: 1
    Likewise, thanks for taking the time to write such a long comment! And hoping that’s a typo in the second sentence :)
    You’re welcome. And yes, this was a typo that I corrected. ^^
    Wrt the community though, I’d be especially curious to get more feedback on Motivation #2. Do people not agree that transparency is *necessary* for AI Safety? And if they do agree, then why aren’t more people working on it?
    My take is that a lot of people around here agree that transparency is at least useful, and maybe necessary. And the main reason why people are not working on it is a mix of personal fit, and the fact that without research in AI Alignment proper, transparency doesn’t seem that useful (if we don’t know what to look for).
    I agree, but think that transparency is doing most of the work there (i.e. what you say sounds more to me like an application of transparency than scaling up the way that verification is used in current models.) But this is just semantics.
    Well, transparency is doing some work, but it’s totally unable to prove anything. That’s a big part of the approach I’m proposing. That being said, I agree that this doesn’t look like scaling the current way.
    Hm, I want to disagree, but this may just come down to a difference in what we mean by deployment. In the paragraph that you quoted, I was imagining the usual train/deploy split from ML where deployment means that we’ve frozen the weights of our AI and prohibit further learning from taking place. In that case, I’d like to emphasize that there’s a difference between intelligence as a meta-ability to acquire new capabilities and a system’s actual capabilities at a given time. Even if an AI is superintelligent, i.e. able to write new information into its weights extremely efficiently, once those weights are fixed, it can only reason and plan using whatever object-level knowledge was encoded in them up to that point. So if there was nothing about bio weapons in the weights when we froze them, then we wouldn’t expect the paperclip-maximizer to spontaneously make plans involving bio weapons when deployed.
    You’re right that I was thinking of a more online system that could update its weights during deployment. Yet even with frozen weights, I definitely expect the model to make plans involving things that were not explicitly represented in those weights. For example, it might not have a bio-weapon feature, but still have the relevant subfeatures to build one through quite local rules that don’t look like a plan to build a bio-weapon.
    Suppose an AI system was trained on a dataset of existing transparency papers to come up with new project ideas in transparency. Then its first outputs would probably use words like neurons and weights instead of some totally incomprehensible concepts, since those would be the very same concepts that would let it efficiently make sense of its training set. And new ideas about neurons and weights would then be things that we could independently reason about even if they’re very clever ideas that we didn’t think of ourselves, just like you and I can have a conversation about circuits even if we didn’t come up with it.
    That seems reasonable.