Planned summary for the Alignment Newsletter:
This post identifies four different motivations for working on transparency:
1. By learning more about how current neural networks work, we can improve our forecasts for AI timelines.
2. It seems _necessary_ for inner alignment. In particular, whatever AI development model you take, it seems likely that there will be some possibility of emergent misbehavior, and there doesn’t yet seem to be a way to rule that out except via transparency.
3. A good solution to transparency would be _sufficient_ for safety, since we could at least notice when AI systems were misaligned, and then choose not to deploy them.
4. Even if AI will “go well by default”, there are still instrumental reasons for transparency, such as improving cause prioritization in EA (via point 1) and making systems more capable and robust.
After reviewing work on <@circuits@>(@Thread: Circuits@), the post suggests a few directions for future research:
1. Investigating how modular neural networks tend to be,
2. Figuring out how to make transparency outputs more precise and less subjective,
3. Looking for circuits in other networks (i.e. not image classifiers); see e.g. <@RL vision@>(@Understanding RL Vision@),
4. Figuring out how transparency fits into an end-to-end story for AI safety.
Planned opinion:
I’m on board with motivations 2 and 3 for working on transparency (that is, that it directly helps with AI safety). I agree with motivation 4, but for a different reason than the post mentions. If by “AI goes well by default” we mean that AI systems are trying to help their users, that still leaves the AI governance problem. It seems that making these AI systems transparent would be significantly helpful; it would probably enable a “right to an explanation”, for example. I don’t agree with motivation 1 as much: if I wanted to improve AI timeline forecasts, there are a lot of other aspects I would investigate first. (Specifically, I’d improve estimates of inputs into <@this report@>(@Draft report on AI timelines@).) Part of this is that I am less uncertain than the author about the cruxes that transparency could help with, and so see less value in investigating them further.
Other comments (i.e. not going in the newsletter):
I feel like the most interesting parts of this post are exactly the parts I didn’t summarize. (I chose not to summarize them because they felt quite preliminary, in that I couldn’t really extract a key argument or insight that I believed.) But some thoughts about them:
The “alien in a box” hypothetical made sense to me (mostly), but I didn’t understand the “lobotomized alien” hypothetical. I also didn’t see how this was meant to be analogous to machine learning. One concrete question: why are we assuming that we can separate out the motivational aspect of the brain? (That’s not my only confusion, but I’m having a harder time explaining other confusions.)
It feels like your non-agentic argument is too dependent on how you defined “AGI”. I can believe that the first powerful research accelerator will be limited to language, but that doesn’t mean that other AI systems deployed at the same time will be limited to language.
It seems like there’s a pretty clear argument for language models to be deceptive—the “default” way to train them is to have them produce outputs that humans like; this optimizes for being convincing to humans, which is not necessarily the same as being true. (However, it’s more plausible to me that the first such model won’t cause catastrophic risk, which would still be enough for your conclusions.)
Thanks Rohin! Agree with and appreciate the summary as I mentioned before.
> I don’t agree with motivation 1 as much: if I wanted to improve AI timeline forecasts, there are a lot of other aspects I would investigate first. (Specifically, I’d improve estimates of inputs into <@this report@>(@Draft report on AI timelines@).) Part of this is that I am less uncertain than the author about the cruxes that transparency could help with, and so see less value in investigating them further.
I’m curious: does this mean that you’re on board with the assumption in Ajeya’s report that 2020 algorithms and datasets + “business as usual” in algorithm and dataset design will scale up to strong AI, with compute being the bottleneck? I feel both uncertain about this assumption and uncertain about how to update on it one way or the other. (But this probably belongs more in a discussion of that report and is kind of off topic here.)
> The “alien in a box” hypothetical made sense to me (mostly), but I didn’t understand the “lobotomized alien” hypothetical. I also didn’t see how this was meant to be analogous to machine learning. One concrete question: why are we assuming that we can separate out the motivational aspect of the brain? (That’s not my only confusion, but I’m having a harder time explaining other confusions.)
A more concrete version of the “lobotomized alien” hypothetical might be something like this: There’s a neuroscience model that sometimes gets discussed around here, on which human cognition works by the neocortex running some sort of generative model, with a loss function that’s modulated by stuff going on in the midbrain (see e.g. https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain). Suppose that you buy this theory, and now suppose that we’re the AIs being trained in a simulation by a more advanced alien race. Then one way that the aliens could try to get us to do stuff for them might be to reinstantiate just a human neocortex and train it from scratch on a loss function + dataset of their choice, as some sort of souped-up unsupervised learning algorithm.
In this example, I’m definitely just assuming by fiat that the cognition and motivation parts of the brain are well-separated (and moreover, that the aliens are able to discover this, say by applying some coarse-grained transparency tools). So it’s just a toy model for how things *could* go, not necessarily how they *will* go.
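To make the toy model a bit more concrete in code, here is a minimal, purely illustrative sketch (my addition, not from the post or this comment) of what “train just the cognition module from scratch under an externally chosen objective” could look like; `Cognition`, `train_cognition`, and `alien_loss` are hypothetical names, and the clean cognition/motivation split is exactly the assumption being granted by fiat above.

```python
import torch
import torch.nn as nn

# Hypothetical "cognition" module: a generic learner with no built-in goals.
class Cognition(nn.Module):
    def __init__(self, d_in: int, d_out: int, d_hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# The "motivation" lives entirely outside the module: the trainers supply the
# loss function and the dataset, analogous to the midbrain modulating the
# neocortex's objective in the toy neuroscience picture.
def train_cognition(model, loss_fn, data, epochs: int = 10, lr: float = 1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss = loss_fn(model(x), y)  # externally chosen objective
            loss.backward()
            opt.step()
    return model

# Example: the "aliens" pick a plain regression objective and a toy dataset.
alien_loss = nn.MSELoss()
xs, ys = torch.randn(32, 8), torch.randn(32, 1)
train_cognition(Cognition(8, 1), alien_loss, [(xs, ys)])
```

The only point of the sketch is that the objective is supplied entirely by the training setup rather than being part of the learned module, which is the by-fiat separation the hypothetical relies on.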
> It feels like your non-agentic argument is too dependent on how you defined “AGI”. I can believe that the first powerful research accelerator will be limited to language, but that doesn’t mean that other AI systems deployed at the same time will be limited to language.
Hmm. I think I agree that this is a weak point of the argument and it’s not clear how to patch it. I think my intuition was something like this: even once we have some sort of pretrained AGI algorithm (like an RL agent trained in simulation), we would have to fine-tune it on real-world tasks one at a time by coming up with a curriculum for each of those tasks; this seems easier to do for simple bounded tasks than for more open-ended ones (though in some sense that needs to be made more precise, and is maybe already assuming some things about alignment); and “research acceleration” seems like a much narrower task with a relatively well-defined training set of papers, books, etc. than “AI agent that competently runs a company”, so might still come first on those grounds. But even then it would have to come first by a large enough margin for insights from the research accelerator to actually be implemented, for this argument to work. So there’s at least a gap there...
> It seems like there’s a pretty clear argument for language models to be deceptive—the “default” way to train them is to have them produce outputs that humans like; this optimizes for being convincing to humans, which is not necessarily the same as being true. (However, it’s more plausible to me that the first such model won’t cause catastrophic risk, which would still be enough for your conclusions.)
Yeah, fair enough. I should have said that I don’t see a path for language models to get selection pressure in the direction of being catastrophically deceptive like in the old “AI getting out of the box” stories, so I think we agree.
> I’m curious: does this mean that you’re on board with the assumption in Ajeya’s report that 2020 algorithms and datasets + “business as usual” in algorithm and dataset design will scale up to strong AI, with compute being the bottleneck?
Yes, with the exception that I don’t know if compute will be the bottleneck (that is my best guess; I think Ajeya’s report makes a good case for it; but I could see other factors being the bottleneck as well).
I think the case for it is basically “we see a bunch of very predictable performance lines; seems like they’ll continue to go up”. But more importantly I don’t know of any compelling counterpoints; the usual argument seems to be “but we don’t see any causal reasoning / abstraction / <insert property here> yet”, which I think is perfectly compatible with the scaling hypothesis (see e.g. this comment).
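As a purely illustrative aside (my addition, not part of the comment): the “very predictable performance lines” are typically power-law fits of loss against compute, parameters, or data, and extrapolating one looks roughly like the sketch below. The data and coefficients here are made up, and whether such extrapolations keep holding is exactly the uncertainty under discussion.

```python
import numpy as np

# Made-up (compute, loss) observations that follow a power law
# L(C) = a * C**(-b), plus a little noise; scaling-law papers fit
# curves of this general shape to measured training runs.
rng = np.random.default_rng(0)
compute = np.logspace(15, 21, 10)  # hypothetical FLOPs
loss = 50.0 * compute ** (-0.05) * np.exp(rng.normal(0, 0.01, 10))

# Fit a straight line in log-log space: log L = log a - b * log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope

# Extrapolate the fitted trend to a much larger compute budget.
predicted = a * (1e24 ** (-b))
print(f"fitted L(C) ~= {a:.2f} * C^(-{b:.3f}); extrapolated L(1e24) ~= {predicted:.2f}")
```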
> A more concrete version of the “lobotomized alien” hypothetical might be something like this
I see, that helps, and I think it does work as an intuition pump for what the “ML paradigm” is trying to do (though, as you sort of mentioned, I don’t expect that we can just do the motivation / cognition decomposition).
> “research acceleration” seems like a much narrower task with a relatively well-defined training set of papers, books, etc. than “AI agent that competently runs a company”, so might still come first on those grounds.
Definitely depends on how powerful you’re expecting the AI system to be. It seems like if you want to make the argument that AI will go well by default, you need the research accelerator to be quite powerful (or you have to combine with some argument like “AI alignment will be easy to solve”).
I don’t think papers, books, etc. are a “relatively well-defined training set”. They’re a good source of knowledge, but if you imitate papers and books, you get a research accelerator that is limited by the capabilities of human scientists (well, actually much more limited, since it can’t run experiments). They might be a good source of pretraining data, but there would still be a lot of work to do to get a very powerful research accelerator.
> I should have said that I don’t see a path for language models to get selection pressure in the direction of being catastrophically deceptive like in the old “AI getting out of the box” stories, so I think we agree.
Fwiw I’m not convinced that we avoid catastrophic deception either, but my thoughts here are pretty nebulous and I think that “we don’t know of a path to catastrophic deception” is a defensible position.