Sort of a side point, but something that’s been helpful to me in this post and others over the past year is reconceptualizing the Fast/Slow takeoff dichotomy as “Continuous” vs. “Hard” takeoff, which suggests different strategic considerations. This particular post helped flesh out some of my models of what considerations are at play.
Is this a correct summary of the final point: “either this doesn’t really impact the field, so it doesn’t increase capabilities; or, it successfully moves the ML field from ‘everything is opaque and terrifying’ to ‘people are at least trying to build models of what their software is doing’, which is net positive for getting good practices for alignment into the mainstream”?
That’s an interesting and clever point (although it triggers some sort of “clever argument” safeguard that makes me cautious of it). The main counterpoint that comes to mind is a possible world where “opaque AIs” just can’t ever achieve general intelligence, but moderately well-thought-out AI designs can bridge the gap to “general intelligence/agency” without being reliable enough to be aligned.
Yep, I think that’s a correct summary of the final point.
The main counterpoint that comes to mind is a possible world where “opaque AIs” just can’t ever achieve general intelligence, but moderately well-thought-out AI designs can bridge the gap to “general intelligence/agency” without being reliable enough to be aligned.
Well, we know it’s possible to achieve general intelligence via dumb black box search—evolution did it—and we’ve got lots of evidence for current black box approaches being quite powerful. So it seems unlikely to me that we “just can’t ever achieve general intelligence” with black box approaches, though it could be that doing so is much more difficult than if you have more of an understanding.
Also, ease of aligning a particular AI design is a relative property, not an absolute one. When you say transparent approaches might not be “reliable enough to be aligned,” you could mean that they’ll be just as likely as black box approaches to be aligned, less likely, or that they won’t be able to meet some benchmark threshold probability of safety. I would guess that transparency increases the probability of alignment relative to not having it, though it’s currently unclear by how much.
The way I generally like to think about this is that there are many possible roads to AGI, some more alignable than others, and some shorter than others. The argument here, then, is that transparency research opens up additional avenues which are more alignable, but which may be shorter or longer. Even if they’re shorter, however, the fact that they’re more alignable means this is still a win: even if we end up taking the fastest path without regard to safety, making the fastest path available to us a safer one is a win.
One thing I’d add, in addition to Evan’s comments, is that the present ML paradigm and Neural Architecture Search are formidable competitors. It feels like there’s a big gap in effectiveness, where we’d need to make lots of progress for “principled model design” to be competitive with them in a serious way. The gap causes me to believe that we’ll have (and already have had) significant returns on interpretability before we see capabilities acceleration. If it felt like interpretability was accelerating capabilities on the present margin, I’d be a bit more cautious about this type of argumentation.
(To date, I think the best candidate for a capabilities success case from this approach is Deconvolution and Checkerboard Artifacts. I think it’s striking that the success was less about improving a traditional benchmark, and more about getting models to do what we intend.)
What if we think about it the following way? ML researchers range from _theorists_ (who try to produce theories that describe how ML/AI/intelligence works at the deep level and how to build it) to _experimenters_ (who put things together using some theory and lots of trial and error and try to make it perform well on the benchmarks). Most people will be somewhere in between on this spectrum but people focusing on interpretability will be further towards theorists than most of the field.
Now let’s say we boost the theorists and they produce a lot of explanations that make better sense of the state of the art that experimenters have been playing with. The immediate impact is improved understanding of our best models, which is good for safety. However, when the experimenters read these papers, their search space (of architectures, hyperparameters, training regimes, etc.) is reduced and they can now search more efficiently. Standing on the shoulders of the new theories, they produce even better-performing models (though still incorporating a lot of trial and error, because that is what experimenters do).
So what we’ve achieved is a better understanding of the current state-of-the-art models combined with a new, improved state of the art that we still don’t quite understand. It’s not immediately clear whether we’re better off this way. Or is this model too coarse to see what’s going on?
That’s an interesting and clever point (although it triggers some sort of “clever argument” safeguard that makes me cautious of it).
I think it shouldn’t be in the “clever argument” category, and the only reason it feels like that is because you’re using the capabilities-alignment framework.
Consider instead this worldview:
The way you build things that are useful and do what you want is to understand how things work and put them together in a deliberate way. If you put things together randomly, they either won’t work, or will have unintended side effects.
(This worldview can apply to far more than AI; e.g. it seems right in basically every STEM field. You might argue that putting things together randomly seems to work surprisingly well in AI, to which I say that it really doesn’t, you just don’t see all of the effort where you put things together randomly and it simply flat-out fails.)
The argument “it’s good for people to understand AI techniques better even if it accelerates AGI” is a very straightforward non-clever consequence of this worldview.
Somewhat more broadly, I recommend being able to inhabit this other worldview. I expect it to be more useful / accurate than the capabilities / alignment worldview.
(Disclaimer: I believed this point before this post—in fact I had several conversations with people about it back in May, when I was considering a project with potential effects along these lines.)
I’m not sure I understand the difference between this worldview and my own. (The italicized phrase in your comment seemed fairly integral to how I was thinking about alignment/capabilities in the first place.)
This recent comment of yours seems more relevant as far as worldview differences go, i.e. ‘if you expect discontinuous takeoff, then transparency is unlikely to do what you want’. (Some slightly vaguer disagreement about “what counts as a clever argument” might be relevant too, although I’m not sure I can state my worry crisply, nor am I really confident that my worry is cogent.)
I don’t have a strong position on the continuous/hard-takeoff debate, and over the past year I’ve updated a bit both on the plausibility of continuous takeoff and on the value of shifting the AI field towards having clearer models of what it’s building. But insofar as I’m suspicious of this, it’s mostly because I still put moderate probability on “some understanding here may be more dangerous than no understanding, precisely because it’s enough to accomplish some things without accomplishing everything that you needed to.”
some understanding here may be more dangerous than no understanding, precisely because it’s enough to accomplish some things without accomplishing everything that you needed to.
Fwiw, under the worldview I’m outlining, this sounds like a “clever argument” to me, that I would expect on priors to be less likely to be true, regardless of my position on takeoff. (Takeoff does matter, in that I expect that this worldview is not very accurate/good if there’s discontinuous takeoff, but imputing the worldview I don’t think takeoff matters.)
I often think of this as penalizing nth-order effects in proportion to some quickly-growing function of n. (Warning: I’m using the phrase “nth-order effects” in a non-standard, non-technical way.)
Under the worldview I mentioned, the first-order effect of better understanding of AI systems, is that you are more likely to build AI systems that are useful and do what you want.
The second-order effect is “maybe there’s a regime where you can build capable-but-not-safe things; if we’re currently below that, it’s bad to go up into that regime”. This requires a more complicated model of the world (given this worldview) and more assumptions about where we are.
(Also, now that I’ve written this out, the model also predicts there’s no chance of solving alignment, because we’ll first reach the capable-but-not-safe things, and die. Probably the best thing to do on this model is to race ahead on understanding as fast as possible, and hope we leapfrog directly to the capable-and-safe regime? Or you work on understanding AI in secret, and only release once you know how to do capable-and-safe, so that no one has the chance to work on capable-but-not-safe? You can see why this argument feels a bit off under the worldview I outlined.)
Takeoff does matter, in that I expect that this worldview is not very accurate/good if there’s discontinuous takeoff, but imputing the worldview I don’t think takeoff matters.
Minor question: could you clarify what you mean by “imputing the worldview” here? Do you mean something like, “operating within the worldview”? (I ask because this doesn’t seem to be a use of “impute” that I’m familiar with.)
Do you mean something like, “operating within the worldview”?
Basically yes. Longer version: “Suppose we were in scenario X. Normally, in such a scenario, I would discard this worldview, or put low weight on it, because reason Y. But suppose by fiat that I continue to use the worldview, with no other changes made to scenario X. Then …”
It’s meant to be analogous to imputing a value in a causal Bayes net, where you simply “suppose” that some event happened, and don’t update on anything causally upstream, but only reason forward about things that are causally downstream. (I seem to recall Scott Garrabrant writing a good post on this, but I can’t find it now. ETA: Found it, it’s here, but it doesn’t use the term “impute” at all. I’m now worried that I literally made up the term, and it doesn’t actually have any existing technical meaning.)
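The distinction being drawn here — fixing a variable’s value without updating anything causally upstream, while still reasoning forward about downstream effects — is the same as intervening (Pearl’s do-operator) rather than conditioning in a causal Bayes net. A minimal two-node sketch, with the Rain → WetGrass variables and all the numbers invented purely for illustration:

```python
# A two-node causal Bayes net: Rain -> WetGrass.
# Conditioning on an observation of WetGrass updates the upstream
# variable Rain; intervening (do(WetGrass=True)) severs the incoming
# edge and leaves P(Rain) unchanged.

P_RAIN = 0.2
# P(WetGrass = True | Rain)
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}

def p_rain_given_wet_conditioning():
    """Observe WetGrass=True: update P(Rain) by Bayes' rule."""
    p_wet = sum(
        (P_RAIN if rain else 1 - P_RAIN) * P_WET_GIVEN_RAIN[rain]
        for rain in (True, False)
    )
    return P_RAIN * P_WET_GIVEN_RAIN[True] / p_wet

def p_rain_given_wet_intervention():
    """do(WetGrass=True): cut the edge into WetGrass; Rain is untouched."""
    return P_RAIN

print(p_rain_given_wet_conditioning())   # ~0.69: observation updates upstream
print(p_rain_given_wet_intervention())   # 0.2: intervention does not
```

The “impute the worldview” move above corresponds to the second function: suppose the worldview holds by fiat, don’t propagate that supposition back to its usual preconditions (e.g. takeoff shape), and reason only about what follows from it.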
I expect it to be more useful / accurate than the capabilities / alignment worldview.
To note, I sort of interpreted the capabilities/alignment tradeoff as more related to things that enhance capabilities while providing essentially no greater understanding. Increasing compute is the primary example I can think of.
I was going to write a comment here, but it got a bit long so I made a post instead.
Aha! I thought it might be borrowing language from some technical term I wasn’t familiar with. Thanks!