As someone with limited knowledge of AI or alignment, I found this post accessible. There were times when I thought I knew vaguely what Nate meant but would not have been able to explain it, so I’m recording my confusions here to come back to when I’ve read up more. (If anyone wants to answer any of these r/NoStupidQuestions questions, that would be very helpful too.)
1. “Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent.” This is something that comes up in response to a few of the plans. Is the idea that during training, for advanced enough AIs, capabilities gains come both from gradient descent and from processing input / interacting with the world? Or does the second part only happen after it has finished training? What does that concretely look like in ML?
2. Is a lot of the disagreement about these plans just because others find the idea of a “sharp left turn” more unlikely than Nate does, or is there more agreement about that idea, with the disagreement being about which proposals might give us a shot at solving it?
3. What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like besides just trying harder at interpretability?
4. Another explanation of the “sharp left turn” would also be really helpful to me. At the moment, it feels like I can only explain why that happens by using analogies to humans/apes rather than being able to give a clear explanation for why we should expect that by default, using ML/alignment language.
What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like besides just trying harder at interpretability?
Some key pieces...
Desideratum 1: we need to aim for some kind of interpretability which will carry over across architectural/training paradigm changes, internal ontology shifts at runtime, etc. The tools need to work without needing a lot of new investment every time there’s a big change.
In my own approach, that’s what Selection Theorems would give us: theorems which characterize certain interpretable internal structures as instrumentally convergent across a wide range of architectures/internal ontologies.
Desideratum 2: we need to be able to robustly tie the internal structures identified to some kind of high-level human-interpretable “things”. The “things” could be mathematical, e.g. we might aim to robustly recognize embedded search processes or embedded world models. Or, the “things” could be real-world things, e.g. we might aim to robustly recognize embedded representations of natural abstractions from the environment (and the natural abstractions in the environment to which the representations correspond). Either way, this would have to involve more than just a bunch of proxies which are vaguely correlated with the human-intuitive concept(s); the correspondence both between learned representation and mathematical/real-world structure, and between human concept and mathematical/real-world structure, would have to be highly robust.
In my own approach, that’s what the formalization of natural abstractions would give us: theorems which let us robustly talk about the things-which-embedded-representations-represent, in a way which also ties those things to human concepts.
Desideratum 3: we need to somehow guarantee that there’s no important/dangerous cognitive work routing around the interpretable structures. E.g. if we’re aiming to recognize embedded search processes, we need to somehow guarantee that there’s no optimization performed in a way which would circumvent things-recognized-by-our-search-process-interpretability-tool. Or if we’re aiming to recognize representations of natural abstractions in general, then we need to somehow guarantee that no important/dangerous cognitive work is routing through channels other than those concepts.
The natural abstraction framework fits this desideratum particularly well, since it directly talks about abstractions which summarize all the information relevant at a distance. There are no capabilities to be gained by using non-natural abstractions.
Finally, one thing which is not a desideratum but is an important barrier which most current interpretability work fails to tackle: interpretability is not compositional/reductive. If I understand each of 100 parts in isolation, that does not mean that I understand a system consisting of those 100 parts together. (If interpretability were compositional/reductive, then we’d already understand neural nets just fine, because individual neurons and weights are very simple!)
For 1—In humans, there’s the distinction between evolution-as-a-learning-algorithm versus within-lifetime learning. There’s some difference of opinion about which of those two slots will be occupied by the PyTorch code comprising our future AGI—the RFLO model says that this code will be doing something analogous to evolution, I say it will be doing something analogous to within-lifetime learning, see my discussion here.
My impression (from their writings) is that Nate & Eliezer are firmly in the former RFLO/evolution camp. If that’s your picture, then within-lifetime learning is a thing that happens inside a learned black box, and thus it’s a big step removed from the gradient descent (imagine: the outer-loop evolution-like gradient descent tweaks the weights, then the trained model thinks and acts and learns and grows and plans for a billion subjective seconds, then the outer-loop evolution-like gradient descent tweaks the weights, then the trained model thinks and acts and learns and grows and plans for a billion subjective seconds…). Then a “sharp left turn” could happen between gradient-descent steps, for example.
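To make the outer-loop/inner-loop picture a bit more concrete, here is a toy sketch in plain Python (not PyTorch, and not anyone's actual proposal). Everything in it is an assumption made for illustration: the names `inner_lifetime` and `outer_step`, the scalar "task", and the finite-difference gradient standing in for backprop. The only point is that the error drops a lot inside a single call to `inner_lifetime`, during which the outer loop makes no update at all.

```python
def inner_lifetime(inner_lr, target, steps=50):
    """Hypothetical 'within-lifetime learning': the model refines an estimate
    with its own update rule. The outer loop does not touch anything in here."""
    estimate = 0.0
    for _ in range(steps):
        estimate += inner_lr * (target - estimate)
    return abs(target - estimate)  # end-of-lifetime error (lower = more capable)


def outer_step(inner_lr, target, outer_lr=1e-4, eps=1e-5):
    """One evolution-like outer step: nudge the single learned parameter
    (here, the inner learning rate) to reduce end-of-lifetime error.
    A finite-difference gradient is used as a stand-in for backprop."""
    grad = (inner_lifetime(inner_lr + eps, target)
            - inner_lifetime(inner_lr - eps, target)) / (2 * eps)
    return inner_lr - outer_lr * grad


if __name__ == "__main__":
    lr, target = 0.01, 10.0
    for step in range(3):
        end_error = inner_lifetime(lr, target)
        print(f"outer step {step}: inner_lr={lr:.4f}, "
              f"error at start={target:.2f}, error at end of lifetime={end_error:.2f}")
        lr = outer_step(lr, target)
```

In this toy, the gap between "error at start" and "error at end of lifetime" is capability gained between outer-loop steps, which is roughly the situation the quoted sentence in question 1 is pointing at.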
In my model, the human-written AGI PyTorch code is instead analogous to within-lifetime learning in humans, and it looks kinda like actor-critic model-based RL. There’s still some gradient descent, but the loss function is not directly “performance”, instead it’s things like self-supervised learning, and then there are also non-gradient-descent things like TD learning too. “Sharp left turns” don’t show up in my picture, at least not the same way. Or I guess, maybe instead of just one “sharp left turn”, the training process would have millions of “sharp left turns” as it keeps learning new things about the world (e.g. learning object permanence, learning that it’s an AGI running on a computer, learning physics, etc.), and each of these is almost guaranteed to help capabilities, but can potentially screw up alignment.
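For readers who have not seen TD learning, here is a minimal tabular TD(0) sketch. It assumes a hypothetical three-state chain; `td0_chain`, its reward scheme, and its parameters are made up for illustration and are far simpler than the actor-critic model-based RL setup mentioned above. Each value estimate is nudged toward the reward plus the discounted estimate of the next state, and the update is written directly as a rule rather than obtained by differentiating a loss.

```python
def td0_chain(num_states=3, episodes=200, alpha=0.5, gamma=0.9):
    """Tabular TD(0) on a deterministic chain: state i moves to state i+1,
    and leaving the last state ends the episode with reward 1."""
    values = [0.0] * num_states
    for _ in range(episodes):
        for s in range(num_states):
            last = (s == num_states - 1)
            reward = 1.0 if last else 0.0
            next_value = 0.0 if last else values[s + 1]
            # TD error: how much the bootstrapped target disagrees with the current estimate
            td_error = reward + gamma * next_value - values[s]
            values[s] += alpha * td_error
    return values


if __name__ == "__main__":
    # Estimates approach gamma**2, gamma, 1 for states 0, 1, 2
    print(td0_chain())
```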
For 2, I think a lot of it is finding the “sharp left turn” idea unlikely. I think trying to get agreement on that question would be valuable.
For 4, some of the arguments for it in this post (and comments) may help.
For 3, I’d be interested in there being some more investigation into and explanation of what “interpretability” is supposed to achieve (ideally with some technical desiderata). I think this might end up looking like agency foundations if done right.
For example, I’m particularly interested in how “interpretability” is supposed to work if, in some sense, much of the action of planning and achieving some outcome occurs far away from the code or neural network that played some role in precipitating it. E.g., one NN-based system convinces another more capable system to do something (including figuring out how); or an AI builds some successor AIs that go on to do most of the thinking required to get something done. What should “interpretability” do for us in these cases, assuming we only have access to the local system?
I think the upvotes, without answers, mean that other people are also interested in hearing Nate’s clarifications on these questions, particularly #1.
2 is a mixture of both—examples will hopefully come as people comment their disagreements.
Ambitiousness in interpretability can look like greater generalization to never-before-seen architectures, especially automated generalization that doesn’t strictly need human intervention. It can also look like robustly being able to use interpretability tools to provide oversight to training, e.g. as “thought assessors.” I bet people more focused on interpretability have more ideas.
(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I’m extra excited about people coming up with ideas for 3.)