I think your argument summarizes thus: strong automated interpretability will become dangerous because improved self-knowledge will make it easier for AGI to self-improve.
When most alignment people talk about self-interpretability, they’re talking not about self-interpretation, but interpretation by outside AI tools.
Of course, it’s likely that AGI will be given access to such tools if it improves their capabilities. Which it probably will.
I think adding that distinction might make the importance of this issue clearer.
Well, tools like Pythia help us peer inside the NN and reason about how things work. The same tools can help the AGI reason about itself. Or the AGI develops its own, better tools. What I am talking about is an AGI doing what the interpretability researchers are doing now (or what OpenAI is trying to do with GPT-4 interpreting GPT-2).
It doesn’t matter how, and I don’t know how; I just wanted to point out the simple path to algorithmic foom even if we start with a NN.
I don’t see a simple path to algorithmic foom from AI interpretability. What NNs do can’t be turned into an algorithm by any known route.
However, I do think some parts of their reasoning might be adaptable to algorithms. And I think that adding algorithms to language models is a clear path to AGI, as I’ve written about in Capabilities and alignment of LLM cognitive architectures.
So your point stands. I think it might be clarified by going into more depth on how NNs might be adapted to algorithms.
What NNs do can’t be turned into an algorithm by any known route.
NN → algorithms was one of my assumptions. Maybe I can relay my intuitions for why it is a good assumption:
For example, in the paper https://arxiv.org/abs/2301.05217 they explore grokking by making a transformer learn to do modular addition, and then they reverse engineer the algorithm that training “came up with”. Furthermore, supporting my point in this post, the learned algorithm is very far from being the most efficient one, due to “living” inside a transformer. So, in this example, imagine that we didn’t know what the network was doing, and someone was just trying to do the same thing the NN did, but faster and more efficiently. They would study the network, look at the bonkers algo that it learned, realize what it does, and then write the three lines of assembly code that actually do the modular addition, much faster (and more precisely!) and without wasting resources and time on the big matrices in the transformer.
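To make the contrast concrete, here is a minimal Python sketch. It is only in the spirit of the trig-identity algorithm that paper describes, not a reproduction of the trained network: the modulus `P`, the frequency set `FREQS`, and both function names are assumptions chosen for illustration.

```python
import math

P = 113            # illustrative prime modulus (an assumption of this sketch)
FREQS = (1, 2, 5)  # hypothetical frequencies; any k not divisible by P would do


def mod_add_direct(a, b, p=P):
    """The designed version: one addition and one remainder."""
    return (a + b) % p


def mod_add_fourier(a, b, p=P, freqs=FREQS):
    """Roundabout version built from trig identities and an argmax over candidates."""
    feats = []
    for k in freqs:
        w = 2 * math.pi * k / p
        # cos(w(a+b)) and sin(w(a+b)) via the angle-addition identities
        cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
        sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
        feats.append((w, cos_ab, sin_ab))

    def logit(c):
        # Equals sum_k cos(w_k * (a + b - c)), which peaks when c == (a + b) mod p
        return sum(ca * math.cos(w * c) + sa * math.sin(w * c) for w, ca, sa in feats)

    return max(range(p), key=logit)
```

Both functions return the same answer, but the second one spends a loop over every candidate residue and several trig calls to get what the first gets from a single remainder operation.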
I can also tackle the problem from the other side: I assume (is it non-obvious?) that predicting-the-next-token can also be done with algorithms and not only with neural networks. I assume that intelligence can also be made with algorithms rather than only NNs. So there is very probably a correspondence: the same thing can be done in two different ways, and so NN → algorithms is possible. Maybe this correspondence doesn’t always favour simpler algos, and NNs are sometimes actually less complex, but claiming that NNs are less complex in general feels like the bolder claim.
To support my claim further, we could just look at the math. Transformers, RNNs, etc. are just linear algebra and non-linear activation functions. You can write that down, or even, just as an example, fit the multi-dimensional curve with a nonlinear function, maybe just a polynomial: do a Taylor expansion and maybe discard the terms that contribute least, or something else entirely… I am reluctant to even give ideas on how to do it because of the dangers, but NNs can most definitely be written down as a multivariate non-linear function. Hell, in physics neural networks are often regarded as just fitting, with many parameters, a really complex function we don’t have the mathematical form of (so the reverse of what I explained in this paragraph).
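As a toy illustration of "writing the network down", the sketch below spells out a tiny one-hidden-layer network as an explicit function and then swaps its activation for a truncated Taylor series. The weights, the names, and the choice of tanh are all made up for the example; nothing here comes from a real model.

```python
import math

# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network
W1 = [[0.5, -0.3], [0.2, 0.4]]
B1 = [0.1, -0.1]
W2 = [0.7, -0.6]
B2 = 0.05


def tanh_taylor(z):
    """Third-order Taylor expansion of tanh around 0: z - z^3 / 3."""
    return z - z ** 3 / 3


def network(x, act=math.tanh):
    """f(x) = W2 . act(W1 x + B1) + B2, written out without any NN library."""
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, B1)]
    return sum(w * h for w, h in zip(W2, hidden)) + B2


small_x = [0.2, -0.1]
exact_small = network(small_x)
poly_small = network(small_x, act=tanh_taylor)

large_x = [3.0, -3.0]
exact_large = network(large_x)
poly_large = network(large_x, act=tanh_taylor)
```

In this toy case the polynomial version tracks the network closely for `small_x` and drifts badly for `large_x`, since a truncated expansion is only trustworthy near the point it was expanded around. Which terms can safely be discarded therefore depends on the inputs the network actually sees.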
And neural networks can be evolved, which is their biggest strength. I do expect that predicting-the-next-token algorithms can actually be much better than GPT-4, by the same analogy that Yudkowsky uses for why designed nanotech is probably much better than natural nanotech: the learned algorithms must be evolvable, and so they sit in a much shallower “loss potential well” than designed algorithms could.
And it seems to me that this reverse engineering process is what interpretability is all about. Or at least what the Holy Grail of interpretability is.
Now, as I’ve written down in my assumptions, I don’t know if any of the learned cognition algorithms can be written down efficiently enough to have an edge on NNs:
[I assume that] algorithms interpreted from matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication in GPUs is actually pretty damn well optimized
Maybe I should write a sequel to this post laying out all of these intuitions and motivations for why NN → algo is a possibility.
I hope I made some sense, and I didn’t just ramble nonsense 😁.
Sorry it took me so long to get back to this; I either missed it or didn’t have time to respond. I still don’t, so I’ll just summarize:
You’re saying that what NNs do could be made a lot more efficient by distilling it into algorithms.
I think you’re right about some cognitive functions but not others. That’s enough to make your argument accurate, so I suggest you focus on that in future iterations. (Maybe going from “suicide” to “adding danger” would be more accurate.)
I suggest this change because I think you’re wrong about a majority of cognition. The brain isn’t being inefficient in most of what it does. You’ve chosen arithmetic as your example. I totally agree that the brain performs arithmetic in a wildly inefficient way. But that establishes one end of a spectrum. The intuition that most of cognition could be vastly optimized with algorithms is highly debatable. After a couple of decades of working with NNs and thinking about how they perform human cognition, I have the opposite intuition: NNs are quite efficient (this isn’t to say that they couldn’t be made more efficient; surely they can!).
For instance, I’m pretty sure that humans use a Monte Carlo tree search algorithm to solve novel problems and do planning. That core search structure can be simplified as an algorithm.
But the power of our search process comes from having excellent estimates of the semantic linkages between the problem and possible leaves in the tree, and excellent predictors of likely reward for each branch. Those estimates are provided by large networks with good learning rules. Those can’t be compressed into an algorithm particularly efficiently; neural network distillation would probably work about as efficiently as it’s possible to work. There are large computational costs because it’s a hard problem, not because the brain is approaching the problem in an inefficient way.
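A rough sketch of that split, under loud assumptions: this is a plain best-first search, not Monte Carlo tree search, and `estimate_value` is a hypothetical stand-in for whatever learned network scores a branch. The point is only that the search skeleton is a few lines of ordinary code, while all the difficulty hides inside the estimator that gets passed in.

```python
import heapq


def best_first_search(start, is_goal, successors, estimate_value, max_expansions=1000):
    """Expand the most promising state first; return a path to a goal, or None."""
    counter = 0  # tie-breaker so the heap never has to compare states directly
    frontier = [(-estimate_value(start), counter, start, [start])]
    seen = {start}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        expansions += 1
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                # heapq is a min-heap, so negate the estimate to pop the best first
                heapq.heappush(frontier, (-estimate_value(nxt), counter, nxt, path + [nxt]))
    return None


# Toy problem: reach TARGET from 1 using "+1" and "*2" moves.
TARGET = 37
path = best_first_search(
    start=1,
    is_goal=lambda s: s == TARGET,
    successors=lambda s: [s + 1, s * 2],
    estimate_value=lambda s: -abs(TARGET - s),  # hand-written stand-in for a learned estimator
)
```

In the toy problem the estimator is a one-line distance function. In the setting described above it would be a large trained network, and that part is what resists being rewritten as compact code.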
I’m not sure if that helps to convey my very different intuition or not. Like I said, I’ve got limited time. I’m hoping to convey my reaction to this post, in the hope that it will clarify your future efforts. My reaction was “OK good point, but it’s hardly “suicide” to provide just one more route to self-improvement”. I think the crux is the intuition of how much of cognition can be made more efficient with an algorithm than with a neural net. And I think most readers will share my intuition that it’s a small subset of cognition that can be made much more efficient in algorithms.
One reason is the usefulness of learning. NNs provide a way to constantly and efficiently improve the computation through learning. Unless there’s an equally efficient way to do that in closed form algorithms, they have a massive disadvantage in any area where more learning is likely to be useful. Here again, arithmetic is the exception that suggests a rule. Arithmetic is a closed cognitive function; we know exactly how it works and don’t need to learn more. Ways of solving new, important problems benefit massively from new learning.
“OK good point, but it’s hardly “suicide” to provide just one more route to self-improvement”
I admit the title is a little bit clickbaity, but given my list of assumptions (which does include that NNs can be made more efficient by interpreting them) it does elucidate a path to foom (which does look like suicide without alignment).
Unless there’s an equally efficient way to do that in closed form algorithms, they have a massive disadvantage in any area where more learning is likely to be useful.
I’d like to point out that in this instance I was talking about the learned algorithm not the learning algorithm. Learning to learn is a can of worms I am not opening rn, even though it’s probably the area that you are referring to, but, still, I don’t really see a reason that there could not be more efficient undiscovered learning algorithms (and NN+GD was not learned, it was intelligently designed by us humans. Is NN+GD the best there is?).
Maybe I should clarify how I imagined the NN-AGI in this post: a single huge inscrutable NN like GPT. Maybe a different architecture, maybe a bunch of NNs in a trench coat, but still mostly NN. If that is true, then there are a lot of things that can be upgraded by writing them in code rather than keeping them in NNs (arithmetic is the easy example, MC tree search is another...). Whatever MC tree search the giant inscrutable matrices have implemented, it is probably really bad compared to sturdy old-fashioned code.
Even if NNs are the best way to learn algorithms, they are not the best way to design them. I am talking about the difference between evolvable and designable.
NNs allow us to evolve algorithms, while code allows us to intelligently design them: if there is no easy evolvable path to an algorithm, neural networks will fail.
The parallel to evolution is: evolution cannot make bones out of steel (even though they would be much better) because there is no shallow gradient to get steel (no way to have the recipe for steel-bones be in a way that if the recipe is slightly changed you still get something steel-like and useful). Evolution needs a smooth path from not-working to working while design doesn’t.
With intelligence, the computations don’t need to be evolved (or learned); they can be designed, shaped with intent.
Are you really that confident that the steel equivalent of algorithms doesn’t exist? Even though as humans we have barely explored that area (nothing hard-coded comes close to even GPT-2)?
Do we have any (non-trivial) equivalent algorithm that works best inside a NN rather than code? I guess those might be the hardest to design/interpret so we won’t know for certain for a long time...
Arithmetic is a closed cognitive function; we know exactly how it works and don’t need to learn more.
If we knew exactly how to make poems out of math theorems (like GPT-4 does), that would make it a “closed cognitive function” too, right? Can that learned algorithm be reverse engineered from GPT-4? My answer is yes ⇒ foom ⇒ we ded.
Any type of self-improvement in an un-aligned AGI = death. And if it’s already better than human level, it might not even need to do a bit of self-improvement, just escape our control, and we’re dead. So I think the suicide is quite a bit of hyperbole, or at least stated poorly relative to the rest of the conceptual landscape here.
If the AGI is aligned when it self-improves with algorithmic refinement, reflective stability should probably cause it to stay aligned afterwards, and we just have a faster benevolent superintelligence.
So this concern is one more route to self-improvement. And there’s a big question of how good a route it is.
My points were:
Learning is at least as important as runtime speed. Refining networks to algorithms helps with one but destroys the other.
Writing poems, and most cognitive activity, will very likely not resolve to a more efficient algorithm like arithmetic does. Arithmetic is a special case; perception and planning in varied environments require broad semantic connections. Networks excel at those. Algorithms do not.
So I take this to be a minor, not a major, concern for alignment, relative to others.
So I take this to be a minor, not a major, concern for alignment, relative to others.
Oh sure, this was more a “look at this cool thing intelligent machines could do, which should stop people from saying things like ‘foom is impossible because training runs are expensive’”.
learning is at least as important as runtime speed. Refining networks to algorithms helps with one but destroys the other
Writing poems, and most cognitive activity, will very likely not resolve to a more efficient algorithm like arithmetic does. Arithmetic is a special case; perception and planning in varied environments require broad semantic connections. Networks excel at those. Algorithms do not.
Please don’t read this as me being hostile, but… why? How sure can we be of this? How sure are you that things-better-than-neural-networks are not out there?
Do we have any (non-trivial) equivalent algorithm that works best inside a NN rather than code?
Btw I am no neuroscientist, so I could be missing a lot of the intuitions you have.
At the end of the day, you seem to think that it may be possible to fully interpret and reverse engineer neural networks, but you just don’t believe that Good Old Fashioned AGI can exist and/or be better than training NN weights?
I haven’t justified either of those statements; I hope to make the complete arguments in upcoming posts. For now I’ll just say that human cognition is solving tough problems, and there’s no good reason to think that algorithms would be lots more efficient than networks in solving those problems.
I’ll also reference Moravec’s paradox as an intuition pump. Things that are hard for humans, like chess and arithmetic, are easy for computers (algorithms); things that are easy for humans, like vision and walking, are hard for algorithms.
I definitely do not think it’s pragmatically possible to fully interpret or reverse engineer neural networks. I think it’s possible to do it adequately to create aligned AGI, but that’s a much weaker criterion.
Hell, in physics neural networks are often regarded as just fitting, with many parameters, a really complex function we don’t have the mathematical form of (so the reverse of what I explained in this paragraph).
Basically I expect the neural networks to be a crude approximation of a hard-coded cognition algorithm. Not the other way around.
Nice.
Oh, I see.
Thanks for coming back to me.
Sorry for taking long to get back to you.
Please fix (or remove) the link.
Done, thanks!