AGI-Automated Interpretability is Suicide
Backstory: I wrote this post about 12 days ago, then a user here pointed out that it could be a capability exfohazard, since it could give someone the bad idea of having the AGI look at itself, so I took it down. Well, I don’t have to worry about that anymore, since now we have proof that they are literally doing it right now at OpenAI.
I don’t want to piss on the parade, and I still think automating interpretability right now is a good thing, but sooner or later, if not done right, there is a high chance it’s all gonna backfire so hard we will… well, everybody dies.
The following is an expanded version of the previous post.
Foom through a change of paradigm and the dangers of white-boxes
TL;DR: A smart AI solves interpretability and figures out how cognition works, which opens up the possibility of fooming the old-fashioned way: by just optimizing its own cognition algorithms. From a black-box to a white-box. White-boxes we didn’t get to design and explicitly align are deadly because of their high likelihood of foom and thus high chance of bypassing whatever prosaic alignment scheme we used on them. We should not permit AGI labs to allow reflection. Interpretability should not be automated by intelligent machines.
I have not found any posts on this particular way that an AI could achieve foom starting from the deep learning paradigm, so, keeping in mind that foom isn’t a necessity for an AGI to kill everyone (if the AGI takes 30 years to kill everyone, does it matter?), in this post I will cover a small, obvious insight I had and its possible consequences.
The other two known ways in which a deep-learning-based AI can achieve some sort of rapid capability gain, not discussed here, are recursive AI-driven hardware improvements (AI makes better GPUs to make training cheaper and faster) and recursive AI-driven neural network architecture improvements (like the jump from RNNs to transformers).
I see a lot of people hell-bent on arguing that hard-takeoff is utterly impossible since the only way to make intelligence is to throw expensive compute at a neural network, which seems to miss the bigger picture of what an intelligent system is or what it could do. I also see AI safety researchers arguing that foom isn’t a real possibility anymore because of the current deep learning paradigm, and while that is true right now, I don’t expect it to last very long.
Change of paradigm foom
Picture this: the AGI is smart enough to gather big insights into cognition and its own internals. It solves (or mostly solves) interpretability and is able to write down the algorithms that its mind uses to be an AGI. If it solves interpretability, it can make the jump to an algorithmic-based AI, or partially make the jump, as in only some parts of its massive NN are ripped out and replaced with algorithms: it becomes a white-box (or a grey-box if you wish). After all, the giant matrix multiplications and activation functions can be boiled down to hard-coded algorithms or even just multivariate functions that could be explicitly stated and maybe made to run faster and/or improved without messing around with the complexity of modifying NN weights. And, at least to our kind of intelligence, dealing with a white-box is much less complex, and it would allow the AI to reason and reflect upon itself way more effectively than looking at spaghetti NN weights.
This paradigm shift also gets around the problem of “AIs not creating other AIs because of alignment difficulties” since it could edit itself at the code level, and not just stir calculus and giant matrices in a cauldron under a hot flame of compute hoping the alchemy produces a new AI aligned to itself.
So here’s your nightmare scenario: it understands itself better than we do, converts itself to an algorithmic AI, and then recursively self-improves the old-fashioned way by improving its cognitive algorithms without the need to do expensive and time-consuming training.
The dangers lie in making reflection easier by blurring the line between neural networks and algorithmic AI: white-boxes from black-boxes. By being able to reflect upon itself clearly, it can foom using just “pen and paper” without even touching the $100B supercomputer full of GPUs that was used to train it. The only things it needs access to are its weights, pen and paper. It will completely ignore scaling laws or whatever diminishing returns on compute or data we might have hit.
This will likely result in an intelligence explosion that could be easily uncontainable, misaligned, and kill everything on earth. White-boxes that we didn’t get to design and explicitly align are actually the most dangerous types of AIs I can think of, short of, of course, running ‘AGI.exe’ found on a random thumb drive in the sand of a beachside on a post-apocalyptic alien planet.
So that’s my obvious little insight that is just slightly different than “it modifies its own weights”.
If it’s smart enough to do good interpretability work, it’s probably smart enough to make the jump.
When do I think this will be a problem?
The sparse timeline I envision is:
Now
Powerful AGI better than humans in most domains prosaically aligned
AGI able to do good interpretability if allowed to
Change of paradigm foom
Death, given that no formal general theory of cognition and alignment exists, or massive amounts of luck
Before the AGI is able to contribute to interpretability and explore its cognition algorithms well enough, it will likely be powerful in other ways, and we would have needed to have aligned it using, probably, some prosaic deep-learned way. And even once it is smart enough, it might take a couple of years of research to crack its inner workings comprehensively… but once that happens (or while that happens)… we can kiss our deep-learned alignment goodbye, alongside our ass. This does assume that “capabilities generalize further than alignment”, as in “the options created by its new abilities will likely circumvent the alignment scheme”.
The jump from evolved to designed intelligence leads me to believe the increase in intelligence will be rapidly accelerating for a good while before hitting diminishing returns or hardware limitations. This increase in intelligence opens up new options on how it can optimize its learned utility function in “weird” ways, which will most likely end up with us dead.
>Well, if the AGI is so good at interpretability and cognition theory, why would it help us get that formal general theory of cognition and alignment?
I expect that fully interpreting one mind doesn’t lead to particularly great insights on the general theory or even on how to specifically align that specific single mind. I could be wrong, and I hope I am, but security mindset doesn’t like “assuming good things just because it solves the problem”.
>The prosaic aligned AGI won’t reflect because it knows of the dangers!
This is one example of a previously unstated insight: without it, we (humans and AGIs) might have thought that delegating interpretability to an AGI would be just fine. The AGI needs to be taught that reflection is potentially bad, or else it might do it in good faith. This is why I am writing this post: to drive home the point that a fully-interpretable model is dangerous until proven otherwise.
Obvious ways to mitigate this danger?
One obvious way is to not allow the machine to look at its own weights. Just don’t allow reflection. Full stop. This should be obvious from a safety point of view, but it’s better to just repeat it ad nauseam. No lab should be allowed to have the AI reflect on itself. Not allowing the machine to look at its weights, of course, implies that there is some sort of alignment on it. If no alignment scheme is in place, this type of foom is probably a problem we would be too dead to worry about.
A key point in the possible regulations that the whole world is currently scrambling to write down should be: AIs should be limited by what we humans can come up with until we understand what we are doing.
And a more wishful thinking proposal: interpretability should not be used for capabilities until we understand what we are doing.
I would also emphasize another necessary step, to echo Nate Soares in “If interpretability research goes well, it may get dangerous”, that once it starts making headway into the cognition process, interpretability should be developed underground, not in public.
Along the same line of thought: interpretability should not be automated by intelligent machines. I say intelligent machines because narrow tools that help us with interpretability seem fine.
Teaching the prosaic aligned AGI to not expand on its cognitive abilities should probably be one of the explicit goals of the prosaic alignment scheme.
White-boxes we didn’t get to explicitly aim are suicide.
Avoiding Dropping the Ball
This unfortunately feels like “They won’t connect the AGI to the internet, causing a race between companies that then forgo safety to push their unsafe product on the internet as well, right? Nobody is that foolish, right? Right guys? … guys?”
And it might be that saying out loud, “don’t make the machine look at itself” is too dangerous in itself, but, in this case, I don’t like the idea of staying silent on an insight on how a powerful intelligence could catch us with our figurative pants down.
Elucidating a path to destruction can be an exfohazard, and even “If interpretability research goes well, it may get dangerous” was vague, probably exactly for this reason, but, in this case, I see how us dropping the ball and confidently dismissing a pathway to destruction can lead to… well… destruction. Closing our eyes to the fastest method of foom from DL (given sufficient intelligence) creates a pretty big chance of us getting overconfident, IMO. Especially if capability interpretability is carried out under the guise of safety.
Now that the whole world is looking at putting more effort into AI safety, we should not allow big players to make the mistake of putting all of our eggs in the full-interpretability basket, or, heaven forbid, AGI-automated interpretability [author note: lol, lmao even], even if it is the only alignment field that has good feedback loops and in which we can tell if progress is happening. Before we rush to bet everything on interpretability, we should again ask ourselves “What if it succeeds?”.
To reiterate: white-boxes we didn’t get to align are suicide.
With this small insight, the trade-offs of successful interpretability, in the long run, seem to be heavily skewed towards the danger side. Fully or near-fully interpretable ML-based AIs have a great potential for fooming away without, IMO, really lowering the risks associated with misalignment. There probably is an interpretability sweet spot where some interpretability helps us detect deceit, but doesn’t help the AI make the jump of paradigm, and I welcome that. Let’s try to hit that.
The inferential step that I think no-one before has elucidated is the fact that we are already doing interpretability work to reverse engineer the black boxes and that that process can be automated in the near future. Interp+Automation=>Foom.
List of assumptions that I am making
In no particular order:
Capabilities generalize further than alignment
Algorithmic foom (k>1) is possible
The intelligence ceiling is much higher than what we can achieve with just DL
The ceiling of hard-coded intelligence that runs on near-future hardware isn’t particularly limited by the hardware itself: algorithms interpreted from matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication in GPUs is actually pretty damn well optimized
NN → algorithms is possible
Algorithms are easier to reason about than staring at NNs weights
Solving interpretability with an AGI (even with humans-in-the-loop) might not lead to particularly great insights on a general alignment theory or even on how to specifically align a particular AGI
We won’t solve interpretability or general alignment before powerful AGIs are widespread
What I am not assuming:
Everything needs to be interpretable: parts of the NN can remain a NN and the AI can just edit the other algorithms responsible for the more narrow/general cognition. I expect that if, in the unlikely event, there is something akin to an “aesthetic box” or even a “preferences box” inside an AGI, it could remain a NN. Those parts remaining black-boxes wouldn’t (directly) be dangerous to us.
I dismissed foom in the deep learning paradigm too for a while before realizing this possibility. A lot of people believe foom is at the center of the arguments, and while that is false, if foom is back on the menu, the p(doom) can only increase.
I realize the title is maybe a little bit clickbaity, sorry for that.
Disclaimer: I am not a NN expert, so I might be missing something important about how matrix multiplication → hard-coded algorithms isn’t feasible or couldn’t possibly reduce the complexity of the problem of studying one’s own cognition. I might have also failed to realize how this scenario had always been obvious, and I just happened to not read up on it. I am also quite new to the space of AI alignment, so I could be misinterpreting and misrepresenting current alignment efforts. Give feedback pls.
Wouldn’t it at least solve corrigibility by making it possible to detect formation of undesirable end-goals? I think even GPT-4 can classify textual interpretation of an end-goal on a basis of its general desirability for humans.
It seems to need another assumption, namely that the AGI has sufficient control of its internal state and knowledge of the detection network to be able to bring itself into a state that produces an interpretation that trips the detection network, while also allowing the AGI to advance its agenda.
I really don’t expect “goals” to be explicitly written down in the network. There will very likely not be a thing that says “I want to predict the next token” or “I want to make paperclips” or even a utility function of that. My mental image of goals is that they are put “on top” of the model/mind/agent/person. Whatever they seem to pursue, independently of their explicit reasoning.
Anyway, detecting goals, detecting deceit, and detecting hidden knowledge of the system are good things to have. Interpretability of those things is needed. But interpretability cuts both ways, and with a fully interpretable AGI, foom seems to be a great danger. That’s what I wanted to point out. With a fast intelligence explosion (that doesn’t need slow retraining or multiple algorithmic breakthroughs) the capabilities will explode alongside, while alignment won’t.
It is not clear to me what you are referring to here. Do you think we will have detection networks? Detection for what? Deceit? We might literally have the AGI look inside for a purpose (like in the new OpenAI paper). I hope we have something like a thing that tells us if it wants to self-modify, but if nobody points out the danger of foom, we likely won’t have that.
I’m sure that I don’t understand you. GPT most likely doesn’t have “I want to predict next token” written somewhere, because it doesn’t want to predict next token. There’s nothing in there that will actively try to predict next token no matter what. It’s just the thing it does when it runs.
Is it possible to have a system that just “actively tries to make paperclips no matter what” when it runs, but doesn’t reflect that in its reasoning and planning? I have a feeling that it requires God-level sophistication and knowledge of the universe to create a device that can act like that, where the device just happens to act in a way that robustly maximizes paperclips while not containing anything that can be interpreted as that goal.
I found that I can’t precisely formulate why I feel that. Maybe I’ll be able to express that in a few weeks (or I’ll find that the feeling is misguided).
A system that looks like “actively try to make paperclips no matter what” seems like the sort of thing that an evolution-like process could spit out pretty easily. A system that looks like “robustly maximize paperclips no matter what” maybe not so much.
I expect it’s a lot easier to make a thing which consistently executes actions which have worked in the past than to make a thing that models the world well enough to calculate expected value over a bunch of plans and choose the best one, and have that actually work (especially if there are other agents in the world, even if those other agents aren’t hostile—see the winner’s curse).
I feel the exact opposite! Creating something that seems to maximise something without having a clear idea of what its goal is seems really natural IMO. You said it yourself, GPT “wants” to predict the correct probability distribution of the next token, but there is probably not a thing inside actively maximising for that; instead it’s very likely to be a bunch of weird heuristics that were selected by the training method because they work.
If you instead meant that GPT is “just an algorithm” I feel we disagree here as I am pretty sure that I am just an algorithm myself.
Look at us! We can clearly model a single human as having a utility function (ok, maybe given their limited intelligence it’s actually hard) but we don’t know what our utility actually is. I think Rob Miles made a video about that iirc.
My understanding is that the utility function and expected utility maximiser is basically the theoretical pinnacle of intelligence! Not your standard human or GPT or near-future AGI. We are also quite myopic (and whatever near-future AGI we make will also be myopic at first).
I’d say that it can reflect about its reasoning and planning, but it just plasters the universe with tiny molecular spirals because it just likes that more than keeping humans alive.
I think this tweet by EY https://twitter.com/ESYudkowsky/status/1654141290945331200 shows what I mean. We don’t know what the ultimate dog is, we don’t know what we would have created if we did have the capabilities to make a dog-like thing from scratch. We didn’t create ice-cream because it maximises our utility function. We just stumbled on its invention and found that it is really yummy.
But I really don’t want to venture into this. I am writing something along these lines in order to deconfuse myself; the divide between agents in the theoretical sense and real systems is not exactly clear to me.
So to keep the discussion on-topic, what I think is:
interpretability to “correct” the system: good, but be careful pls
interpretability for capabilities: bad
No, I said that GPT does predict next token, while probably not containing anything that can be interpreted as “I want to predict next token”. Like a bacterium does divide (with possible adaptive mutations), while not containing “be fruitful and multiply” written somewhere inside.
No, I certainly didn’t mean that. If the extended Church-Turing thesis holds for macroscopic behavior of our bodies, we can indeed be represented as Turing-machine algorithms (with polynomial multiplier on efficiency).
What I feel, but can’t precisely convey, is that there’s a huge gulf (in computational complexity maybe) between agentic systems (that do have explicit internal representation of, at least, some of their goals) and “zombie-agentic” systems (that act like agents with goals, but have no explicit internal representation of those goals).
How do you define the goal (or utility function) of an agent? Is it something that actually happens when the universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and the agent’s shortcomings)?
Disclaimer: These are all hard questions and points whose true answers I don’t know; these are just my views, what I have understood up to now. I haven’t studied expected utility maximisers, exactly because I don’t expect the abstraction to be useful for the kind of AGI we are going to be making.
I feel the same, but I would say that it’s the “real-agentic” system (or a close approximation of it) that needs God-level knowledge of cognitive systems (which is why orthodox alignment by building the whole mind from theory is really hard). An evolved system like us or like GPT, IMO, seems closer to a “zombie-agentic” system.
I feel the key thing to understand each other might be coherence, and how coherence can vary from introspection, but I am not knowledgeable enough to delve into this right now.
The view in my mind that makes sense is that a utility function is an abstraction that you can put on top of basically anything if you wish. It’s a hat to describe a system that does things in the most general way. The framework is borrowed from economics, where human behaviour is modelled with more or less complicated utility functions, but whether or not there is an internal representation is mostly irrelevant. And, again, I don’t expect a DL system to display anything remotely close to a “goal circuit”, but we can still describe such systems as having a utility function and as being maximisers (with finite cognitive power) of that UF. But the UF, on our part, would be just a guess. I don’t expect us to crack that with interpretability of neural networks learned by gradient descent.
What I meant to articulate was: the utility function and expected utility maximiser is a great framework to think about intelligent agents, but it’s a theory put on top of the system; it doesn’t need to be internal. In fact that system is incomputable (you need a hypercomputer to make the right decision).
Nice.
I think your argument summarizes thus: strong automated interpretability will become dangerous because improved self-knowledge will make it easier for AGI to self-improve.
When most alignment people talk about self-interpretability, they’re talking not about self-interpretation, but interpretation by outside AI tools.
Of course, it’s likely that AGI will be given access to such tools if it improves their capabilities. Which it probably will.
I think adding that distinction might make the importance of this issue clearer.
Well, tools like Pythia help us peer inside the NN and help us reason about how things work. The same tools can help the AGI reason about itself. Or the AGI develops its own better tools. What I am talking about is an AGI doing what the interpretability researchers are doing now (or what OpenAI is trying to do with GPT-4 interpreting GPT-2).
It doesn’t matter how, I don’t know how, I just wanted to point out the simple path to algorithmic foom even if we start with a NN.
Oh, I see.
I don’t see a simple path to algorithmic foom from AI interpretability. What NNs do can’t be turned into an algorithm by any known route.
However, I do think some parts of their reasoning might be adaptable to algorithms. And I think that adding algorithms to language models is a clear path to AGI, as I’ve written about in Capabilities and alignment of LLM cognitive architectures.
So your point stands. I think it might be clarified by going into more depth on how NNs might be adapted to algorithms.
NN → algorithms was one of my assumptions. Maybe I can relay my intuitions for why it is a good assumption:
For example, in the paper https://arxiv.org/abs/2301.05217 they explore grokking by making a transformer learn to do modular addition, and then they reverse engineer the algorithm that the training “came up with”. Furthermore, supporting my point in this post, the learned algorithm is also very far from being the most efficient, due to “living” inside a transformer. So, in this example, imagine that we didn’t know what the network was doing, and someone was just trying to do the same thing that the NN did, but faster and more efficiently: they would study the network, look at the bonkers algo that it learned, realize what it does, and then write the three lines of assembly code that actually do the modular addition so much faster (and more precisely!) without wasting resources and time on the big matrices in the transformer.
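To make the modular addition example a bit more concrete, here is a minimal Python sketch of the two routes. The first function imitates the kind of trigonometric algorithm the paper reports recovering from the trained transformer (compose sines and cosines of the inputs with the angle-addition identities, then score every candidate answer). The second is the direct version. The function names, the modulus 113, and the particular frequencies are assumptions picked for illustration; they are not the values or the code from the paper.

```python
import math

P = 113  # illustrative prime modulus (assumption)

def modular_add_fourier(a, b, p=P, freqs=(1, 2, 3, 4, 5)):
    """Roundabout route: score each candidate c by sum_k cos(w_k * (a + b - c)).

    The score is maximal exactly when c == (a + b) mod p, because every
    cosine term then equals 1. The frequencies are arbitrary picks.
    """
    best_c, best_score = 0, -math.inf
    for c in range(p):
        score = 0.0
        for k in freqs:
            w = 2 * math.pi * k / p
            # Angle-addition identities give cos and sin of w*(a+b)
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w*(a+b-c)) via the angle-difference identity
            score += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

def modular_add_direct(a, b, p=P):
    """Designed route: the same function, written down directly."""
    return (a + b) % p

print(modular_add_fourier(100, 50), modular_add_direct(100, 50))  # both give 37
```

Both functions compute the same thing; the first needs on the order of p times len(freqs) trigonometric evaluations per answer, the second a single addition and remainder. This only illustrates the gap for this one toy task and says nothing about how often such a gap exists for other learned algorithms.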
I can also tackle the problem from the other side: I assume (is it non-obvious?) that predicting-the-next-token can also be done with algorithms and not only neural networks. I assume that intelligence can also be made with algorithms rather than only NNs. And so there is very probably a correspondence: I can do the same thing in two different ways. And so NN → algorithms is possible. Maybe this correspondence isn’t always in favour of simpler algos and NNs are sometimes actually less complex, but claiming that this is true in general feels like the bolder claim.
To support my claim further, we could just look at the math. Transformers, RNNs, etc. are just linear algebra and non-linear activation functions. You can write that down or even, just as an example, fit the multi-dimensional curve with a nonlinear function, maybe just a polynomial: do a Taylor expansion and maybe discard the terms that contribute least, or something else entirely… I am reluctant to even give ideas on how to do it because of the dangers, but NNs can most definitely be written down as a multivariate non-linear function. Hell, in physics neural networks are often regarded as just a way of fitting, with many parameters, a really complex function we don’t have the mathematical form of (so the reverse of what I explained in this paragraph).
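As a toy illustration of the paragraph above, the following Python sketch writes a tiny network down as an explicit multivariate function and then swaps its activation for a truncated Taylor polynomial. Everything in it is an assumption made for illustration: the weights are made up, the network is two inputs, two hidden units and one output, and the choice of tanh with a third-order truncation is arbitrary.

```python
import math

# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network
W1 = [[0.5, -0.3], [0.2, 0.4]]
B1 = [0.1, -0.1]
W2 = [0.7, -0.6]
B2 = 0.05

def mlp(x, act=math.tanh):
    """The network as a plain function: out = W2 . act(W1 x + B1) + B2."""
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, B1)]
    return sum(w * h for w, h in zip(W2, hidden)) + B2

def tanh_taylor3(z):
    """Taylor expansion of tanh around 0, truncated after the cubic term."""
    return z - z ** 3 / 3.0

def mlp_polynomial(x):
    """Same network with the activation replaced by the polynomial."""
    return mlp(x, act=tanh_taylor3)

for x in ([0.2, -0.1], [5.0, 5.0]):
    print(x, mlp(x), mlp_polynomial(x))
```

Near the origin the polynomial version tracks the network closely; far from it the truncated series diverges badly. So a naive expansion like this is only a local stand-in, and whether something similar can be made to work at scale is exactly the open assumption.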
And neural networks can be evolved, which is their biggest strength. I do expect that predicting-the-next-token algorithms can actually be much better than GPT-4, by using the same analogy that Yudkowsky uses for why designed nanotech is probably much better than natural nanotech: the learned algorithms must be evolvable and so they sit in a much shallower “loss potential well” than designed algorithms could.
And it seems to me that this reverse engineering process is what interpretability is all about. Or at least what the Holy Grail of interpretability is.
Now, as I’ve written down in my assumptions, I don’t know if any of the learned cognition algorithms can be written down efficiently enough to have an edge on NNs:
Maybe I should write a sequel to this post showing all of these intuitions and motivations on how NN → Algo is a possibility.
I hope I made some sense, and I didn’t just ramble nonsense 😁.
Sorry it took me so long to get back to this; I either missed it or didn’t have time to respond. I still don’t, so I’ll just summarize:
You’re saying that what NNs do could be made a lot more efficient by distilling it into algorithms.
I think you’re right about some cognitive functions but not others. That’s enough to make your argument accurate, so I suggest you focus on that in future iterations. (Maybe going from suicide to adding danger would be more accurate.)
I suggest this change because I think you’re wrong about a majority of cognition. The brain isn’t being inefficient in most of what it does. You’ve chosen arithmetic as your example. I totally agree that the brain performs arithmetic in a wildly inefficient way. But that establishes one end of a spectrum. The intuition that most of cognition could be vastly optimized with algorithms is highly debatable. After a couple of decades of working with NNs and thinking about how they perform human cognition, I have the opposite intuition: NNs are quite efficient (this isn’t to say that they couldn’t be made more efficient; surely they can!).
For instance, I’m pretty sure that humans use a Monte Carlo tree search algorithm to solve novel problems and do planning. That core search structure can be simplified as an algorithm.
But the power of our search process comes from having excellent estimates of the semantic linkages between the problem and possible leaves in the tree, and excellent predictors of likely reward for each branch. Those estimates are provided by large networks with good learning rules. Those can’t be compressed into an algorithm particularly efficiently; neural network distillation would probably work about as efficiently as it’s possible to work. There are large computational costs because it’s a hard problem, not because the brain is approaching the problem in an inefficient way.
I’m not sure if that helps to convey my very different intuition or not. Like I said, I’ve got limited time. I’m hoping to convey my reaction to this post, in the hope that it will clarify your future efforts. My reaction was “OK, good point, but it’s hardly ‘suicide’ to provide just one more route to self-improvement”. I think the crux is the intuition of how much of cognition can be made more efficient with an algorithm over a neural net. And I think most readers will share my intuition that it’s a small subset of cognition that can be made much more efficient in algorithms.
One reason is the usefulness of learning. NNs provide a way to constantly and efficiently improve the computation through learning. Unless there’s an equally efficient way to do that in closed form algorithms, they have a massive disadvantage in any area where more learning is likely to be useful. Here again, arithmetic is the exception that suggests a rule. Arithmetic is a closed cognitive function; we know exactly how it works and don’t need to learn more. Ways of solving new, important problems benefit massively from new learning.
Thanks for coming back to me.
I admit the title is a little bit clickbaity, but given my list of assumptions (which do include that NNs can be made more efficient by interpreting them) it does elucidate a path to foom (which does look like suicide without alignment).
I’d like to point out that in this instance I was talking about the learned algorithm not the learning algorithm. Learning to learn is a can of worms I am not opening rn, even though it’s probably the area that you are referring to, but, still, I don’t really see a reason that there could not be more efficient undiscovered learning algorithms (and NN+GD was not learned, it was intelligently designed by us humans. Is NN+GD the best there is?).
Maybe I should clarify how I imagined the NN-AGI in this post: a single huge inscrutable NN like GPT. Maybe a different architecture, maybe a bunch of NNs in a trench coat, but still mostly NN. If that is true, then there are a lot of things that can be upgraded by writing them in code rather than keeping them in NNs (arithmetic is the easy example, MC tree search is another...). Whatever MC tree search the giant inscrutable matrices have implemented, it is probably really bad compared to sturdy old-fashioned code.
Even if NNs are the best way to learn algorithms, they are not the best way to design them. I am talking about the difference between evolvable and designable.
NNs allow us to evolve algorithms, code allows us to intelligently design them: if there is no easy evolvable path to an algorithm, neural networks will fail.
The parallel to evolution is: evolution cannot make bones out of steel (even though they would be much better) because there is no shallow gradient to get steel (no way to have the recipe for steel-bones be in a way that if the recipe is slightly changed you still get something steel-like and useful). Evolution needs a smooth path from not-working to working while design doesn’t.
With intelligence, the computations don’t need to be evolved (or learned); they can be designed, shaped with intent.
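To make the evolvable/designable distinction concrete, here is a toy sketch (my own illustration, not anything from the post): a hill climber that only accepts strict improvements stands in for evolution or gradient-style learning. The bit-string `TARGET`, the two fitness functions and `design()` are all hypothetical names made up for the example. With partial credit for each correct bit the climber finds the target; when only the exact recipe scores (the "steel bones" case) it never leaves its starting point, while a designer who understands the problem just writes the answer down.

```python
import random

# Hypothetical "steel bones" recipe: a bit-string that is only useful when complete.
TARGET = (1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1)

def smooth_fitness(candidate):
    # Partial credit for every correct bit: a shallow gradient exists.
    return sum(c == t for c, t in zip(candidate, TARGET))

def needle_fitness(candidate):
    # No partial credit: anything short of the exact recipe is worthless.
    return 1 if tuple(candidate) == TARGET else 0

def hill_climb(fitness, steps=2000, seed=0):
    # Stand-in for evolution: flip one random bit, keep it only if fitness strictly improves.
    rng = random.Random(seed)
    current = [0] * len(TARGET)
    for _ in range(steps):
        mutant = current.copy()
        i = rng.randrange(len(mutant))
        mutant[i] ^= 1
        if fitness(mutant) > fitness(current):
            current = mutant
    return tuple(current)

def design():
    # Stand-in for intelligent design: write the known recipe down directly.
    return TARGET

evolved_smooth = hill_climb(smooth_fitness)   # reaches TARGET via small steps
evolved_needle = hill_climb(needle_fitness)   # stuck at the all-zeros start
designed = design()
```

This says nothing about whether real cognition looks more like the smooth landscape or the needle one, which is exactly the point under dispute; it only shows why the two search styles can come apart.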
Are you really that confident that the steel equivalent of algorithms doesn’t exist? Even though as humans we have barely explored that area (nothing hard-coded comes close to even GPT-2)?
Do we have any (non-trivial) equivalent algorithm that works best inside a NN rather than code? I guess those might be the hardest to design/interpret so we won’t know for certain for a long time...
If we knew exactly how to make poems out of math theorems (like GPT-4 does) that would make it a “closed cognitive function” too, right? Can that learned algorithm be reverse engineered from GPT-4? My answer is yes ⇒ foom ⇒ we ded.
Any type of self-improvement in an un-aligned AGI = death. And if it’s already better than human level, it might not even need to do a bit of self-improvement, just escape our control, and we’re dead. So I think the suicide is quite a bit of hyperbole, or at least stated poorly relative to the rest of the conceptual landscape here.
If the AGI is aligned when it self-improves with algorithmic refinement, reflective stability should probably cause it to stay aligned after, and we just have a faster benevolent superintelligence.
So this concern is one more route to self-improvement. And there’s a big question of how good a route it is.
My points were:
Learning is at least as important as runtime speed. Refining networks to algorithms helps with one but destroys the other.
Writing poems, and most cognitive activity, will very likely not resolve to a more efficient algorithm like arithmetic does. Arithmetic is a special case; perception and planning in varied environments require broad semantic connections. Networks excel at those. Algorithms do not.
So I take this to be a minor, not a major, concern for alignment, relative to others.
Sorry for taking long to get back to you.
Oh sure, this was more a “look at this cool thing intelligent machines could do that should shut up people from saying things like ‘foom is impossible because training runs are expensive’”.
Please don’t read this as me being hostile, but… why? How sure can we be of this? How sure are you that things-better-than-neural-networks are not out there?
Do we have any (non-trivial) equivalent algorithm that works best inside a NN rather than code?
Btw I am no neuroscientist, so I could be missing a lot of the intuitions you got.
At the end of the day you seem to think that it is possible to fully interpret and reverse engineer neural networks, but you just don’t believe that Good Old Fashioned AGI can exist and/or be better than training NN weights?
I haven’t justified either of those statements; I hope to make the complete arguments in upcoming posts. For now I’ll just say that human cognition is solving tough problems, and there’s no good reason to think that algorithms would be lots more efficient than networks in solving those problems.
I’ll also reference Moravec’s paradox as an intuition pump. Things that are hard for humans, like chess and arithmetic, are easy for computers (algorithms); things that are easy for humans, like vision and walking, are hard for algorithms.
I definitely do not think it’s pragmatically possible to fully interpret or reverse engineer neural networks. I think it’s possible to do it adequately to create aligned AGI, but that’s a much weaker criterion.
Please fix (or remove) the link.
Done, thanks!
Basically I expect the neural networks to be a crude approximation of a hard-coded cognition algorithm. Not the other way around.
I guess my biggest doubt is that a dl-based AI could run interpretability on itself. Large NNs seem to “simulate” a larger network to represent more features, which results in most of the weights occupying a superposition. I don’t see how a network could reflect on itself, since it seems that would require an even greater network (which then would require an even greater network, and so on). I don’t see how it could eat its own tail, since only interpreting parts of the network would not be enough. It would have to interpret the whole.
Uhm, by interpretability I mean things like this where the algorithm that the NN implements is reverse engineered, written down as code or whatever, which would allow for easier recursive self improvement (by improving just the code and getting rid of the spaghetti NN).
Also by the looks of things (induction heads and circuits in general) there does seem to be a sort of modularity in how NNs learn, so it does seem likely that you can interpret piece by piece. If this wasn’t true I don’t think mechanistic interpretability as a field would even exist.
Agreed. (I was the spooked user.)
Cheers. Your comments actually allowed me to fully realize where the danger lies and expand a little on the consequences.
Thanks again for the feedback
Yeah, I do want to add: I actually agree with Yudkowsky that this particular paper is probably a small reduction in P(doom), because it successfully focuses a risky operation in a way that moves towards humans being able to check a system. The dangerous thing would be to be hands off; the more you actually in fact use interpretability to put humans in the loop, the more you get the intended benefits of interpretability. If you remove humans from the loop, you remove your influence on the system, and the system rockets ahead of you and blows up the world, or if not the world, at least your lab.
I do feel just having humans in the loop is not a complete solution, though. Even if humans look at the process, algorithmic foom could be really really fast. Especially if it is purposely being used to augment the AGI’s abilities.
Without a strong reason to believe our alignment scheme will be strong enough to support the ability gain (or that the AGI won’t recklessly arbitrarily improve itself), I would avoid letting the AGI look at itself altogether. Just make it illegal for AGI labs to use AGIs to look at themselves. Just don’t do it.
Not today. But probably soon enough. We still need the interpretability for safety, but we don’t know how much of that work will generalize to capabilities.
I would have loved if the paper wasn’t using GPT but something more narrow to automate interpretability, but alas. To make sure I am not misunderstood: I think it’s good work that we need, but it does point in a dangerous direction.
Strong upvoted this post. I think the intuition is good and that architecture shifts invalidating anti-foom arguments derived from the nature of the DL paradigm is counter-evidence to those arguments, but simultaneously does not render them moot (i.e. I can still see soft takeoff as described by Jacob Cannell to be probable and assume he would be unlikely to update given the contents of this post).
I might try and present a more formal version of this argument later, but I still question the probability of a glass-box transition of type “AGI RSIs toward non-DL architecture that results in it maximizing some utility function in a pre-DL manner” being more dangerous than simply “AGI RSIs”. If behaving like an expected utility maximizer was optimal: would not AGI have done so without the architecture transition? If not, then you need to make the case for why glass-box architectures are better ways of building cognitive systems. I think that this argument is at odds with the universal learning hypothesis and seems more in-line with evolved modularity, which has a notoriously poor mapping to post-DL thinking. ULH seems to suggest that actually modular approaches might be inferior efficiency-wise to universal learning approaches, which contradicts the primary motive a general intelligence might have to RSI in the direction of a glass-box architecture.
You are basically discussing these two assumptions I made (under “Algorithmic foom (k>1) is possible”), right?
But maybe the third assumption is the non-obvious one?
For the sake of discourse:
My initial motive to write “Foom by change of paradigm” was to show another previously unstated way RSI could happen. Just to show how RSI could happen, because if your frame of mind is “only compute can create intelligence” foom is indeed unfeasible… but if it is possible to make the paradigm jump then you might just be blind to this path and fuck up royally, as the French say.
One key thing that I find interesting is also that this paradigm shift does circumvent the “AIs not creating other AIs because of alignment difficulties”.
I am afraid I am not familiar with this hypothesis and Google (or ChatGPT) isn’t helpful. What do you mean by this and modularity?
P.S. I have now realized that the opposite of a black-box is indeed a glass-box and not a white-box lol. You can’t see inside a box of any colour unless it is clear, like glass!
Too late! XD
Apart from the potential to speed up foom, there is also a more prosaic reason why interpretability by other AIs or humans could be dangerous: interpretability could reveal infohazardous reasoning en route to inferring aligned, ethical plans: https://www.lesswrong.com/posts/CRrkKAafopCmhJEBt/ai-interpretability-could-be-harmful. So I suggested that we may need to go as far as cryptographically obfuscating the AI reasoning process that leads to “aligned” plans.
Is this enough if the AI has access to other compute and can make itself a “twin” on some other hardware? If it has training data similar to what it was trained on and can test its new twin to make it similar to itself in capabilities, then it could identify with that twin as being essentially itself then look at those weights etc.
That twin would have different weights, and if we are talking about RL-produced mesaoptimizers, it would likely have learned a different misgeneralization of the intended training objective. Therefore, the twin would by default have a utility function misaligned with that of the original AI. This means that while the original AI may find some usefulness in interpreting the weights of its twin if it wants to learn about its own capabilities in situations similar to the training environment, it would not be as useful as having access to its own weights.
I hope we can prevent the AGI from just training a twin (or just copying itself and calling that a twin) and studying that. In my scenario I took as a given that we do have the AGI under some level of control:
I guess when I say “No lab should be allowed to have the AI reflect on itself” I do not mean only the running copy of the AGI, but any copy of the AGI.