Physicist switching to AI alignment
Studying these man-made horrors so they are no longer beyond my comprehension
What NNs do can’t be turned into an algorithm by any known route.
NN → algorithms was one of my assumptions. Maybe I can relay my intuitions for why it is a good assumption:
For example, in the paper https://arxiv.org/abs/2301.05217 they explore grokking by making a transformer learn to do modular addition, and then they reverse engineer what algorithm the training “came up with”. Furthermore, supporting my point in this post, the learned algorithm is also very far from being the most efficient, due to “living” inside a transformer. So, in this example, imagine that we didn’t know what the network was doing, and someone was just trying to do the same thing that the NN did, but faster and more efficiently. They would study the network, look at the bonkers algo that it learned, realize what it does, and then write the three lines of assembly code that actually do the modular addition so much faster (and more precisely!) without wasting resources and time on the big matrices in the transformer.
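To make the contrast concrete, here is a minimal sketch in Python. It assumes the learned algorithm is roughly the trig-identity scheme the paper describes: embed the inputs as sines and cosines at a few frequencies, combine them with angle-addition identities, and score each candidate answer c by cos(w(a+b-c)). The function names and the choice of frequencies are mine, purely for illustration; they are not the ones the trained network actually uses.

```python
import math

P = 113  # prime modulus (assumed here to match the paper's main experiment)

def add_mod_direct(a, b, p=P):
    """The 'three lines of assembly' version: one add, one remainder."""
    return (a + b) % p

def add_mod_fourier(a, b, p=P, freqs=(1, 2, 3)):
    """Roundabout version in the spirit of the learned algorithm.

    freqs is an arbitrary illustrative choice of frequencies, not the
    set of key frequencies found in any trained network.
    """
    logits = [0.0] * p
    for k in freqs:
        w = 2.0 * math.pi * k / p
        # Angle-addition identities give cos(w(a+b)) and sin(w(a+b)).
        cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
        sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
        for c in range(p):
            # cos(w(a+b-c)) peaks at 1 exactly when c == (a+b) mod p.
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    # The "answer" is the candidate with the largest summed logit.
    return max(range(p), key=lambda c: logits[c])

if __name__ == "__main__":
    print(add_mod_direct(100, 50), add_mod_fourier(100, 50))
```

Both functions agree on every input, but the second one burns hundreds of trig evaluations to do what the first does in a single remainder operation. That gap is the kind of thing I mean by “living” inside a transformer.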
I can also tackle the problem from the other side: I assume (is it non-obvious?) that predicting-the-next-token can also be done with algorithms and not only neural networks. I assume that intelligence can also be made with algorithms rather than only NNs. And so there is very probably a correspondence: I can do the same thing in two different ways. And so NN → algorithms is possible. Maybe this correspondence isn’t always in favour of simpler algos, and NNs are sometimes actually less complex, but it feels like a bolder claim for that to be true in general.
To support my claim further, we could just look at the math. Transformers, RNNs, etc. are just linear algebra and non-linear activation functions. You can write that down or even, just as an example, fit the multi-dimensional curve with a nonlinear function, maybe just a polynomial: do a Taylor expansion and maybe discard the terms that contribute less, or something else entirely… I am reluctant to even give ideas on how to do it because of the dangers, but NNs can most definitely be written down as a multivariate non-linear function. Hell, in physics, neural networks are often regarded as just fitting, with many parameters, a really complex function we don’t have the mathematical form of (so the reverse of what I explained in this paragraph).
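As a toy illustration of “it is just a function you can write down”, here is a small Python sketch. Everything in it is hypothetical: the weights are hand-picked, the network has two inputs and two hidden units, and the “polynomial” is simply the tanh activation replaced by its third-order Taylor expansion around zero. It is meant to show the idea and its limits, not a practical method.

```python
import math

# Hypothetical hand-picked weights for a tiny 2-2-1 network (illustration only).
W1 = [[0.5, -0.3], [0.2, 0.4]]
B1 = [0.1, -0.1]
W2 = [0.7, -0.6]
B2 = 0.05

def pre_activations(x):
    """Linear part: W1 @ x + B1."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, B1)]

def network(x):
    """The network as an explicit function: W2 . tanh(W1 x + B1) + B2."""
    hidden = [math.tanh(z) for z in pre_activations(x)]
    return sum(w * h for w, h in zip(W2, hidden)) + B2

def tanh_taylor3(z):
    """Taylor expansion of tanh around 0, truncated after the cubic term."""
    return z - z ** 3 / 3.0

def polynomial_surrogate(x):
    """Same network with tanh swapped for its truncated Taylor series,
    which makes the whole thing a multivariate polynomial in x."""
    hidden = [tanh_taylor3(z) for z in pre_activations(x)]
    return sum(w * h for w, h in zip(W2, hidden)) + B2

if __name__ == "__main__":
    for x in ([0.1, -0.2], [3.0, -3.0]):
        print(x, network(x), polynomial_surrogate(x))
```

Near the origin the polynomial tracks the network closely; for large inputs the truncated series blows up while tanh saturates, so the two disagree badly. Discarding terms only works locally, which is one reason “or something else entirely” is probably the more realistic route.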
And neural networks can be evolved, which is their biggest strength. I do expect that predicting-the-next-token algorithms can actually be much better than GPT-4, by the same analogy that Yudkowsky uses for why designed nanotech is probably much better than natural nanotech: the learned algorithms must be evolvable, and so they sit in a much shallower “loss potential well” than designed algorithms could.
And it seems to me that this reverse engineering process is what interpretability is all about. Or at least what the Holy Grail of interpretability is.
Now, as I’ve written down in my assumptions, I don’t know if any of the learned cognition algorithms can be written down efficiently enough to have an edge on NNs:
[I assume that] algorithms interpreted from matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication in GPUs is actually pretty damn well optimized
Maybe I should write a sequel to this post showing all of these intuitions and motivations for why NN → algorithms is a possibility.
I hope I made some sense, and I didn’t just ramble nonsense 😁.
Well, tools like Pythia help us peer inside the NN and help us reason about how things work. The same tools can help the AGI reason about itself. Or the AGI develops its own better tools. What I am talking about is an AGI doing what the interpretability researchers are doing now (or what OpenAI is trying to do with GPT-4 interpreting GPT-2).
It doesn’t matter how, and I don’t know how; I just wanted to point out the simple path to algorithmic foom even if we start with a NN.
Disclaimer: These are all hard questions and points whose true answers I don’t know; these are just my views, what I have understood up to now. I haven’t studied expected utility maximisers exactly because I don’t expect the abstraction to be useful for the kind of AGI we are going to be making.
There’s a huge gulf between agentic systems and “zombie-agentic” systems (that act like agents with goals, but have no explicit internal representation of those goals)
I feel the same, but I would say that it’s the “real-agentic” system (or a close approximation of it) that needs God-level knowledge of cognitive systems (which is why orthodox alignment, building the whole mind from theory, is really hard). An evolved system like us or like GPT, IMO, seems closer to a “zombie-agentic” system.
I feel the key thing to understand each other might be coherence, and how coherence can vary from introspection, but I am not knowledgeable enough to delve into this right now.
How do you define the goal (or utility function) of an agent? Is it something that actually happens when universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and agent’s shortcomings)?
The view in my mind that makes sense is that a utility function is an abstraction that you can put on top of basically anything if you wish. It’s a hat to describe a system that does things in the most general way. The framework is borrowed from economics, where human behaviour is modelled with more or less complicated utility functions, but whether or not there is an internal representation is mostly irrelevant. And, again, I don’t expect a DL system to display anything remotely close to a “goal circuit”, but we can still describe such systems as having a utility function and as being maximisers (with non-infinite cognitive power) of that UF. But the UF, on our part, would be just a guess. I don’t expect us to crack that with interpretability of neural networks learned by gradient descent.
What I meant to articulate was: the utility function and expected utility maximiser is a great framework to think about intelligent agents, but it’s a theory put on top of the system; it doesn’t need to be internal. In fact that system is incomputable (you need a hypercomputer to make the right decision).
I feel the exact opposite! Creating something that seems to maximise something without having a clear idea of what its goal is seems really natural IMO. You said it yourself, GPT “wants” to predict the correct probability distribution of the next token, but there is probably not a thing inside actively maximising for that; instead it’s very likely to be a bunch of weird heuristics that were selected by the training method because they work.
If you instead meant that GPT is “just an algorithm” I feel we disagree here as I am pretty sure that I am just an algorithm myself.
Look at us! We can clearly model a single human as having a utility function (ok, maybe given their limited intelligence it’s actually hard) but we don’t know what our utility function actually is. I think Rob Miles made a video about that iirc.
My understanding is that the utility function and expected utility maximiser is basically the theoretical pinnacle of intelligence! Not your standard human or GPT or near-future AGI. We are also quite myopic (and whatever near-future AGI we make will also be myopic at first).
Is it possible to have a system that just “actively try to make paperclips no matter what” when it runs, but it doesn’t reflect it in its reasoning and planning?
I’d say that it can reflect on its reasoning and planning, but it just plasters the universe with tiny molecular spirals because it just likes that more than keeping humans alive.
I think this tweet by EY https://twitter.com/ESYudkowsky/status/1654141290945331200 shows what I mean. We don’t know what the ultimate dog is; we don’t know what we would have created if we did have the capabilities to make a dog-like thing from scratch. We didn’t create ice cream because it maximises our utility function. We just stumbled on its invention and found that it is really yummy.
But I really don’t want to venture into this. I am writing something similar to these points in order to deconfuse myself; the divide between agents in the theoretical sense and real systems is not exactly clear to me.
So to keep the discussion on-topic, what I think is:
interpretability to “correct” the system: good, but be careful pls
interpretability for capabilities: bad
You are basically discussing these two assumptions I made (under “Algorithmic foom (k>1) is possible”), right?
The intelligence ceiling is much higher than what we can achieve with just DL
The ceiling of hard-coded intelligence that runs on near-future hardware isn’t particularly limited by the hardware itself: algorithms interpreted from matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication in GPUs is actually pretty damn well optimized
Algorithms are easier to reason about than staring at NNs weights
But maybe the third assumption is the non-obvious one?
For the sake of discourse:
I still question [...] “AGI RSIs toward non-DL architecture that results in it maximizing some utility function in a pre-DL manner” being more dangerous than simply “AGI RSIs”
My initial motive to write “Foom by change of paradigm” was to show another previously unstated way RSI could happen. Just to show how RSI could happen, because if your frame of mind is “only compute can create intelligence” foom is indeed unfeasible… but if it is possible to make the paradigm jump then you might just be blind to this path and fuck up royally, as the French say.
One key thing that I find interesting is also that this paradigm shift does circumvent the “AIs not creating other AIs because of alignment difficulties” argument.
I think that this argument is at odds with the universal learning hypothesis...
I am afraid I am not familiar with this hypothesis, and Google (or ChatGPT) isn’t helpful. What do you mean by this and by modularity?
P.S. I have now realized that the opposite of a black-box is indeed a glass-box and not a white-box lol. You can’t see inside a box of any colour unless it is clear, like glass!
I hope we can prevent the AGI from just training a twin (or just copying itself and calling that a twin) and studying that. In my scenario I took as a given that we do have the AGI under some level of control:
If no alignment scheme is in place, this type of foom is probably a problem we would be too dead to worry about.
I guess when I say “No lab should be allowed to have the AI reflect on itself” I do not mean only the running copy of the AGI, but any copy of the AGI.
Wouldn’t it at least solve corrigibility by making it possible to detect formation of undesirable end-goals? I think even GPT-4 can classify textual interpretation of an end-goal on a basis of its general desirability for humans.
I really don’t expect “goals” to be explicitly written down in the network. There will very likely not be a thing that says “I want to predict the next token” or “I want to make paperclips” or even a utility function of that. My mental image of goals is that they are put “on top” of the model/mind/agent/person. Whatever they seem to pursue, independently of their explicit reasoning.
Anyway, detecting goals, detecting deceit, and detecting hidden knowledge of the system are good things to have. Interpretability of those things is needed. But interpretability cuts both ways, and with a fully interpretable AGI, foom seems to be a great danger. That’s what I wanted to point out. With a fast intelligence explosion (one that doesn’t need slow retraining or multiple algorithmic breakthroughs) the capabilities will explode alongside, while alignment won’t.
It seem to need another assumption, namely that the AGI have sufficient control of its internal state and knowledge of the detection network to be able to bring itself into a state that produces interpretation that trips detection network, while also allowing the AGI to advance its agenda.
It is not clear to me what you are referring to here. Do you think we will have detection networks? Detection for what? Deceit? We might literally have the AGI look inside for a purpose (like in the new OpenAI paper). I hope we have something that tells us if it wants to self-modify, but if nobody points out the danger of foom, we likely won’t have that.
I do feel that just having humans in the loop is not a complete solution, though. Even if humans look at the process, algorithmic foom could be really, really fast. Especially if it is purposely being used to augment the AGI’s abilities.
Without a strong reason to believe our alignment scheme will be strong enough to support the ability gain (or that the AGI won’t recklessly and arbitrarily improve itself), I would avoid letting the AGI look at itself altogether. Just make it illegal for AGI labs to use AGIs to look at themselves. Just don’t do it.
Not today. But probably soon enough. We still need the interpretability for safety, but we don’t know how much of that work will generalize to capabilities.
I would have loved it if the paper wasn’t using GPT but something narrower to automate interpretability, but alas. To make sure I am not misunderstood: I think it’s good work that we need, but it does point in a dangerous direction.
Cheers. Your comments actually allowed me to fully realize where the danger lies and expand a little on the consequences.
Thanks again for the feedback
Basically I expect the neural networks to be a crude approximation of a hard-coded cognition algorithm. Not the other way around.