AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
I don’t think so. You only need one alien civilisation in our light cone to have preferences about the shape of the universal wave function rather than their own subjective experience for our light cone to get eaten. E.g. a paperclip maximiser might want to do this.
Also, the Fermi paradox isn’t really a thing.
No, because getting shot has a lot of outcomes that do not kill you but do cripple you. Vacuum decay should tend to have extremely few of those. It’s also instant, alleviating any lingering concerns about identity one might have in a setup where death is slow and gradual. It’s also synchronised to split off everyone hit by it into the same branch, whereas, say, a very high-yield bomb wired to a random number generator that uses atmospheric noise would split you off into a branch away from your friends.[1]
I’m not unconcerned about vacuum decay, mind you. It’s not like quantum immortality is all confirmed and the implications worked out well in math.[2]
They’re still there for you of course, but you aren’t there for most of them. Because in the majority of their anticipated experience, you explode.
Sometimes I think about the potential engineering applications of quantum immortality in a mature civilisation for fun. Controlled, synchronised civilisation-wide suicide seems like a neat way to transform many engineering problems into measurement problems.
Since I didn’t see it brought up on a skim: One reason some of my physicist friends and I aren’t that concerned about vacuum decay is many-worlds. Since the decay is triggered by quantum tunneling and propagates at light speed, it’d be wiping out Earth in one wavefunction branch that has amplitude roughly equal to the amplitude of the tunneling, while the decay just never happens in the other branches. Since we can’t experience being dead, this wouldn’t really affect our anticipated future experiences in any way. The vacuum would just never decay from our perspective.
So, if the vacuum were confirmed to be very likely meta-stable, and the projected base rate of collapses was confirmed to be high enough that it ought to have happened a lot already, we’d have accidentally stumbled into a natural and extremely clean experimental setup for testing quantum immortality.
I disagreed with Gwern at first. I’m increasingly forced to admit there’s something like bipolar going on here
What changed your mind? I don’t know any details about the diagnostic criteria for bipolar besides those you and Gwern brought up in that debate. But looking at the points you made back then, it’s unclear to me which of them you’d consider to be refuted or weakened now.
Musk’s ordinary behavior—intense, risk-seeking, hard-working, grandiose, emotional—does resemble symptoms of hypomania (full mania would usually involve psychosis, and even at his weirdest Musk doesn’t meet the clinical definition for this).
But hypomania is usually temporary and rare. A typical person with bipolar disorder might have hypomania for a week or two, once every few years. Musk is always like this. Bipolar disorder usually starts in one’s teens. But Musk was like this even as a child.
....
His low periods might meet criteria for a mixed episode. But a bipolar disorder that starts in childhood, continues all the time, has no frank mania, and has only mixed episodes instead of depression—doesn’t really seem like bipolar disorder to me. I’m not claiming there’s nothing weird about him, or that he doesn’t have extreme mood swings. I’m just saying it is not exactly the kind of weirdness and mood swings I usually associate with bipolar.
...
I notice the non-psychiatrists (including very smart people I usually trust) lining up on one side, and the psychiatrists on the other. I think this is because Musk fits a lot of the explicit verbally described symptoms of the condition, but doesn’t resemble real bipolar patients.
...
This isn’t how I expect bipolar to work. There is no “switch flipping” (except very occasionally when a manic episode follows directly after a depressive one). A patient will be depressed for weeks or months, then gradually come out of it, and after weeks or months of coming out of it, get back to normal. Being “moody” in the sense of having mood swings is kind of the opposite of bipolar; I would associate it more with borderline or PTSD.
Based on my understanding of what you are doing, the statement in the OP that $\lambda$ in your setting is “sort of” K-complexity is a bit misleading?
Yes, I guess it is. In my (weak) defence, I did put a ‘(sort of)’ in front of that.
In my head, the relationship between the learning coefficient and the K-complexity here seems very similar-ish to the relationship between the K-complexities of a hypothesis expressed on two different UTMs.
If we have a UTM $U_1$ and a different UTM $U_2$, we know that $K_{U_2}(p) \leq K_{U_1}(p) + c_{U_1 \to U_2}$, because if nothing else we can simulate UTM $U_1$ on UTM $U_2$ and compute $p$ on the simulated $U_1$. But in real life, we’d usually expect the actual shortest program that implements $p$ on $U_2$ to not involve jumping through hoops like this.
In the case of translating between a UTM and a different sort of Turing-complete model of computation, namely a recurrent neural network[1], I was expecting a similar sort of dynamic: If nothing else, we can always implement $p$ on the NN by simulating a UTM and running $p$ on that simulated UTM. So the lowest-LLC parameter configuration that implements $p$ on the NN has to have an LLC that is as small as or smaller than the LLC of a parameter configuration that implements $p$ through this simulation route. Or that was the intuition I had starting out, anyway.
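In symbols, the hoped-for analogue is something like this (a sketch; $w_{\text{sim}}$ denotes a parameter configuration that implements $p$ on the NN by simulating the UTM, and $\hat\lambda$ is the LLC):

$$\min_{w \,:\, w \text{ implements } p} \hat\lambda(w) \;\leq\; \hat\lambda(w_{\text{sim}}),$$

mirroring $K_{U_2}(p) \leq K_{U_1}(p) + c_{U_1 \to U_2}$ in the UTM-to-UTM case.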
If I understand correctly you are probably doing something like:
Seems broadly right to me except:
Third bullet point: I don’t know what you mean by a “smooth relaxation” precisely. So while this sounds broadly correct to me as a description of what I do, I can’t say for sure.
Sixth bullet point: You forgot the offset term for simulating the UTM on the transformer. Also, I think I’d get a constant prefactor before $K(p)$. Even if I’m right that the prefactor I have right now could be improved, I’d still expect at least a 2 here.
I’d caution that the exact relation of all this to the learning coefficient and the LLC is the part of this story I’m still the least confident about at the moment. As the intro said:
This post is my current early-stage sketch of the proof idea. Don’t take it too seriously yet. I’m writing this out mostly to organise my own thoughts.
I’ve since gotten proof sketches for most of the parts here, including the upper bound on the LLC, so I am a bit more confident now. But they’re still hasty scrawlings.
you are treating the iid case
I am not sure whether I am? I’m a bit unclear on what you mean by iid in this context exactly. The setup does not seem to me to require different inputs to be independent of each other. It does assume that each label is a function of its corresponding input rather than some other input. So, label $y_i$ can depend on input $x_i$, but it can only depend on another input $x_j$ in a manner mediated by $x_i$. In other words, the joint probability distribution over inputs can be anything, but the labels must be iid conditioned on their inputs. I think. Is that what you meant?
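In symbols, the assumption as I understand it is just (with $x_{1:n}$ the inputs and $y_{1:n}$ the labels; notation picked for illustration):

$$P(x_{1:n},\, y_{1:n}) \;=\; P(x_{1:n})\, \prod_{i=1}^{n} P(y_i \mid x_i),$$

so the inputs can be correlated however they like, while each label only sees its own input.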
From your message it seems like you think the global learning coefficient might be lower than $K(p)$, but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short that the LLC can be lower than “the number of bits used” when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).
I have been brooding over schemes to lower the bound I sketched above using activation error-correction blocks. Still unclear to me at this stage whether this will work or not. I’d say this and the workability of other schemes to get rid of the prefactor on $K(p)$ in the bound are probably the biggest sources of uncertainty about this at the moment.
If schemes like this work, the story here probably ends up as something more like ‘$\lambda$ is related to the number of bits in the parameters we need to fix to implement $p$ on the transformer.’
In that case, you’d be right, and the LLC would be lower, because in the continuum limit we can store an arbitrary number of bits in a single parameter.
I think I went into this kind of expecting that to be true. Then I got surprised when using less than one effective parameter per bit of storage in the construction turned out to be less straightforward than I’d thought once I actually engaged with the details. Now, I don’t know what I’ll end up finding.
Well, transformers are not actually Turing complete in real life where parameters aren’t real numbers, because if you want an unbounded context window to simulate unbounded tape, you eventually run out of space for positional encodings. But the number of distinct memory states they can represent does grow exponentially with the residual stream width, which seems good enough to me. Real computers don’t have infinite memory either.
Kind of? I’d say the big differences are:
Experts are pre-wired to have a certain size; components can vary in size, from a tiny query-key lookup for a single fact to large modules.
IIRC, MoE networks use a gating function to decide which experts to query. If you ignored this gating and just used all the experts, I think that’d break the model. In contrast, you can use all APD components on a forward pass if you want. Most of them just won’t affect the result much.
MoE experts don’t completely ignore ‘simplicity’ as we define it in the paper though. A single expert is simpler than the whole MoE network in that it has lower rank: fewer numbers are required to describe its state on any given forward pass.
Why would this be restricted to cyber attacks? If the CCP believed that ASI was possible, even if they didn’t believe in the alignment problem, the US developing an ASI would plausibly constitute an existential threat to them. It’d mean they lose the game of geopolitics completely and permanently. I don’t think they’d necessarily restrict themselves to covert sabotage in such a situation.
The possibility of stability through dynamics like mutually assured destruction has been where a lot of my remaining hope on the governance side has come from for a while now.
A big selling point of this for me is that it does not strictly require countries to believe both that ASI is possible and that the alignment problem is real. Just believing that ASI is possible is enough.
Because it’s actually not very important in the limit. The dimensionality of V is what matters. A 3-dimensional sphere in the loss landscape always takes up more of the prior than a 2-dimensional circle, no matter how large the area of the circle is and how small the volume of the sphere is.
In real life, parameters are finite precision floats, and so this tends to work out to an exponential rather than infinite size advantage. So constant prefactors can matter in principle. But they have to be really really big.
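As a toy numerical illustration of the point (just a sketch, with arbitrary region sizes and bit widths): a big 2-dimensional disc versus a tiny 3-dimensional ball, counted in terms of how many finite-precision parameter settings land inside each.

```python
# Toy illustration: at b-bit precision, each extra dimension of a region in
# parameter space multiplies the number of representable parameter settings
# inside it by roughly 2^b per unit length. So a tiny 3D ball eventually beats
# a huge 2D disc, but only once that factor exceeds the disc's constant-factor
# size advantage.
import math

def lattice_point_estimate(measure: float, dim: int, bits: int) -> float:
    """Approximate count of bits-precision grid points inside a region of the
    given dim-dimensional measure (area, volume, ...), assuming 2^bits
    representable values per unit length along each axis."""
    return measure * (2 ** bits) ** dim

disc_area = math.pi * 0.5 ** 2               # large 2D disc, radius 0.5
ball_volume = 4 / 3 * math.pi * 0.05 ** 3    # small 3D ball, radius 0.05

for bits in (4, 8, 16, 32):
    disc = lattice_point_estimate(disc_area, 2, bits)
    ball = lattice_point_estimate(ball_volume, 3, bits)
    print(f"{bits:>2}-bit parameters: disc ~{disc:.3g} settings, "
          f"ball ~{ball:.3g} settings, ball/disc ~{ball / disc:.3g}")
```

At low precision the disc’s large area still wins; at higher precision the extra dimension dominates, which is the ‘exponential rather than infinite size advantage’ above.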
Yes. I think this may apply to basically all somewhat general minds.
Doesn’t exist.[1] If $f$ is finite, you can insert AIT-style inequalities into the posterior to get bounds like the one I wrote above. This is neat if you e.g. have a very large number of datapoints.
If $f$ is infinite, you probably want to expand in $n$ instead. I haven’t done that yet, but I expect to get a bound that looks a lot like the standard free energy formula, with the K-complexity terms in the bound I wrote above showing up where the learning coefficient would usually be. The $f \ln 2$ prefactor probably gets swapped out for a $\log n$ (sketched side by side below).
It’d still be an upper bound, not an equality, just as in AIT. The learning coefficient can still be smaller than this. This makes sense to me. There might be less complicated ways for the transformer to make an efficient prediction than simulating a UTM and running some program on it.
Except for the implicit $n$ dependence in $D_n$ and $\hat{D}_n$, since those are the KL-divergences summed over datapoints.
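Side by side, the comparison I have in mind looks roughly like this (a rough sketch, with $D_n$ the KL-divergence to the Bayesian posterior predictive summed over datapoints, $\lambda$ the learning coefficient, and the right-hand side the kind of finite-precision bound referred to above):

$$\underbrace{\mathbb{E}[D_n] \;\approx\; \lambda \log n}_{\text{continuous parameters, } n \to \infty} \qquad \text{vs.} \qquad \underbrace{D_n \;\leq\; f \ln 2 \left(W_{\text{UTM}} + d_{\text{res}}\, K(p)\right)}_{f\text{-bit parameters}},$$

so the K-complexity term would take the slot the learning coefficient usually occupies, with the $f \ln 2$ prefactor traded for a $\log n$.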
You either think of the NN weights as a countable set (by e.g. truncating precision “as in real life”), in which case you get something like a $-\log \varphi(w)$ term from the prior $\varphi$, but this is sort of weak sauce: you get this for any prior you want to put over your discrete set of NN weights, no implied connection to K-complexity unless you put one in by hand by taking $\varphi(w) \propto 2^{-K(w)}$.
No, you don’t need to put it in by hand. A uniform prior over NN weights does the job.[1]
The trick is that a transformer run in recurrent mode can
Simulate a (time and space bounded) UTM in a few transformer blocks
Use the other transformer blocks to store program code to feed that UTM as input.
A uniform prior over neural network parameters then effectively implies a uniform prior over programs to run on the simulated UTM, modulo the bit specification cost of the UTM simulator and the storage setup. Because for every bit of program code we don’t need to store, we get extra degrees of freedom in the weights.
Since induction with a uniform prior on the input strings to a plain monotone UTM effectively gets us a weighting of hypotheses that’s exponential in K-complexity, we’ll get an error bound with a term proportional to $K(p)$, plus an offset term for specifying the UTM and storage in the transformer weights.
For the sake of concreteness: If I partially adapted your notation, and went to the special case where the data-generating process is exactly realisable in the weights of the transformer[2], I’d currently seem to get a bound roughly of the form $D_n \leq f \ln 2 \left(W_{\text{UTM}} + d_{\text{res}}\, K(p)\right)$ up to small correction terms,[3] where $D_n$ is the KL-divergence between the data-generating process and the Bayesian posterior predictive, summed over the $n$ datapoints.
Here, $f$ is the number of bits per neural network parameter[4], $W_{\text{UTM}}$ is the number of parameters needed to implement the UTM on the recurrent transformer architecture, $K(p)$ is the K-complexity of the data-generating program $p$ on the UTM in bits, and $d_{\text{res}}$ is the width of the residual stream.
The $d_{\text{res}}$ prefactor on $K(p)$ is there because my current construction is stupidly inefficient at storing program code in the weights. I think it ought to be possible to do better, and get this down to a 1. Don’t quote me on that though, I don’t have a proof yet.
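To sketch the counting behind a bound of this shape (rough, and the constants are exactly the part I’m least sure about): say the transformer has $W$ parameters of $f$ bits each, so a uniform prior gives every discrete parameter setting prior mass $2^{-fW}$. Implementing $p$ via the simulated UTM means fixing roughly $W_{\text{UTM}}$ parameters for the simulator plus some number $W_{\text{code}}(p)$ of parameters for the stored program code, with the remaining parameters free. The set of parameter settings that implement $p$ this way then has prior mass at least about

$$2^{f\left(W - W_{\text{UTM}} - W_{\text{code}}(p)\right)} \cdot 2^{-fW} \;=\; 2^{-f\left(W_{\text{UTM}} + W_{\text{code}}(p)\right)},$$

and pushing a prior mass like that through the usual Bayesian mixture argument gives a bound with an $f \ln 2 \left(W_{\text{UTM}} + W_{\text{code}}(p)\right)$ term. The prefactor on $K(p)$ in the bound above is then just the number of parameters of code storage the construction spends per bit of program.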
If we don’t assume realisability, we can instead take any ‘efficient predictor’ program $q$ that is realisable on the transformer, and get the bound
$$D_n \;\leq\; \hat{D}_n(q) + f \ln 2 \left(W_{\text{UTM}} + d_{\text{res}}\, K(q)\right),$$
where $\hat{D}_n(q)$ is the KL-divergence between the data-generating process and the predictions of $q$, summed over the $n$ datapoints.
So to summarise
K-complexity usually enters the theory via a choice of prior, and in continuous model classes priors show up in the constant order terms of asymptotic expansions in $n$.
The result here is exactly that we don’t need to put in the K-complexity[5] via choice of prior. If we’re using a recurrent neural network, the K-complexity is in the prior already, just as it is on a plain monotone UTM. The architecture itself is what implements the bias toward simplicity.
Note also that in the case of continuous parameters, so bits per float going to infinity, the K-complexity terms in the bound do not become constant order terms, because they have $f$ as a prefactor. This is one way to start seeing that the K-complexity and the learning coefficient are pretty directly related quantities in the setting of recurrent neural networks.
I expect a Gaussian prior or anything else of the sort probably works as well, and yields a nigh-identical bound. But I haven’t shown that yet.[6]
My actual bound doesn’t need that assumption. Getting rid of the realisability assumption is what the effective predictor stuff is all about.
These correction terms become increasingly irrelevant as float precision gets larger. Basically, I’m using large negative biases to zero out storage neurons that are not needed. In the continuum limit, this would make the weights connecting to those neurons degenerate, and we could integrate them out of the measure. But since we’re in the discrete setting, we have to keep track of the fact that very large magnitudes of the weights that overwhelm the negative biases and switch the neuron on again aren’t allowed. This makes our volume of allowed parameter configurations just a little bit smaller.
So, $f = 8$ for 8-bit floats, $f = 16$ for 16-bit floats, etc.
Defined relative to a time and space bounded universal Turing machine.
EDIT: As in I haven’t shown it in the case of finite float precision NN parameters yet. It of course straightforwardly follows in the SLT setting where NN parameters are real numbers and we consider the limit of number of datapoints going to infinity. The shape of the prior can’t matter much there, as you say.
The intellectual maturation between ages 18 and 20 is profound
This is the first time I’ve heard this claim. Any background/cites I should look into for this?
‘Local volume’ should also give a kind of upper bound on the LLC defined at finite noise though, right? Since as I understand it, what you’re referring to as the volume of a behavioral region here is the same thing we define via the behavioural LLC at finite noise scale in this paper? And that’s always going to be bigger than or equal to the LLC taken at the same point at the same finite noise scale.
How does the performance of this compare to the SGLD sampling approach used by Timaeus, or to bounding the volume by just calculating the low-lying parts of the Hessian eigenspectrum? Or, to go even hackier and cheaper, just guessing the Hessian eigenspectrum with a K-FAC approximation, by doing a PCA of the activations and gradients at every layer and counting the zero eigenvalues of those?
(For all of those approaches, I’d use the loss landscape/Hessian of the behavioural loss defined in section 2.2 of that last link, since you want to measure the volume of a behavioural region.)
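For concreteness, here’s a minimal toy sketch of that last, hackiest option (the network, data, and loss here are arbitrary stand-ins; in practice you’d do this on the behavioural loss mentioned above): approximate each layer’s Fisher/Hessian block K-FAC-style as the Kronecker product of the layer-input covariance and the output-gradient covariance, then count near-zero eigenvalues as a crude proxy for locally flat directions.

```python
# Toy K-FAC-style curvature estimate for a 1-hidden-layer MLP with manual
# backprop. Each layer's Fisher/Hessian block is approximated as A ⊗ G,
# where A is the covariance of the layer's inputs and G the covariance of
# the gradients at its (pre-activation) outputs.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 512, 20, 64, 10

# Stand-in data, weights, and squared-error loss.
X = rng.normal(size=(n, d_in))
W1 = rng.normal(scale=d_in ** -0.5, size=(d_in, d_hid))
W2 = rng.normal(scale=d_hid ** -0.5, size=(d_hid, d_out))
Y = np.tanh(X @ W1) @ W2 + 0.1 * rng.normal(size=(n, d_out))

# Forward pass.
h = np.tanh(X @ W1)       # hidden activations
out = h @ W2
err = out - Y             # d(loss)/d(out) for 0.5 * ||out - Y||^2

# Backward pass: gradients w.r.t. each layer's pre-activation outputs.
g_out = err
g_hid = (err @ W2.T) * (1 - h ** 2)

def kfac_block_eigvals(acts, grads):
    """Eigenvalues of the K-FAC approximation A ⊗ G of one layer's curvature
    block. The eigenvalues of a Kronecker product are all pairwise products
    of the factors' eigenvalues."""
    A = acts.T @ acts / len(acts)     # input covariance (the 'PCA' part)
    G = grads.T @ grads / len(grads)  # output-gradient covariance
    return np.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(G)).ravel()

for name, acts, grads in [("layer 1", X, g_hid), ("layer 2", h, g_out)]:
    ev = kfac_block_eigvals(acts, grads)
    ev = ev / ev.max()
    n_flat = int(np.sum(ev < 1e-6))
    print(f"{name}: {ev.size} directions, ~{n_flat} with near-zero curvature")
```

Counting eigenvalues below some relative threshold then gives a very cheap (and very rough) stand-in for the number of flat directions of the behavioural region.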
There is a reason that paragraph says
I claim one reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads
rather than
I claim the reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads
My claim here is that good mech interp helps you be less confused about outer alignment[1], not that what I’ve sketched here suffices to solve outer alignment.
Outer alignment in the wider sense of ‘the problem of figuring out what target to point the AI at’.
My theory of impact for interpretability:
I’ve been meaning to write this out properly for almost three years now. Clearly, it’s not going to happen. So, you’re getting an improper quick and hacky version instead.
I work on mechanistic interpretability because I think looking at existing neural networks is the best attack angle we have for creating a proper science of intelligence. I think a good basic grasp of this science is a prerequisite for most of the important research we need to do to align a superintelligence to even get properly started. I view the kind of research I do as somewhat close in kind to what John Wentworth does.
For example, one problem we have in alignment is that even if we had some way to robustly point a superintelligence at a specific target, we wouldn’t know what to point it at. E.g. famously, we don’t know how to write “make me a copy of a strawberry and don’t destroy the world while you do it” in math. Why don’t we know how to do that?
I claim one reason we don’t know how to do that is that ‘strawberry’ and ‘not destroying something’ are fuzzy abstract concepts that live in our heads, and we don’t know what those kinds of fuzzy abstract concepts correspond to in math or code. But GPT-4 clearly understands what a ‘strawberry’ is, at least in some sense. If we understood GPT-4 well enough to not be confused about how it can correctly answer questions about strawberries, maybe we also wouldn’t be quite so confused anymore about what fuzzy abstractions like ‘strawberry’ correspond to in math or code.
Another problem we have in alignment is that we don’t know how to robustly aim a superintelligence at a specific target. To do that at all, it seems like you might first want to have some notion of what ‘goals’ or ‘desires’ correspond to mechanistically in real agentic-ish minds. I don’t expect this to be as easy as looking for the ‘goal circuits’ in Claude 3.7. My guess is that by default, dumb minds like humans and today’s AIs are too incoherent to have their desires correspond directly to a clear, salient mechanistic structure we can just look at. Instead, I think mapping ‘goals’ and ‘desires’ in the behavioural sense back to the mechanistic properties of the model that cause them might be a whole thing. Understanding the basic mechanisms of the model in isolation mostly only shows you what happens on a single forward pass, while ‘goals’ seem like they’d be more of a many-forward-pass phenomenon. So we might have to tackle a whole second chapter of interpretability there before we get to be much less confused about what goals are.
But this seems like a problem you can only effectively attack after you’ve figured out much more basic things about how minds do reasoning moment-to-moment. Understanding how Claude 3.7 thinks about strawberries on a single forward pass may not be sufficient to understand much about the way its thinking evolves over many forward passes. Famously, just because you know how a program works and can see every function in it with helpful comments attached doesn’t yet mean you can predict much about what the program will do if you run it for a year. But trying to predict what the program will do if you run it for a year without first understanding what the functions in it even do seems almost hopeless. So, we should probably figure out how thinking about strawberries works first.
To solve these problems, we don’t need an exact blueprint of all the variables in GPT-4 and their role in the computation. For example, I’d guess that a lot of the bits in the weights of GPT-4 are just taken up by database entries, memorised bigrams and trigrams and stuff like that. We definitely need to figure out how to decompile these things out of the weights. But after we’ve done that and looked at a couple of examples to understand the general pattern of what’s in there, most of it will probably no longer be very relevant for resolving our basic confusion about how GPT-4 can answer questions about strawberries. We do need to understand how the model’s cognition interfaces with its stored knowledge about the world. But we don’t need to know most of the details of that world knowledge. Instead, what we really need to understand about GPT-4 are the parts of it that aren’t just trigrams and databases and addition algorithms and basic induction heads and other stuff we already know how to do.
AI engineers in the year 2006 knew how to write a big database, and they knew how to do a vector search. But they didn’t know how to write programs that could talk, or understand what strawberries are, in any meaningful sense. GPT-4 can talk, and it clearly understands what a strawberry is in some meaningful sense. So something is going on in GPT-4 that AI engineers in the year 2006 didn’t already know about. That is what we need to understand if we want to know how it can do basic abstract reasoning.
People argue a lot about whether RLHF or Constitutional AI or whatnot would work to align a superintelligence. I think those arguments would be much more productive and comprehensible to outsiders[1] if the arguers agreed on what exactly those techniques actually do to the insides of current models. Maybe then, those discussions wouldn’t get stuck on debating philosophy so much.
And sure, yes, in the shorter term, understanding how models work can also help make techniques that more robustly detect whether a model is deceiving you in some way, or whatever.
Compared to the magnitude of the task in front of us, we haven’t gotten much done yet. Though the total number of smart people hours sunk into this is also still very small, by the standards of a normal scientific field. I think we’re doing very well on insights gained per smart person hour invested, compared to a normal field, and very badly on finishing up before our deadline.
But at least, poking at things that confused me about current deep learning systems has already helped me become somewhat less confused about how minds in general could work. I used to have no idea how any general reasoner in the real world could tractably favour simple hypotheses over complex ones, given that calculating the minimum description length of a hypothesis is famously very computationally difficult. Now, I’m not so confused about that anymore.
I hope that as we understand the neural networks in front of us more, we’ll get more general insights like that, insights that say something about how most computationally efficient minds may work, not just our current neural networks. If we manage to get enough insights like this, I think they could form a science of minds on the back of which we could build a science of alignment. And then maybe we could do something as complicated and precise as aligning a superintelligence on the first try.
The LIGO gravitational wave detector probably had to work right on the first build, or they’d have wasted a billion dollars. It’s not like they could’ve built a smaller detector first to test the idea on a real gravitational wave. So, getting complicated things in a new domain right on the first critical try does seem doable for humans, if we understand the subject matter to the level we understand things like general relativity and laser physics. That kind of understanding is what I aim to get for minds.
At present, it doesn’t seem to me like we’ll have time to finish that project. So, I think humanity should probably try to buy more time somehow.
Like, say, politicians. Or natsec people.
This explanation via counting argument doesn’t seem very satisfying to me, because hypotheses expressed in natural language strike me as highly expressive and thus likely highly degenerate, like programs on a (bounded) universal Turing machine, rather than shallow, inexpressive, and forming a non-degenerate basis for function space, like polynomials.
The space of programs automatically has exponentially more points corresponding to equivalent implementations of short, simple programs. So, if you choose a random program from the set of programs that both fit in your head and can explain the data, you’re automatically implementing a Solomonoff-style simplicity prior. Meaning that gathering all data first and inventing a random explanation to fit to it afterwards ought to work perfectly fine. The explanation would automatically not be overfitted and generalise well to new data by default.
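To make the counting concrete, here’s a toy sketch (the mini-language and the ‘everything after HALT is ignored’ rule are stand-ins for the real sources of degeneracy in program space):

```python
# Toy illustration of the degeneracy argument: in a space of fixed-length
# programs where everything after the first HALT is ignored, a behaviour whose
# shortest implementation is k instructions long has exponentially many
# equivalent implementations among length-L programs. Drawing a random program
# that "fits" therefore already favours simple behaviours exponentially.
from collections import Counter
from itertools import product

ALPHABET = ["A", "B", "H"]   # "H" acts as HALT; "A"/"B" are dummy instructions
L = 8                        # total program length

def behaviour(program):
    """What the program 'does': the instructions executed before HALT."""
    executed = []
    for instr in program:
        if instr == "H":
            break
        executed.append(instr)
    return "".join(executed)

counts = Counter(behaviour(p) for p in product(ALPHABET, repeat=L))

# Group behaviours by their length (a stand-in for minimum description length)
# and look at how many length-L programs implement a typical one.
by_len = {}
for beh, c in counts.items():
    by_len.setdefault(len(beh), []).append(c)

for k in sorted(by_len):
    cs = by_len[k]
    print(f"behaviour length {k}: {len(cs):>5} behaviours, "
          f"each implemented by ~{sum(cs) / len(cs):,.0f} of the "
          f"{len(ALPHABET) ** L:,} programs")
```

With this alphabet and length, a behaviour that only needs $k < 8$ instructions is implemented by $3^{7-k}$ of the $3^8$ programs, so every instruction saved multiplies the number of implementations by 3: the Solomonoff-style weighting falls out of a uniform draw.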
I think whatever is going on here is instead related to the dynamics of how we invent hypotheses. It doesn’t seem to just amount to picking random hypotheses until we find one that explains the data well. What my head is doing seems much more like some kind of local search. I start at some point in hypothesis space, and use heuristics, logical reasoning, and trial and error to iteratively make conceptual changes to my idea, moving step by step through some kind of hypothesis landscape where distance is determined by conceptual similarity of some sort.
Since this is a local search, rather than an unbiased global draw, it can have all kinds of weird pathologies and failure modes, depending on the idiosyncrasies of the update rules it uses. I’d guess that the human tendency to overfit when presented with all the data from the start is some failure mode of this kind. No idea what specifically is going wrong there though.
Yeah, the difference between what those papers show and what I need turned out to be a lot bigger than I thought. I ended up making my own construction instead.
This actually turned out to be the most time-consuming part of the whole proof. The other steps were about as straightforward as they looked.
There may be a sense in which amplitude is a finite resource. Decay your branch enough, and your future anticipated experience might come to be dominated by some alien with higher amplitude simulating you, or even just by your inner product with quantum noise in a more mainline branch of the wave function. At that point, you lose pretty much all ability to control your future anticipated experience. Which seems very bad. This is a barrier I ran into when thinking about ways to use quantum immortality to cheat heat death.