james.lucassen
A project I’ve been sitting on that I’m probably not going to get to for a while:
Improving on Automatic Circuit Discovery and Edge Attribution Patching by modifying them to use algorithms that can detect complete boolean circuits. As it stands, both effectively use wire-by-wire patching, which, when run on any nontrivial boolean circuit, can only detect small subgraphs.
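To make the failure mode concrete, here's a toy sketch (my own example, not taken from the ACDC/EAP papers): an OR gate with redundant inputs, where patching either wire on its own from the corrupted run shows zero effect, even though the two wires jointly determine the output.

```python
# Toy illustration of wire-by-wire patching on a redundant boolean circuit.
# (Hypothetical example for intuition, not code from ACDC/EAP.)

def circuit(a, b):
    """A one-gate 'model': output = a OR b."""
    return a or b

clean = (1, 1)      # clean run: output is 1
corrupted = (0, 0)  # corrupted run: output is 0
baseline = circuit(*clean)

# Patch each wire individually from the corrupted run into the clean run.
for wire in range(2):
    patched = list(clean)
    patched[wire] = corrupted[wire]
    print(f"patching wire {wire} alone: effect = {baseline - circuit(*patched)}")

# Patch both wires together.
print(f"patching both wires together: effect = {baseline - circuit(*corrupted)}")
```

Each single-wire patch shows an effect of 0, while the joint patch shows an effect of 1. A greedy wire-by-wire procedure on this gate would prune one or both edges rather than recover the full redundant pair, which is roughly the "small subgraph" behavior I mean.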
It’s a bit unclear how useful this will be, because:
not sure how useful I think mech interp is
not sure if this is where mech interp’s usefulness is bottlenecked
maybe attribution patching doesn’t work well when patching clean activations into a corrupted baseline, which would make this much slower
But I think it'll be a good project to bang out for the experience, and I'm curious how the results will compare to ACDC/EAP.
This project started out as an ARENA capstone in collaboration with Carl Guo.
Small addendum to this post: I think the threat model I describe here can be phrased as “I’m worried that unless a lot of effort goes into thinking about how to get AI goals to be reflectively stable, the default is suboptimality misalignment. And the AI probably uses a lot of the same machinery to figure out that it’s suboptimality misaligned as it uses to perform the tasks we need it to perform.”
it seems like there is significant low hanging fruit in better understanding how LLMs will deal with censorship
Yup, agree—the censorship method I proposed in this post is maximally crude and simple, but I think it’s very possible that the broader category of “ways to keep your AI from thinking destabilizing thoughts” will become an important part of the alignment/control toolbox.
What happens when you iteratively finetune on censored text? Do models forget the censored behavior?
I guess this would be effectively doing the Harry Potter Unlearning method, provided you put in the work to come up with a good enough blacklist that the remaining completions are suitably generic predictions. I'd be quite interested in looking into how long-horizon capabilities interact with unlearning. For pretty much the same reason I'm worried long-horizon competent LLMs will be able to work around naive censorship, I'm also worried they'll work around naive unlearning.
Evaluating Stability of Unreflective Alignment
Attempts at Forwarding Speed Priors
I think so. But I’d want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.
Ok this is going to be messy but let me try to convey my hunch for why randomization doesn’t seem very useful.
- Say I have an intervention that’s helpful, and has a baseline 1⁄4 probability. If I condition on this statement, I get 1 “unit of helpfulness”, and a 4x update towards manipulative AGI.
- Now let’s say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1⁄4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutually exclusive. Which means that if I’d done no simulation at all, I would’ve gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
- Ok, well, what if my four interventions have 1⁄8 baseline probability each, so only 50% total? Then I pick one at random, p(O | natural) = 1⁄8, p(O | manipulative) = 1⁄4, so I get a 2x update towards manipulative AGI. This is the same as if I'd just conditioned on the statement "one of my four interventions happens", and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update. (The arithmetic for these cases is spelled out below.)
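To make those numbers concrete, here's the arithmetic written out (a minimal sketch; the "manipulative" likelihoods assume a manipulator that can force whichever single outcome it predicts we'll condition on, and can only guess uniformly among the four when we randomize outside the simulation):

```python
# Likelihood-ratio arithmetic for the cases in the bullets above.

def update(p_natural, p_manipulative):
    """Bayes factor towards 'manipulative' from observing the condition."""
    return p_manipulative / p_natural

# One intervention with baseline probability 1/4; a manipulator forces it:
print(update(1/4, 1))      # 4.0 -> 4x update, one unit of helpfulness

# Four mutually exclusive interventions at baseline 1/4 each, pick one at
# random; the manipulator matches our random pick with probability 1/4:
print(update(1/4, 1/4))    # 1.0 -> no update

# Four interventions at baseline 1/8 each, pick one at random:
print(update(1/8, 1/4))    # 2.0 -> 2x update

# Same 2x if we just condition on the disjunction "one of the four happens":
print(update(4 * 1/8, 1))  # 2.0
```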
Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn’t doing anything different from just using a weaker search condition—it gives up bits of search, and so it has to pay less.
Strategy For Conditioning Generative Models
“Just Retarget the Search” directly eliminates the inner alignment problem.
I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you’re willing to assume that our interpretability tools are so good they can’t ever be tricked, you have to deal with that.
It’s not necessarily a huge issue—hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade our interpretability tools, but it’s not just “bada-bing bada-boom” exactly.
Not confident enough to put this as an answer, but
presumably no one could do so at birth
If you intend your question in the broadest possible sense, then I think we do have to presume exactly this. A rock cannot think itself into becoming a mind—if we were truly a blank slate at birth, we would have to remain a blank slate, because a blank slate has no protocols established to process input and become non-blank. Because it’s blank.
So how do we start with this miraculous non-blank structure? Evolution. And how do we know our theory of evolution is correct? Ultimately, by trusting our ability to distinguish signal from noise. There’s no getting around the “problem” of trapped priors.
Agree that there is no such guarantee. Minor nitpick that the distribution in question is in my mind, not out there in the world—if the world really did have a distribution of muggers' cash that fell off slower than 1/x, the universe would consist almost entirely of muggers' wallets (in expectation).
But even without any guarantee about my mental probability distribution, I think my argument does establish that not every possible EV agent is susceptible to Pascal's Mugging. That suggests that in the search for a formalism of an ideal decision-making algorithm, formulations of EV that meet this check are still on the table.
First and most important thing that I want to say here is that fanaticism is sufficient for longtermism, but not necessary. The “>10^36 future lives” thing means that longtermism would be worth pursuing even on fanatically low probabilities—but in fact, the state of things seems much better than that! X-risk is badly neglected, so it seems like a longtermist career should be expected to do much better than reducing X-risk by 10^-30% or whatever the break-even point is.
Second thing is that Pascal's Wager in particular kind of shoots itself in the foot by going infinite rather than merely very large. Since the expected value of any infinite reward is infinity regardless of the probability it's multiplied by, there's an informal cancellation argument that basically says "for any heaven I'm promised for doing X and hell for doing Y, there's some other possible deity that offers heaven for Y and hell for X".
Third and final thing—I haven’t actually seen this anywhere else, but here’s my current solution for Pascal’s Mugging. Any EV agent is going to have a probability distribution over how much money the mugger really has to offer. If I’m willing to consider (p>0) any sum, then my probability distribution has to drop off for higher and higher values, so the total integrates to 1. As long as this distribution drops off faster than 1/x as the offer increases, then arbitrarily large offers are overwhelmed by vast implausibility and their EV becomes arbitrarily small.
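A minimal numeric sketch of that last claim (my own numbers, using a 1/x² tail as one example of a distribution that drops off faster than 1/x):

```python
# Toy version of the anti-mugging argument: if my credence that the mugger can
# deliver a payoff of x falls off like 1/x^2 (for x >= 1, this integrates to 1),
# then the expected value of any single offer x is x * p(x) = 1/x, which
# shrinks as the offer grows.

def credence(x):
    """Prior density on the mugger being able to deliver a payoff of size x."""
    return 1.0 / x**2

for offer in [10, 10**3, 10**6, 10**9]:
    ev = offer * credence(offer)
    print(f"offer {offer:>13,}: p = {credence(offer):.1e}, EV of taking it = {ev:.1e}")
```

Scaling the offer up by a factor of k scales its expected value down by the same factor of k, so the mugger can't win just by quoting bigger numbers.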
My best guess at mechanism:
Before, I was a person who prided myself on succeeding at marshmallow tests. This caused me to frame work as a thing I wanted to succeed at, and so to work too hard.
Then, I read Meaningful Rest and Replacing Guilt, and realized that oftentimes I was working later to get more done that day, even though it would obviously be detrimental to the next day. That made the Reverse Marshmallow Test dynamic very intuitively obvious.
Now I am still a person who prides myself on my marshmallow prowess, but hopefully I’ve internalized an externality or something. Staying up late to work doesn’t feel Good and Virtuous, it feels Bad and like I’m knowingly Goodharting myself.
Note that this all still boils down to narrative-stuff. I'm nowhere near the level of zen that it takes to Just Pursue The Goal, with no intermediating narratives or drives based on self-image. I don't think this patch has particularly moved me towards that either; it's just helpful for where I'm currently at.
When you have a self-image as a productive, hardworking person, the usual Marshmallow Test gets kind of reversed. Normally, there’s some unpleasant task you have to do which is beneficial in the long run. But in the Reverse Marshmallow Test, forcing yourself to work too hard makes you feel Good and Virtuous in the short run but leads to burnout in the long run. I think conceptualizing of it this way has been helpful for me.
Nice post!
perhaps this problem can be overcome by including checks for generalization during training, i.e., testing how well the program generalizes to various test distributions.
I don’t think this gets at the core difficulty of speed priors not generalizing well. Let’s we generate a bunch of lookup-table-ish things according to the speed prior, and then reject all the ones that don’t generalize to our testing set. The majority of the models that pass our check are going to be basically the same as the rest, plus whatever modification that causes them to pass with is most probable on the speed prior, i.e. the minimum number of additional entries in the lookup table to pass the training set.
Here’s maybe an argument that the generalization/deception tradeoff is unavoidable: according to the Mingard et al. picture of why neural nets generalize, it’s basically just that they are cheap approximations of rejection sampling with a simplicity prior. This generalizes by relying on the fact that reality is, in fact, simplicity biased—it’s just Ockham’s razor. On this picture of things, neural nets generalize exactly to the extent that they approximate a simplicity prior, and so any attempt to sample according to an alternative prior will lose generalization ability as it gets “further” from the True Distribution of the World.
In general, I’m a bit unsure about how much of an interpretability advantage we get from slicing the model up into chunks. If the pieces are trained separately, then we can reason about each part individually based on its training procedure. In the optimistic scenario, this means that the computation happening in the part of the system labeled “world model” is actually something humans would call world modelling. This is definitely helpful for interpretability. But the alternative possibility is that we get one or more mesa-optimizers, which seems less interpretable.
In Search of Strategic Clarity
I’m pretty nervous about simulating unlikely counterfactuals because the solomonoff prior is malign. The worry is that the most likely world conditional on “no sims” isn’t “weird Butlerian religion that still studies AI alignment”, it’s something more like “deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world”.
In general, we know (assume) that our current world is safe. When we consider futures which only recieve a small sliver of probability from our current world, those futures will tend to have bigger chunks of their probability coming from other pasts. Some of these are safe, like the Butlerian one, but I wouldn’t be surprised if they were almost always dangerous.
Making a worst-case assumption, I want to only simulate worlds that are decently probable given today’s state, which makes me lean more towards trying to implement HCH.
Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous—we can’t inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
The patient research strategy is a bit weird, because the people we’re simulating to do our research for us are counterfactual copies of us—they’re probably going to want to run simulations too. Disallowing simulation as a whole seems likely to lead to weird behavior, but maybe we just disallow using simulated research to answer the whole alignment problem? Then our simulated researchers can only make more simulated researchers to investigate sub-questions, and eventually it bottoms out. But wait, now we’re just running Earth-scale HCH (ECE?)
Actually, that could help deal with the robustness-to-arbitrary-conditions and long-time-spans problems. Why don’t we just use our generative model to run HCH?
Another project I am excited about but probably am not going to get to for a while:
Investigating the possibility of a "steganographic snowball". The basic case for hope in CoT authenticity is something like "even if the AI wants to deceive us, if we start off with strong CoT authenticity, then unless it's smart enough to invent (within a single forward pass) a steganographic protocol too good for us to detect, we can pin it down and it can't ever secure a hidden channel to think misaligned thoughts in". If it weren't for this argument, I would be much less excited about trying to prevent an AI from thinking certain thoughts using CoT oversight, since you would have to directly fight the intelligence of the AI to do so. But if it can't deploy its full intelligence to reasoning about how to escape, then we're better off.
The concerns with this angle are as follows:
Can you really get down to 0 hidden channel bandwidth?
Can very small hidden channels support a parallel CoT?
In very small hidden channels, does reasoning degrade, or just slow down?
Can the model use a parallel CoT to reason about how to secure more hidden channel bandwidth?
I have some basic ideas for ways to measure these properties, and if all four show up positive, I would consider that significant evidence that a “steganographic snowball” is going to happen. That would mean that if your AI has any tendency to hide its thoughts from you, you won’t be able to stop it forever. More specifically, you’ll have to fight off the full intelligence of your AI if you want to prevent it from securing the amount of hidden channel bandwidth it wants.
There’s room for retreat to a “snowball would take a long time and we can extract work in the meantime” position, but that seems like a serious concession.
Thanks to Paul Colognese for brainstorming this stuff with me.