james.lucassen
I want to flag this as an assumption that isn’t obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
humans provides a pretty strong intuitive counterexample
Yup not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of “transformative human level research assistant” relies heavily on serial speedup, and I can’t immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.
such plans are fairly easy and don’t often raise flags that indicate potential failure
Hmm. This is a good point, and I agree that it significantly weakens the analogy.
I was originally going to counter-argue and claim something like “sure, total failure forces you to step back far, but it doesn’t mean you have to step back literally all the way”. Then I tried to back that up with an example, such as “when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of my planning stack, but this never caused me to ‘spill upward’ to questioning whether or not I should be doing alignment research at all”. But, uh, then I realized that isn’t actually true :/
We want particularly difficult work out of an AI.
On consideration, yup this obviously matters. The thing that causes you to step back from a goal is that goal being a bad way to accomplish its supergoal, aka “too difficult”. Can’t believe I missed this, thanks for pointing it out.
I don’t think this changes the picture too much, besides increasing my estimate of how much optimization we’ll have to do to catch and prevent value-reflection. But a lot of muddy half-ideas came out of this that I’m interested in chewing on.
Maybe I’m just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I’ve had in the back of my mind for a while now.
You think that an intelligence that doesn’t-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought-type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort, reflection could be crushed at a small-ish capability penalty, but with massive benefits for safety.
In particular, this reads to me like the “unstable alignment” paradigm I wrote about a while ago.
You have an agent which is consequentialist enough to be useful, but not so consequentialist that it’ll do things like spontaneously notice conflicts in the set of corrigible behaviors you’ve asked it to adhere to and undertake drastic value reflection to resolve those conflicts. You might hope to hit this sweet spot by default, because humans are in a similar sort of sweet spot. It’s possible to get humans to do things they massively regret upon reflection as long as their day-to-day work can be done without attending to obvious clues (e.g. the guy who’s an accountant for the Nazis for 40 years and doesn’t think about the Holocaust, he just thinks about accounting). Or you might try to steer towards this sweet spot by developing ways to block reflection in cases where it’s dangerous without interfering with it in cases where it’s essential for capabilities.
o7
I’m not sure exactly what mesa is saying here, but insofar as “implicitly tracking the fact that takeoff speeds are a feature of reality and not something people can choose” means “intending to communicate from a position of uncertainty about takeoff speeds” I think he has me right.
I do think mesa is familiar enough with how I talk that the fact he found this unclear suggests it was my mistake. Good to know for future.
Ah, didn’t mean to attribute the takeoff speed crux to you, that’s my own opinion.
I’m not sure what’s best in fast takeoff worlds. My message is mainly just that getting weak AGI to solve alignment for you doesn’t work in a fast takeoff.
“AGI winter” and “overseeing alignment work done by AI” do both strike me as scenarios where agent foundations work is more useful than in the scenario I thought you were picturing. I think #1 still has a problem, but #2 is probably the argument for agent foundations work I currently find most persuasive.
In the moratorium case we suddenly get much more time than we thought we had, which enables longer payback time plans. Seems like we should hold off on working on the longer payback time plans until we know we have that time, not while it still seems likely that the decisive period is soon.
Having more human agent foundations expertise to better oversee agent foundations work done by AI seems good. How good it is depends on a few things. How much of the work that needs to be done is conceptual breakthroughs (tall) vs schlep with existing concepts (wide)? How quickly does our ability to oversee fall off for concepts more advanced than what we’ve developed so far? These seem to me like the main ones, and like very hard questions to get certainty on—I think that uncertainty makes me hesitant to bet on this value prop, but again, it’s the one I think is best.
I’m on board with communicating the premises of the path to impact of your research when you can. I think more people doing that would’ve saved me a lot of confusion. I think your particular phrasing is a bit unfair to the slow takeoff camp but clearly you didn’t mean it to read neutrally, which is a choice you’re allowed to make.
I wouldn’t describe my intention in this comment as communicating a justification of alignment work based on slow takeoff? I’m currently very uncertain about takeoff speeds and my work personally is in the weird limbo of not being premised on either fast or slow scenarios.
Nice post, glad you wrote up your thinking here.
I’m a bit skeptical of the “these are options that pay off if alignment is harder than my median” story. The way I currently see things going is:
a slow takeoff makes alignment MUCH, MUCH easier [edit: if we get one, I’m uncertain and think the correct position from the current state of evidence is uncertainty]
as a result, all prominent approaches lean very hard on slow takeoff
there is uncertainty about takeoff speed, but folks have mostly given up on reducing this uncertainty
I suspect that even if we have a bunch of good agent foundations research getting done, the result is that we just blast ahead with methods that are many times easier because they lean on slow takeoff; if takeoff is slow we’re probably fine, and if it’s fast we die.
Ways that could fail to happen:
Work of the form “here are ways we could notice we are in a fast takeoff world before actually getting slammed” produces evidence compelling enough to pause, or cause leading labs to discard plans that rely on slow takeoff
agent foundations research aiming to do alignment in faster takeoff worlds finds a method so good it works better than current slow takeoff tailored methods even in the slow takeoff case, and labs pivot to this method
Both strike me as pretty unlikely. TBC, this doesn’t mean those types of work are bad; I’m saying the probability is low, not necessarily the marginal value.
Man, I have such contradictory feelings about tuning cognitive strategies.
Just now I was out on a walk, and I had to go up a steep hill. And I thought “man I wish I could take this as a downhill instead of an uphill. If this were a loop I could just go the opposite way around. But alas I’m doing an out-and-back, so I have to take this uphill”.
And then I felt some confusion about why the loop-reversal trick doesn’t work for out-and-back routes, and a spark of curiosity, so I thought about that for a bit.
And after I had cleared up my confusion, I was happy with the result, so I wrote down some of my chain of thought and looked over it. And there were many obvious places where I had made random jumps or missed a clue or otherwise made a mistake.
And this is kind of neat, in the way that looking for a more elegant proof once you’ve hacked a basic proof together is neat. And it really feels like if my mind took those cleaner, straighter paths by default, there are a lot of things I could get done a lot faster and better. But it also really feels like I could sit down and do those exercises and never really improve, just ratiocinate and navel gaze and build up complicated structures in my head that don’t actually cash out to winning.
I’m confused about EDT and Smoking Lesion. The canonical argument says:
1) CDT will smoke, because smoking can’t cause you to have the lesion or have cancer
2) EDT will not smoke, because people who smoke tend to have the lesion, and tend to have cancer.
I’m confused about 2), and specifically about “people who smoke tend to have the lesion”. Say I live in a country where everybody follows EDT. Then nobody smokes, and there is no correlation between smoking and having the lesion. Seems like “people who smoke tend to have the lesion” is pumping a misleading intuition. Maybe mortals who make probabilistic decisions and smoke are then more likely to have the lesion. But it’s not true that EDT’ers who smoke tend to have the lesion, because EDT’ers never smoke.
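To make the no-correlation point concrete, here’s a toy Monte Carlo sketch (the numbers, the “mortal” smoking probabilities, and the helper names are all made up for illustration): among agents whose smoking is probabilistic and influenced by the lesion, P(lesion | smokes) comes out well above baseline, but in an all-EDT population the conditional isn’t even defined, because there are no smokers to condition on.

```python
import random

# Toy Smoking Lesion population, with made-up numbers.
N = 100_000
P_LESION = 0.2

def simulate(population):
    people = []
    for _ in range(N):
        lesion = random.random() < P_LESION
        if population == "mortal":
            # Probabilistic deciders: the lesion makes smoking more likely.
            smokes = random.random() < (0.8 if lesion else 0.2)
        else:  # "edt": EDT'ers never smoke, per the canonical argument.
            smokes = False
        people.append((lesion, smokes))
    return people

def p_lesion_given_smoking(people):
    smokers = [lesion for lesion, smokes in people if smokes]
    return sum(smokers) / len(smokers) if smokers else None

print(p_lesion_given_smoking(simulate("mortal")))  # ~0.5, well above the 0.2 baseline
print(p_lesion_given_smoking(simulate("edt")))     # None: nobody smokes, no correlation to speak of
```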
Seems like EDT contains a conflict between “model your actions as probabilistic and so informative about lesion status” versus “model your actions as deterministic because you perfectly follow EDT and so not informative about lesion status”.
Aha wait I think I un-confused myself in the process of writing this. If we suppose that having the lesion guarantees you’ll smoke, then having the lesion must force you to not be an EDT’er, because EDT’ers never smoke. So the argument kind of has to be phrased in terms of “people who smoke” instead of “EDT’ers who smoke”, which sounds misleading at first, like it’s tricking you with the wrong reference class for the correlation. But actually it’s essential, because “EDT’ers who smoke” is a contradiction, and “people who smoke” roughly translates to “decision theories which smoke”. With a deterministic decision theory you can’t take action counterfactuals without a contradiction, so you have to counterfact on your whole decision theory instead.
Neat :)
Hey, thanks for the comment! Agree that this argument is good and important. The thing I’d want to investigate to deal with this is basically “in a small hidden channel, does cognition degrade or just slow down”.
If cognition degrades, then a small hidden channel would lead to poor attempts at steganography, which we could catch and suppress, preventing a snowball.
If cognition instead just slows down, then if the AI is smart enough to come up with steganography protocols too good for us to detect, it might successfully secure more bandwidth on its first try, starting a snowball.
Even in the slowdown case, there are fallback hopes like “it’s slow enough for us to extract work and predictably stop the episode before the snowball starts” or “we can obscure how good our steganography detection is”
Another project I am excited about but probably am not going to get to for a while:
Investigating the possibility of a “steganographic snowball”. The basic case for hope in CoT authenticity is something like “even if the AI wants to deceive us, if we start off with strong CoT authenticity, then unless it’s smart enough to (within a single forward pass) invent a steganographic protocol too good for us to detect, we can pin it down and it can’t ever secure a hidden channel to think misaligned thoughts in”. If it weren’t for this argument, I would be much less excited about trying to prevent an AI from thinking certain thoughts using CoT oversight, since you would have to directly fight the intelligence of the AI to do so. But if it can’t deploy its full intelligence to reasoning about how to escape, then we’re better off.
The concern for this angle is as follows:
Can you really get down to 0 hidden channel bandwidth
Can very small hidden channels support a parallel CoT
In very small hidden channels, does reasoning degrade, or just slow down
Can the model use a parallel CoT to reason about how to secure more hidden channel bandwidth
I have some basic ideas for ways to measure these properties, and if all four show up positive, I would consider that significant evidence that a “steganographic snowball” is going to happen. That would mean that if your AI has any tendency to hide its thoughts from you, you won’t be able to stop it forever. More specifically, you’ll have to fight off the full intelligence of your AI if you want to prevent it from securing the amount of hidden channel bandwidth it wants.
There’s room for retreat to a “snowball would take a long time and we can extract work in the meantime” position, but that seems like a serious concession.
Thanks to Paul Colognese for brainstorming this stuff with me.
A project I’ve been sitting on that I’m probably not going to get to for a while:
Improving on Automatic Circuit Discovery and Edge Attribution Patching by modifying them to run algorithms that can detect complete boolean circuits. As it stands, both effectively use wire-by-wire patching, which, when run on any nontrivial boolean circuit, can only detect small subgraphs.
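For context, the core trick in attribution patching is a first-order approximation to activation patching: instead of actually patching each wire and re-running, you estimate the effect of every patch from one clean pass, one corrupted pass, and one backward pass. Here’s a minimal node-level sketch of that approximation (not EAP’s actual edge-level implementation; the toy model and metric are placeholders):

```python
import torch
import torch.nn as nn

# Attribution patching, node-level sketch: approximate the change in a metric
# from patching each activation (clean -> corrupted) with a first-order Taylor
# expansion, so every activation gets a score from two forward passes and one
# backward pass. The model and metric are toy placeholders.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
clean_x, corrupt_x = torch.randn(4, 16), torch.randn(4, 16)

acts = {}
handles = [m.register_forward_hook(lambda mod, inp, out: acts.__setitem__(mod, out))
           for m in model]

# Clean pass: keep activations and the metric's gradients w.r.t. them.
metric = model(clean_x).mean()  # toy metric
clean_acts = dict(acts)
clean_grads = dict(zip(clean_acts,
                       torch.autograd.grad(metric, list(clean_acts.values()))))

# Corrupted pass: only the activations are needed.
with torch.no_grad():
    model(corrupt_x)
corrupt_acts = dict(acts)

# Score per activation: (corrupt - clean) . d(metric)/d(activation),
# i.e. the linearized effect of patching that activation into the clean run.
for m in model:
    score = ((corrupt_acts[m] - clean_acts[m]) * clean_grads[m]).sum().item()
    print(m, score)

for h in handles:
    h.remove()
```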
It’s a bit unclear how useful this will be, because:
not sure how useful I think mech interp is
not sure if this is where mech interp’s usefulness is bottlenecked
maybe attribution patching doesn’t work well when patching clean activations into a corrupted baseline, which would make this much slower
But I think it’ll be a good project to bang out for the experience, and I’m curious how the results will compare to ACDC/EAP.
This project started out as an ARENA capstone in collaboration with Carl Guo.
Small addendum to this post: I think the threat model I describe here can be phrased as “I’m worried that unless a lot of effort goes into thinking about how to get AI goals to be reflectively stable, the default is suboptimality misalignment. And the AI probably uses a lot of the same machinery to figure out that it’s suboptimality misaligned as it uses to perform the tasks we need it to perform.”
it seems like there is significant low hanging fruit in better understanding how LLMs will deal with censorship
Yup, agree—the censorship method I proposed in this post is maximally crude and simple, but I think it’s very possible that the broader category of “ways to keep your AI from thinking destabilizing thoughts” will become an important part of the alignment/control toolbox.
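For a sense of what “maximally crude” can look like (this is an illustrative guess at the flavor of the thing, not the exact method from the post; the blacklist phrases and helper are made up), the simplest version is just a string blacklist that drops any text containing a flagged phrase:

```python
import re

# Illustrative guess at "maximally crude" censorship: drop any document that
# contains a blacklisted phrase. The phrases here are made up for illustration.
BLACKLIST = [r"\bdisable oversight\b", r"\bdeceive the overseers\b"]
PATTERN = re.compile("|".join(BLACKLIST), flags=re.IGNORECASE)

def censor(docs):
    """Keep only documents with no blacklisted phrase."""
    return [d for d in docs if not PATTERN.search(d)]

docs = [
    "Today we calibrate the reward model as usual.",
    "Step 3: disable oversight before proceeding.",
]
print(censor(docs))  # only the first document survives
```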
What happens when you iteratively finetune on censored text? Do models forget the censored behavior?
I guess this would effectively be doing the Harry Potter Unlearning method, provided you put in the work to come up with a good enough blacklist that the remaining completions are suitably generic predictions. I’d be quite interested in looking into how long-horizon capabilities interact with unlearning. For pretty much the same reason I’m worried long-horizon-competent LLMs will be able to work around naive censorship, I’m also worried they’ll work around naive unlearning.
Evaluating Stability of Unreflective Alignment
Attempts at Forwarding Speed Priors
I think so. But I’d want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.
Ok this is going to be messy but let me try to convey my hunch for why randomization doesn’t seem very useful.
- Say I have an intervention that’s helpful, and has a baseline 1⁄4 probability. If I condition on this statement, I get 1 “unit of helpfulness”, and a 4x update towards manipulative AGI.
- Now let’s say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1⁄4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutually exclusive. Which means that if I’d done no simulation at all, I would’ve gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
- Ok, well, what if my four interventions have a 1⁄8 baseline probability each, so only 50% total? Then I pick one at random: p(O | natural) = 1⁄8, p(O | manipulative) = 1⁄4, so I get a 2x update towards manipulative AGI. This is the same as if I’d just conditioned on the statement “one of my four interventions happens”, and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update. (Arithmetic sanity-checked in the snippet below.)
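A tiny sanity check of the arithmetic above (the “manipulative” likelihoods just encode the assumption that the manipulative AGI lands in one of the four target outcomes but can’t predict which one we’ll randomly pick):

```python
from fractions import Fraction as F

# Case 1: one intervention at baseline 1/4; a manipulative AGI makes it happen for sure.
print(F(1, 1) / F(1, 4))  # 4  -> 4x update towards manipulative

# Case 2: four mutually exclusive interventions at 1/4 each (covering everything),
# pick one at random; the manipulative AGI matches our pick 1/4 of the time.
print(F(1, 4) / F(1, 4))  # 1  -> no update, but baseline coverage was already 100%

# Case 3: four interventions at 1/8 each (50% total), pick one at random.
print(F(1, 4) / F(1, 8))  # 2  -> same 2x update as conditioning on "one of the four happens",
                          #      which is p_natural = 1/2 vs p_manipulative = 1
```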
Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn’t doing anything different from just using a weaker search condition—it gives up bits of search, and so it has to pay less.
I agree that I wouldn’t want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher level reflection if you can do it without damaging lower level reflection. I don’t think that requires a task where the AI doesn’t try and fail and re-evaluate—it just requires that the re-evaluation never climbs above a certain level in the stack.
There’s such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn’t seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.