I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about
Mark Xu
yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.
I agree it is better to work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently little work has gone into “control” that it’s obviously worth investing more, because e.g. I think it’ll let us get more data on where bottlenecks are.
Yes, I agree. If I had more time, this would have been a top-level post. If anyone reading wants to write such a post using my quick take as a base, I would be happy to take a look and offer comments. I might do it myself at some point as well.
Alignment researchers should think hard about switching to working on AI Control
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50⁄50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
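To spell out the labor-shifting logic slightly (my own rough formalization of the sentence above, nothing beyond the standard equal-marginal-returns condition): let $n_a$ and $n_c$ be the amount of labor going into alignment and control, with $n_a + n_c$ fixed, and let $D(n_a, n_c)$ be P(doom). At the optimal split
$$-\frac{\partial D}{\partial n_a} = -\frac{\partial D}{\partial n_c},$$
so if control is currently much more neglected and the right-hand side is much larger, labor should move toward control until the two marginal effects equalize.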
For this post, my definitions are roughly:
AI alignment is the task of ensuring the AIs “do what you want them to do”
AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research.)
Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”, if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don’t update about what my words imply about my beliefs.):
Everything is in degrees. We can “partially align” some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is “how aligned are the AIs” + “how aligned do they need to be to use them for stuff”, so we should also focus on the 2nd thing.
If you were a hedge fund, and your strategy for preventing people from stealing your data and starting a new hedge fund was “we will make the hedge fund a super fun place to work and interview people carefully to make sure they really care about working for us”, and you also did not do things like “maybe people shouldn’t be able to download all the trading strategies onto a USB stick which we let them take home”, “maybe people should be forced to keep their work computers in the office” and “maybe we should have some people checking if other employees are trying to steal data and make some plans about how to stop them from starting a rival hedge fund if they do”, then all the other hedge funds would think you were an idiot. You should obviously try both to check whether people want to defect and to make it harder for them to defect.
I think a lot of the neglect of AI control can be traced to the “pivotal act” based framings of how to make aligned AI go well. I think these views are mistaken, but I won’t provide a great argument here (quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”). However, I don’t think things are shaping up for single pivotal acts to be very important, and being able to productively use early AI labor to help with alignment seems like it is going to be an important part of the future going well.
Working on AI control is really tractable compared to working on alignment. There is a lot of prior art in people thinking about how to e.g. prevent data from being stolen by insider threats, secure various information against adversaries, generally ensure that national security secrets don’t get leaked, etc. We can set up various plausible-looking scenarios and try to “game them out” under certain plausible-seeming assumptions about how much alignment we’ll be able to achieve. We can make assumptions about AI capabilities and alignment without really getting into the nitty-gritty of philosophically confusing topics like “what is deception really?” and “how likely is deception?” and “when will this all go down?”
Note that this means control might be easier for the AIs to automate, and easier to do “during crunch time.”
It also means that AI control is a bit contiguous with work people already do to protect from adversaries. I think the AI thing qualitatively changes the picture though, so it’s still worth thinking about.
If you succeed at AI control, there is a more compelling story you can tell for why you should implement the specific scheme that you thought of. In comparison, a lot of alignment proposals depend on philosophically dubious and confusing assumptions about the nature of cognition/SGD/etc. That is to say, a successful proposal in AI control comes with an argument that might be convincing to established security professionals if they thought about it, but there is likely no such class of professionals capable of lending your proposal external legitimacy in AI alignment.
The intelligence explosion might happen with less-than-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
There is a bunch of research that is useful for both alignment and control, e.g. model organisms of deception, interpretability techniques, oversight techniques, etc. More people should analyze such research from the perspective of “how can this be helpful even if it fails to produce an aligned AI?”
shane legg had 2028 median back in 2008, see e.g. https://e-discoveryteam.com/2023/11/17/shane-leggs-vision-agi-is-likely-by-2028-as-soon-as-we-overcome-ais-senior-moments/
Yes I agree with what you have written, and do think it’s overall not that likely that everything pans out as hoped. We do also have other hopes for how this general picture can still cohere if the specific path doesn’t work out, eg we’re open to learning some stuff empirically and adding an “algorithmic cherry on top” to produce the estimate.
The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.
I was meaning to include such a section, but forgot :). Perhaps I will edit it in. I think such work is qualitatively similar to what we’re trying to do, but that the key difference is that we’re interested in “best guess” estimates, as opposed to formally verified-to-be-correct estimates (mostly because we don’t think formally verified estimates are tractable to produce in general).
Relatedly, what’s the source of hope for these kinds of methods outperforming adversarial training? My sense from the certified defenses literature is that the estimates they produce are very weak, because of the problems with failing to model all the information in activations. (Note I’m not sure how weak the estimates actually are, since they usually report fraction of inputs which could be certified robust, rather than an estimate of the probability that a sampled input will cause a misclassification, which would be more analogous to your setting.)
The main hope comes from the fact that we’re using a “best guess” estimate, instead of trying to certify that the model won’t produce catastrophic actions. For example, Method 1 can be thought of as running a single example with a Gaussian blob around it through the model, but also tracking the “1st order” contributions that come from the Gaussian blob. If we wanted to bound the potential contributions from the Gaussian blob, our estimates would get really broad really fast, as you tend to see with interval propagation.
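As a toy numpy sketch of the kind of first-order propagation I have in mind (illustrative only, with made-up weights and a single ReLU MLP; not our actual method):

```python
# Toy sketch: push a Gaussian blob through a small ReLU MLP by tracking the mean
# and the first-order (Jacobian-propagated) covariance, then read off a
# "best guess" probability that the scalar output exceeds a threshold.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)) / 2, rng.normal(size=16) / 2
W2, b2 = rng.normal(size=(1, 16)) / 4, rng.normal(size=1)

def forward_and_jacobian(x):
    """Forward pass plus the Jacobian dy/dx at x (exact for this little net)."""
    h_pre = W1 @ x + b1
    mask = (h_pre > 0).astype(float)
    y = W2 @ (h_pre * mask) + b2
    J = W2 @ (mask[:, None] * W1)  # shape (1, 4)
    return y, J

mu_in = np.zeros(4)
cov_in = 0.05 * np.eye(4)            # the Gaussian blob around the input
mu_out, J = forward_and_jacobian(mu_in)
var_out = (J @ cov_in @ J.T).item()  # first-order propagated output variance

# Best-guess tail estimate under the linearized Gaussian (not a bound).
threshold = 2.0
z = (threshold - mu_out.item()) / sqrt(var_out + 1e-12)
print("estimated P(output > threshold):", 0.5 * (1 - erf(z / sqrt(2))))
```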
Although, this also comes with the opposite issue of how to know if the estimates are at all reasonable, especially when you train against them.
If your catastrophe detector involves a weak model running many many inferences, then it seems like the total number of layers is vastly larger than the number of layers in M, which seems like it will exacerbate the problems above by a lot. Any ideas for dealing with this?
I think fundamentally we just need our estimates to “not get that much worse” as things get deeper/more complicated. The main hope for why we can achieve this is that the underlying model itself will not get worse as it gets deeper/the chain of thought gets longer. This implies that there is some sort of stabilization going on, so we will need to capture the effect of this stabilization. It does seem like in order to do this, we will have to model only high-level properties of this distribution, instead of trying to model things on the level of activations.
In other words, one issue with interval propagation is that it makes an assumption that can only become less true as you propagate through the model. After a few layers, you’re (perhaps only implicitly) putting high probability on activations that the model will never produce. But as long as your “activation model” is behaving reasonably, then hopefully it will only become more uncertain insofar as the underlying reasoning done by the model becomes more uncertain.
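To make the contrast concrete, here is a toy numpy illustration (random weights, nothing specific to any real model) of how interval propagation’s boxes widen with depth, implicitly covering activations the model would never produce:

```python
# Toy sketch: naive interval propagation through stacked random linear+ReLU layers.
import numpy as np

rng = np.random.default_rng(1)
d, depth = 32, 8
lo, hi = -0.1 * np.ones(d), 0.1 * np.ones(d)   # small box around an input

for layer in range(depth):
    W = rng.normal(size=(d, d)) / np.sqrt(d)
    b = 0.1 * rng.normal(size=d)
    # Interval arithmetic for an affine layer: split W into positive/negative parts.
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    new_lo = W_pos @ lo + W_neg @ hi + b
    new_hi = W_pos @ hi + W_neg @ lo + b
    # ReLU is monotone, so it maps the interval endpoint-wise.
    lo, hi = np.maximum(new_lo, 0), np.maximum(new_hi, 0)
    print(f"layer {layer + 1}: mean interval width = {(hi - lo).mean():.3g}")
```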
What’s your proposal for the distribution P0 for Method 2 (independent linear features)?
You can either train an SAE on the input distribution, or just try to select the input distribution to maximize the probability of catastrophe produced by the estimation method (perhaps starting with an SAE of the input distribution, or a random one). Probably this wouldn’t work that well in practice.
Why think this is a cost you can pay? Even if we ignore the existence of C and just focus on M, and we just require modeling the correlations between any pair of layers (which of course can be broken by higher-order correlations), that is still quadratic in the number of parameters of M and so has a cost similar to training M in the first place. In practice I would assume it is a much higher cost (not least because C is so much larger than M).
Our ultimate goal is vaguely to “only pay costs that SGD had to pay to produce M.” Slightly more specifically, M has a bunch of correlations between its layers. Some of these correlations were actively selected to be those particular values by SGD, and other correlations were kind of random. We want to track the ones that were selected, and just assume the other ones are random. Hopefully, since SGD was not actively manipulating those correlations, the underlying model is in some sense invariant to their precise values, and so a model that treats such correlations as random will predict the same underlying behavior as a model that models the precise values of those correlations.
I don’t think Paul thinks verification is generally easy or that delegation is fundamentally viable. He doesn’t, for example, think hiring is easy: he thinks it’s in fact a hard problem to verify whether someone is good at their job.
I liked Rohin’s comment elsewhere on this general thread.
I’m happy to answer more specific questions, although would generally feel more comfortable answering questions about my views than about Paul’s.
If you’re committed to producing a powerful AI, then the thing that matters is the probability there exists something you can’t find that will kill you. I think our current understanding is sufficiently paltry that the chance of this working is pretty low (the value added by doing selection on non-deceptive behavior is probably very small, but I think there’s a decent chance you just won’t get that much deception). But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.
For any given system, you have some distribution over which properties will be necessary to verify in order to not die to that system. Some of those you will in fact be able to verify, thereby obtaining evidence about whether that system is dangerous. “Strategic deception” is a large set of features, some of which are possible to verify.
yes, you would need the catastrophe detector to be reasonably robust. Although I think it’s fine if e.g. you have at least 1/million chance of catching any particular catastrophe.
I think there is a gap, but that the gap is probably not that bad (for “worst case” tail risk estimation). That is maybe because I think being able to do estimation through a single forward pass is already likely to be very hard, and to require being able to do “abstractions” over the concepts being manipulated by the forward pass. CoT seems like it will require abstractions of a qualitatively similar kind.
I think there are some easy-to-verify properties that would make us more likely to die if they were hard-to-verify. And therefore think “verification is easier than generation” is an important part of the overall landscape of AI risk.
I think both that:
this is not a good characterization of Paul’s views
verification is typically easier than generation, and this fact is important for the overall picture of AI risk
I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense of the argument John claims to be making in the post:
the motte: there exist hard to verify properties
the bailey: all/most important properties are hard to verify
I agree ergonomics can be hard to verify. But some ergonomics are easy to verify, and chairs conform to those ergonomics (e.g. having a backrest is good, not having sharp stabby parts is good, etc.).
I agree that there are some properties of objects that are hard to verify. But that doesn’t mean generation is as hard as verification in general. The central property of a chair (that you can sit on it) is easy to verify.
I think catastrophe detectors in practice will be composed of neural networks interacting with other stuff, like scientific literature, python, etc.
With respect to the stuff quoted, I think all but “doing experiments” can be done with a neural net doing chain of thought (although not making claims about quality).
I think we’re trying to solve a different problem than trusted monitoring, but I’m not that knowledgeable about what issues trusted monitoring is trying to solve. The main thing that I don’t think you can do with monitoring is producing a model that you think is unlikely to result in catastrophe. Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.
Separately, I do think it will be easy to go from “worst-case” NN-tail-risk estimation to “worst-case” more general risk estimation. I do not think it will be easy to go from “typical-case” NN-tail-risk estimation to more general “typical-case” risk estimation, but think that “typical-case” NN-tail-risk estimation can meaningfully improve safety despite not being able to do that generalization.
Re. more specific hopes: if your risk estimate is conducted by a model with access to tools like python, then we can try to do two things:
vaguely get an estimate that is as good as the estimate you would get if you replaced “python” with your model’s subjective distribution over the output of whatever it runs through python.
learn some “empirical regularities” that govern how python works (as expected by your model/SGD)
(these might be the same thing?)
Another argument: one reason why doing risk estimates for NNs is hard is because the estimate can rely on facts that live in some arbitrary LLM ontology. If you want to do such an estimate for an LLM bureaucracy, some fraction of the relevant facts will live in LLM ontology and some fraction of facts will live in words passed between models. Some fraction of facts will live in a distributed way, which adds complications, but those distributed facts can only affect the output of the bureaucracy insofar as they are themselves manipulated by an LLM in that bureaucracy.
I have left a comment about a central way I think this post is misguided: https://www.lesswrong.com/posts/7fJRPB6CF6uPKMLWi/my-ai-model-delta-compared-to-christiano?commentId=sthrPShrmv8esrDw2
This post uses “I can identify ways in which chairs are bad” as an example. But it’s easier for me to verify that I can sit in a chair and that it’s comfortable than to make a chair myself. So I don’t really know why this is a good counterexample to “verification is easier than generation”.
More examples:
I can tell my computer is a good typing machine, but cannot make one myself
I can tell a water bottle is watertight, but do not know how to make a water bottle
I can tell that my pepper grinder grinds pepper, but do not know how to make a pepper grinder.
If the goal of this post is to discuss the crux https://www.lesswrong.com/posts/fYf9JAwa6BYMt8GBj/link-a-minimal-viable-product-for-alignment?commentId=mPgnTZYSRNJDwmr64:
evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it
then I think there is a large disconnect between the post above, which posits that in order for this claim to be false there has to be some “deep” sense in which delegation is viable, and the more mundane sense in which I think this crux is obviously false: all humans interface with the world and optimize over the products other people create, and are therefore more capable than they would have been if they had to make all products for themselves from scratch.
I think “basically obviates” is too strong. Imitation of human-legible cognitive strategies + RL seems liable to produce very different systems than would have been produced with pure RL. For example, in the first case, RL incentivizes the strategies being combined in ways conducive to accuracy (in addition to potentially incentivizing non-human-legible cognitive strategies), whereas in the second case you don’t get any incentive towards productively using human-legible cognitive strategies.
My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the “long reflection”, and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.
I separately have hope that we can solve “the entire problem” at some point, e.g. through ARC’s agenda (which I spend most of my time trying to derisk and advance).