This is a perspective I have on how to do useful AI alignment research. Most perspectives I’m aware of are constructive: they have some blueprint for how to build an aligned AI system, and propose making it more concrete, making the concretisations more capable, and showing that it does in fact produce an aligned AI system. I do not have a constructive perspective—I’m not sure how to build an aligned AI system, and don’t really have a favourite approach. Instead, I have an analytic perspective. I would like to understand AI systems that are built. I also want other people to understand them. I think that this understanding will hopefully act as a ‘filter’ that means that dangerous AI systems are not deployed. The following dot points lay out the perspective.
Since the remainder of this post is written as nested dot points, some readers may prefer to read it in workflowy.
Background beliefs
I am imagining a future world in which powerful AGI systems are made of components roughly like neural networks (either feedforward or recurrent) that have a large number of parameters.
Furthermore, I’m imagining that the training process of these ML systems does not provide enough guarantees about deployment performance.
In particular, I’m supposing that systems are being trained based on their ability to deal with simulated situations, and that that’s insufficient because deployment situations are hard to model and therefore simulate.
One reason that they are hard to model is the complexities of the real world.
The real world might be intrinsically difficult to model for the relevant system. For instance, it’s difficult to simulate all the situations in which the CEO of Amazon might find themselves.
Another reason that real world situations may be hard to model is that they are dependent on the final trained system.
The trained system may be able to affect what situations it ends up in, meaning that situations during earlier training are unrepresentative.
Parts of the world may be changing their behaviour in response to the trained system…
in order to exploit the system.
by learning from the system’s predictions.
The real world is also systematically different than the trained world: for instance, while you’re training, you will never see the factorisation of RSA-2048 (assuming you’re training in the year 2020), but in the real world you eventually will.
This is relevant because you could imagine mesa-optimisers appearing in your system that choose to act differently when they see such a factorisation.
I’m imagining that the world is such that if it’s simple for developers to check if an AI system would have disastrous consequences upon deployment, then they perform this check, and fail to deploy if the check says that it would.
Background desiderata
I am mostly interested in allowing the developers of AI systems to determine whether their system has the cognitive ability to cause human extinction, and whether their system might try to cause human extinction.
I am not primarily interested in reducing the probabilities of other ways in which AI systems could cause humanity to go extinct, such as research groups intentionally behaving badly, or an uncoordinated set of releases of AI systems that interact in negative ways.
That being said, I think that pursuing research suggested by this perspective could help with the latter scenario, by making it clear which interaction effects might be present.
I want this determination to be made before the system is deployed, in a ‘zero-shot’ fashion, since this minimises the risk of the system actually behaving badly before you can detect and prevent it.
Transparency
The type of transparency that I’m most excited about is mechanistic, in a sense that I’ve described elsewhere.
The transparency method itself should be based on a trusted algorithm, as should the method of interpreting the transparent artefact.
In particular, these operations should not be done by a machine learning system, unless that system itself has already been made transparent and verified.
This could be done amplification-style.
Ideally, models could be regularised for transparency during training, with little or no cost to performance.
This would be good because by default models might not be very transparent, and it might be hard to hand-design very transparent models that are also capable.
I think of this as the conclusion one should draw from Rich Sutton's bitter lesson.
This will be easier to do if the transparency method is simpler, more ‘mathematical’, and minimally reliant on machine learning.
You might expect little cost to performance since neural networks can often reach high performance given constraints, as long as they are deep enough.
This paper on the intrinsic dimension of objective landscapes shows that you can constrain neural network weights to a low-dimensional subspace and still find good solutions.
This paper argues that there are a large number of models with roughly the same performance, meaning that ones with good qualities (e.g. interpretability) can be found.
This paper applies regularisation to machine learning models that ensures that they are represented by small decision trees.
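As a minimal sketch of the random-subspace idea from the intrinsic-dimension paper (using a toy quadratic objective and illustrative dimensions, not the paper's actual experiments): fix a random projection P, set the full parameter vector to θ = Pz, and optimise only the low-dimensional z. The loss still falls even though most directions in parameter space are frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training" objective over D parameters: L(theta) = ||A theta - b||^2.
D, d = 100, 5                      # full and intrinsic dimension (illustrative)
A = rng.standard_normal((D, D)) / 10
b = rng.standard_normal(D)

# Fixed random projection: theta = P @ z, so we only optimise z in R^d.
P = rng.standard_normal((D, d))
P /= np.linalg.norm(P, axis=0)     # unit-norm columns, as in the paper

def loss(z):
    residual = A @ (P @ z) - b
    return residual @ residual

z = np.zeros(d)
losses = [loss(z)]
for _ in range(200):
    grad = 2 * P.T @ A.T @ (A @ (P @ z) - b)  # gradient w.r.t. z only
    z -= 0.01 * grad
    losses.append(loss(z))

print(losses[0], losses[-1])  # loss drops despite optimising 5 of 100 dims
```

The same trick works with real networks by flattening all weights into θ; the point here is just that a tiny random subspace of a much larger parameter space can already contain good solutions.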
The transparency method only has to reveal useful information to developers, not to the general public.
This makes the problem easier but still difficult.
Presumably developers will not deploy catastrophically terrible systems, since catastrophes are usually bad for most people, and I’m most interested in averting catastrophic outcomes.
Foundations
In order for the transparency to be useful, practitioners need to know what problems to look for, and how to reason about these problems.
I think that an important part of this is ‘agent foundations’, by which I broadly mean a theory of what agents should look like, and what structural facts about agents could cause them to display undesired behaviour.
Examples:
Work on mesa-optimisation
Utility theory, e.g. the von Neumann-Morgenstern theorem
Methods of detecting which agents are likely to be intelligent or dangerous.
For this, it is important to be able to look at a machine learning system and learn if (or to what degree) it is agentic, detect belief-like structures and preference-like structures (or to deduce things analogous to beliefs and preferences), and learn other similar things.
This requires structural definitions of the relevant primitives (such as agency), not subjective or performance-based definitions.
By ‘structural definitions’, I mean definitions that refer to facts that are easily accessible about the system before it is run.
By ‘subjective definitions’, I mean definitions that refer to an observer’s beliefs or preferences regarding the system.
By ‘performance-based definitions’, I mean definitions that refer to facts that can be known about the system once it starts running.
Subjective definitions are inadequate because they do not refer to easily-measurable quantities.
Performance-based definitions are inadequate because they can only be evaluated once the system is running, when it could already pose a danger, violating the “zero-shot” desideratum.
Structural definitions are required because they are precisely the definitions that are neither subjective nor performance-based while still referring only to easily accessible facts, which makes it easy to evaluate whether a system satisfies them.
As such, definitions like “an agent is a system whose behaviour can’t usefully be predicted mechanically, but can be predicted by assuming it near-optimises some objective function” (which was proposed in this paper) are insufficient because they are both subjective and performance-based.
It is possible to turn subjective definitions into structural definitions trivially, by asking a human about their beliefs and preferences. This is insufficient.
e.g. “X is a Y if you are scared of it” can turn to “X is a Y if the nearest human to X, when asked if they are scared of X, says ‘yes’”.
It is insufficient because such a definition doesn’t help the human form their subjective beliefs and impressions.
It is also possible to turn subjective definitions that only depend on beliefs into structural definitions by determining which circumstances warrant a rational being to have which beliefs. This is sufficient.
Compare the structural definition of temperature as "the derivative of a system's energy with respect to entropy at fixed volume and particle number" to the operational definition "equilibrate the system with a thermometer, read it off the thermometer". For a rational being, these two definitions yield the same temperature for almost all systems.
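A sketch of why the derivative definition and the thermometer procedure agree (this is standard equilibrium thermodynamics, not anything specific to this post): when the system and thermometer exchange energy at fixed total energy, the joint entropy is maximised at equilibrium, so

```latex
0 = \frac{\partial}{\partial E_{\text{sys}}}
    \Bigl( S_{\text{sys}}(E_{\text{sys}})
         + S_{\text{th}}(E_{\text{tot}} - E_{\text{sys}}) \Bigr)
  = \frac{\partial S_{\text{sys}}}{\partial E_{\text{sys}}}
  - \frac{\partial S_{\text{th}}}{\partial E_{\text{th}}}
  \;\implies\;
  \frac{1}{T_{\text{sys}}} = \frac{1}{T_{\text{th}}},
```

where each \(T \equiv (\partial E/\partial S)_{V,N}\). The thermometer reading therefore recovers the derivative-based quantity at equilibrium.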
Relation between transparency and foundations
The agent foundations theory should be informed by transparency research, and vice versa.
This is because the information that transparency methods can yield should be all the information that is required to analyse the system using the agent foundations theory.
Both lines of research can inform the other.
Transparency researchers can figure out how to reveal the information required by agent foundations theory, and detect the existence of potential problems that agent foundations theory suggests might occur given certain training procedures.
Agent foundations researchers can figure out what is implied by the information revealed by existing transparency tools, and theorise about problems that transparency researchers detect.
Criticisms of the perspective
It isn’t clear if neural network transparency is possible.
More specifically, it seems imaginable that some information required to usefully analyse an AI system cannot be extracted from a typical neural network in polynomial time.
It isn’t clear that relevant terms from agency theory can in fact be well-defined.
E.g. “optimisation” and “belief” have eluded a satisfactory computational grounding for quite a while.
Relatedly, the philosophical question of which physical systems enable which computations has not to my mind been satisfactorily resolved. See this relevant SEP article.
An easier path to transparency than the “zero-shot” approach might be to start with simpler systems, observe their behaviour, and slowly scale them up. As you see problems, stop scaling up the systems, and instead fix them so the problems don’t occur.
I disagree with this criticism.
At one point, it’s going to be the first time you use a system of a given power in a domain, and the problems caused by the system might be discontinuous with its power, meaning that they would be hard to predict.
Especially if the power of the system increases discontinuously.
It is plausibly the case that systems that are a bit 'smarter than humanity' are discontinuously more problematic than those that are a bit 'less smart than humanity'.
One could imagine giving up the RL dream for something like debate, where you really can get guarantees from the training procedure.
I think that this is not true, and that things like debate require transparency tools to work well, so as to let debaters know when other debaters are being deceitful. An argument for an analogous conclusion can be found in evhub’s post on Relaxed adversarial training for inner alignment.
One could imagine inspecting training-time reasoning and convincing yourself that way that future reasoning will be OK.
But reasoning could look different in different environments.
This perspective relies on things continuing to look pretty similar to current ML.
This would be alleviated if you could come up with some sort of sensible theory for how to make systems transparent.
I find it plausible that the development of such a theory should start with people messing around and doing things with systems they have.
Systems should be transparent to all relevant human stakeholders, not just developers.
Sounds right to me—I think people should work on this broader problem. But:
I don’t know how to solve that problem without making them transparent to developers initially.
I have ideas about how to solve the easier problem.
Overall take: Broadly agree that analyzing neural nets is useful and more work should go into it. Broadly disagree with the story for how this leads to reduced x-risk. Detailed comments below:
Background beliefs:
Broadly agree, with one caveat:
I’m assuming “guarantees” means something like “strong arguments”, and would include things like “when I train the agent on this loss function and it does well on the validation set, it will also do well on a test set drawn from the same distribution” (although I suppose you can prove that that holds with high probability). Perhaps a more interesting strong argument that’s not a proof but that might count as a guarantee would be something like “if I perform adversarial training with a sufficiently smart adversary, it is unlikely that the agent finds and fails on an example that was within the adversary’s search space”.
If you include these sorts of things as guarantees, then I think the training process “by default” won’t provide enough guarantees, but we might be able to get it to provide enough guarantees, e.g. by adversarial training. Alternatively, there will exist training processes that won’t provide enough guarantees but will knowably be likely to produce AGI; but there may also be versions that do provide enough guarantees.
Background desiderata:
This seems normative rather than empirical. Certainly we need some form of ‘zero-shot’ analysis—in particular, we must be able to predict whether a system causes x-risk in a zero-shot way (you can’t see any examples of a system actually causing x-risk). But depending on what exactly you mean, I think you’re probably aiming for too strong a property, one that’s unachievable given background facts about the world. (More explanation in the Transparency section.)
Ways in which this desideratum is unclear to me:
Why is the distinction between training and deployment important? Most methods of training involve the AI system acting. Are you hoping that the training process (e.g. gradient descent) leads to safety?
Presumably many forms of interpretability techniques involve computing specific outputs of the neural net in order to understand them. Why doesn’t this count as “running” the neural net?
My best guess is that you are distinguishing between the AI system acting in the real world during deployment (whereas training and interpretability were in simulation or with hypothetical inputs, or involved some other form of boxing that prevented it from “doing much real stuff”). What about training schemes in which the agent gradually becomes more and more exposed to the real world? Where is “deployment” then? (For example, consider OpenAI Five: while most of its training was in simulation, it played several games against humans during training, with more and more capable humans, and then eventually was “given access” to the full Internet via Arena. Which point was “deployment”?)
EDIT: Tbc, I think “deployment” is a relatively crisp concept when considering AI governance, where you can think of it as the point at which you release the AI system into the world and other actors besides the one that trained the system start interacting with it in earnest, and this point is a pretty important point in terms of the impacts of the AI system. For OpenAI Five, this would be the launch of Arena. But this sort of distinction seems much less relevant / crisp for AI alignment.
Transparency:
Mechanistic transparency seems incredibly difficult to achieve to me. As an analogy, I don’t think I understand how a laptop works at a mechanistic level, despite having a lot of training in Computer Science. This is a system that is built to be interpretable to humans, human civilization as a whole has a mechanistic understanding of laptops, and lots of effort has been put into creating good educational materials that most clearly convey a mechanistic understanding of (components of) laptops—we have none of these advantages for neural nets. Of course, a laptop is very complex; but I would expect an AGI-via-neural-nets to be pretty complex as well.
I also think that mechanistic transparency becomes much more difficult as systems become more complex: in the best case where the networks are nice and modular, it becomes linearly harder, which might keep the cost ratio the same (seems plausible to scale human effort spent understanding the net at the same rate that we scale model capacity), but if it is superlinearly harder (seems more likely to me, because I don’t expect it to be easy to identify human-interpretable modularity even when present), then as model capacity increases, human oversight becomes a larger and larger fraction of the cost.
Currently human oversight is already 99+% of the cost of mechanistically transparent image classifiers: Chris Olah and co. have spent multiple years on one image classifier and are maybe getting close to a mechanistic-ish understanding of it, though of course presumably future efforts would be less costly because they’ll have learned important lessons. (Otoh, things that aren’t image classifiers are probably harder to mechanistically understand, especially things that are better-than-human, as in e.g. AlphaGo’s move 37.)
Controversial, I’m pretty uncertain but weakly lean against. (Probably not worth discussing though, just wanted to note the disagreement.)
But interestingly, you can’t just use fewer neurons (corresponding to a low-dimensional subspace where the projection matrices consists of unit vectors along the axes) -- it has to be a random subspace. I think we don’t really understand what’s going on here and I wouldn’t update too much on the possibility of transparency from it (though it is weak evidence that regularization is possible and strong evidence that there are lots of good models).
Compare: There are a large number of NBA players, meaning that ones who are short can be found.
Looking at the results of the paper, it only seems to work for simple tasks, as you might expect. For the most neural-net-like task (recognizing stop phonemes from audio, which is still far simpler than e.g. speech recognition), the neural net gets ~0.95 AUC while the decision tree gets ~0.75 (a vast difference: random is 0.5 and perfect is 1).
Generally there seem to be people (e.g. Cynthia Rudin) who argue “we can have interpretability and accuracy”, and when you look at the details they are looking at some very low-dimensional, simple-looking tasks; I certainly agree with that (and that we should use interpretable models in these situations) but it doesn’t seem to apply to e.g. image classifiers or speech recognition, and seems like it would apply even less to AGI-via-neural-nets.
Huh? Surely if you’re trying to understand agents that arise, you should have a theory of arbitrary agents rather than ideal agents. John Wentworth’s stuff seems way more relevant than MIRI’s Agent Foundations for the purpose you have in mind.
I could see it being useful to do MIRI-style Agent Foundations work to discover what sorts of problems could arise, though I could imagine this happening in many other ways as well.
Transparency
I think it’s quite difficult to achieve but not impossible, and worth aiming for. My main take is that (a) it seems plausibly achievable and (b) we don’t really know how difficult it is to achieve because most people don’t seem very interested in trying to achieve it, so some people should spend a bunch of effort trying and seeing how it pans out. But, as mentioned in the first dotpoint in the criticisms section, I do regard this as an open question.
Note that I’m not asking for systems to be mechanistically transparent to people with backgrounds and training in the relevant field, just that they be mechanistically transparent to their developers. This is still difficult, but as far as I know it’s possible for laptops (although I could be wrong about this, I’m not a laptop expert).
This basically seems right to me, and as such I’m researching how to make networks modular and identify their modularity structure. It feels to me like this research is doing OK and is not obviously doomed.
I disagree: for instance, it seems way more likely to me that planning involves crisp mathematisable algorithms than that image recognition involves such algorithms.
Whoa, I’m so confused by that. It seems pretty clear to me that it’s easier to regularise for properties that have nicer, more ‘mathematical’ definitions, and if that’s false then I might just be fundamentally misunderstanding something.
I’d be shocked if there was anyone to whom it was mechanistically transparent how a laptop loads a website, down to the gates in the laptop.
I’d be surprised if there was anyone to whom it was mechanistically transparent how a laptop boots up, down to the gates in the laptop. (Note you’d have to understand the entire BIOS as well as all of the hardware in the laptop.)
It’s easier in the sense that it’s easier to compute it in Tensorflow and then use gradient descent to make the number smaller / bigger. But if you ignore that factor and ask whether a more mathematical definition will lead to more human-interpretability, then I don’t see a particular reason to expect mathematical definitions to work better.
I think my argument was more like “in the world where your modularity research works out perfectly, you get linear scaling, and then it still costs 100x to have a mechanistically-understood AI system relative to a black-box AI system, which seems prohibitively expensive”. And that’s without including a bunch of other difficulties:
Right now we’re working with subhuman AI systems where we already have concepts that we can use to understand AI systems; this will become much more difficult with superhuman AI systems.
All abstractions are leaky; as you build up hierarchies of abstractions for mechanistically understanding a neural net, the problems with your abstractions can cause you to miss potential problems. (As an analogy, when programming without any APIs / external code, you presumably mechanistically understand the code you write; yet bugs are common in such programming.)
With image classifiers we have the benefit of images being an input mechanism we are used to; it will presumably be a lot harder with input mechanisms we aren’t used to.
It is certainly not unimaginable to me that these problems get solved somehow, but to convince me to promote this particular story for AI alignment to attention (at least beyond the threshold of “a smart person I know is excited about it”), you’d need to have some story / hope for how to deal with these problems. (E.g. as you mention in your post, you could imagine dealing with the last one using something like iterated amplification? Maybe?)
Here are some other stories for preventing catastrophes:
Regulations / laws to not build powerful AI
Increasing AI researcher paranoia, so all AI researchers are very careful with powerful AI systems
BoMAI-style boxing (“all the powerful AI systems we build don’t care about anything that would make catastrophe instrumentally useful”)
Impact regularization (“all the AI systems we build don’t want to do something as high-impact as a catastrophe”)
Safety benchmarks (set of tests looking for common problems, updated as we encounter new problems) (“all the potentially dangerous AI systems we could have built failed one of the benchmark tests”)
Any of the AI alignment methods, e.g. value learning or iterated amplification (“we don’t build dangerous AI systems because we build aligned AI systems instead”)
Currently I find all of these stories more plausible than the story “we don’t deploy a dangerous AI system because the developers mechanistically understood the dangerous AI system, detected the danger, and decided not to deploy it”.
I want to emphasize that I think the general research direction is good and will be useful and I want more people to work on it (it makes the first, second, fifth and sixth bullet points above more effective); I only disagree with the story you’ve presented for how it reduces x-risk.
How this perspective could reduce the probability of catastrophes
To be clear: the way I imagine this research direction working is that somebody comes up with a theory of how to build aligned AI, roughly does that, and then uses some kind of transparency to check whether or not they succeeded. A big part of the attraction to me is that it doesn’t really depend on what exact way aligned AI gets built, as long as it’s built using methods roughly similar to modern neural network training. That being said, if it’s as hard as you think it will be, I don’t understand how it could usefully contribute to the dot points you mention.
Taking each of the bullet points I mentioned in turn:
You could imagine a law “we will not build AI systems that use >X amount of compute unless they are mechanistically transparent”. Then research on mechanistic transparency reduces the cost of such a law, making it more palatable to implement it.
The most obvious way to do this is to demonstrate that powerful AI systems are dangerous. One very compelling demonstration would be to train an AI system that we expect to be deceptive (that isn’t powerful enough to take over), make it mechanistically transparent, and show that it is deceptive.
Here, the mechanistic transparency would make the demonstration much more compelling (relative to a demonstration where you show deceptive behavior, but there’s the possibility that it’s just a weird bug in that particular scenario).
Mechanistic transparency opens up the possibility for safety tests of the form “train an AI system on this environment, and then use mechanistic transparency to check if it has learned <prohibited cognition>”. (You could imagine that the environment is small, or the models trained are small, and that’s why the cost of mechanistic transparency isn’t prohibitive.)
Informed oversight can be solved via universality or interpretability; worst-case optimization currently relies on “magic” interpretability techniques. Even if full mechanistic transparency is too hard to do, I would expect that insights along the way would be helpful. For example, perhaps in adversarial training, if the adversary shares weights with the agent, the adversary already “knows” what the agent is “thinking”, but it might need to use mechanistic transparency just for the final layer to understand what that part is doing.
If mechanistic transparency barely works and/or is super expensive, then presumably this law doesn’t look very good compared to other potential laws that prevent the building of powerful AI, so you’d think that marginal changes in how good we are at mechanistic transparency would do basically nothing (unless you’ve got the hope of ‘crossing the threshold’ to the point where this law becomes the most viable such law).
The other bullet points make sense though.
The costs of mechanistic transparency
I guess I don’t understand why linear scaling would imply this—in fact, I’d guess that training should probably be super-linear, since each backward pass takes linear time, but the more neurons you have, the bigger the parameter space, and so the greater the number of gradient steps you need to take to reach the optimum, right?
At any rate, I agree that 100x cost is probably somewhat too expensive. If that estimate comes from OpenAI’s efforts to understand image recognition, I think it’s too high, since we presumably learned a bunch about what to look for from their efforts. I also think you’re underweighting the benefits of having a better theory of how effective cognition is structured. Responding to your various bullet points:
I can’t think of any way around the fact that this will likely make the work harder. Ideally it would bring incidental benefits, though (once you understand new super-human concepts you can use them in other systems).
Once you have a model of a module such that if the module worked according to your model things would be fine, you can just train the module to better fit your model. Hopefully by re-training the modules independently, to the extent you have errors they’re uncorrelated and result in reduced performance rather than catastrophic failure.
I think this is a minor benefit. In most domains, specialists will understand the meanings of input data to their systems: I can’t think of any counterexamples, but perhaps you can. Then, once you understand the initial modules, you can understand their outputs in terms of their inputs, and by recursion it seems like you should be able to understand the inputs and outputs of all modules.
This paper on scaling laws for training language models seems like it should help us make a rough guess for how training scales. According to the paper, your loss in nats scales as C^(-0.05) if you’re only limited by compute cost C, and as N^(-0.08) if you’re only limited by the number of parameters N. If we can equate those in the limit, which is not at all obvious to me, that suggests that compute cost goes as the number of parameters to the 1.6 power, and the number of parameters is itself polynomial in the number of neurons. So the comprehension cost can be a small polynomial in the number of neurons, but it certainly can’t be exponential.
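The algebra in that dot point can be sanity-checked directly (the exponents are the ones quoted from the scaling-laws paper; equating the two limits is this thread's assumption, not the paper's claim):

```python
import math

# Quoted scaling exponents: loss ~ C^(-0.05) when compute-limited,
# loss ~ N^(-0.08) when parameter-limited.
alpha_C, alpha_N = 0.05, 0.08

# Equating C^(-alpha_C) = N^(-alpha_N) gives C = N^(alpha_N / alpha_C).
power = alpha_N / alpha_C
print(round(power, 6))  # 1.6: compute grows as parameters to the 1.6

# Numerical spot check at N = 1e9 parameters.
N = 1e9
C = N ** power
assert math.isclose(C ** (-alpha_C), N ** (-alpha_N), rel_tol=1e-9)
```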
Yup, that seems like a pretty reasonable estimate to me.
Note that my default model for “what should be the input to estimate difficulty of mechanistic transparency” would be the number of parameters, not number of neurons. If a neuron works over a much larger input (leading to more parameters), wouldn’t that make it harder to mechanistically understand?
Yeah, that’s plausible. This does mean the mechanistic transparency cost could scale sublinearly w.r.t compute cost, though I doubt it (for the other reasons I mentioned).
Nah, I just pulled a number out of nowhere. The estimate based on existing efforts would be way higher. Back of the envelope: it costs ~$50 to train on ImageNet (see here). Meanwhile, there have been probably around 10 person-years spent on understanding one image classifier? At $250k per person-year, that’s $2.5 million on understanding, making it 50,000x more expensive to understand it than to train it.
Things that would move this number down:
Including the researcher time in the cost to train on ImageNet. I think that we will soon (if we haven’t already) enter the regime where researcher cost < compute cost, so that would only change the conclusion by a factor of at most 2.
Using the cost for an unoptimized implementation, which would probably be > $50. I’d expect those optimizations to already be taken for the systems we care about—it’s way more important to get a 2x cost reduction when your training run costs $100 million than when your training run costs under $1000.
Including the cost of hyperparameter tuning. This also seems like a thing we will cause to be no more than a factor of 2, e.g. by using population-based training of hyperparameters.
Including the cost of data collection. This seems important, future data collection probably will be very expensive (even if simulating, there’s the compute cost of the simulation), but idk how to take it into account. Maybe decrease the estimate by a factor of 10?
You could also just use the model, if it’s fast. It would be interesting to see how well this works. My guess is that abstractions are leaky because there are no good non-leaky abstractions, which would predict that this doesn’t work very well.
I think this is basically just the same point as “the problem gets harder when the AI system is superhuman”, except the point is that the AI system becomes superhuman way faster on domains that are not native to humans, e.g. DNA, drug structures, protein folding, math intuition, relative to domains that are native to humans, like image classification.
Do we mechanistically understand laptops?
So, I don’t think I’m saying that you have to mechanistically understand how every single gate works—rather, that you should be able to understand intermediate-level sub-systems and how they combine to produce the functionality of the laptop. The understanding of the intermediate-level sub-systems has to be pretty complete, but probably need not be totally complete—in the laptop case, you can just model a uniform random error rate and you’ll be basically right, and I imagine there should be something analogous with neural networks. Of course, you need somebody to be in charge of understanding the neurons in order to build to your understanding of the intermediate-level sub-systems, but it doesn’t seem to me that there needs to be any single person who understands all the neurons entirely—or indeed even any single person who needs to understand all the intermediate-level sub-systems entirely.
I think I should not have used the laptop example, it’s not really communicating what I meant it to communicate. I was trying to convey “mechanistic transparency is hard” rather than “mechanistic transparency requires a single person to understand everything”.
I guess I still don’t understand why you believe mechanistic transparency is hard. The way I want to use the term, as far as I can tell, laptops are acceptably mechanistically transparent to the companies that create them. Do you think I’m wrong?
No, which is why I want to stop using the example.
(The counterfactual I was thinking of was more like “imagine we handed a laptop to 19th-century scientists, can they mechanistically understand it?” But even that isn’t a good analogy, it overstates the difficulty.)
Could you clarify why this is an important counterpoint? It seems obviously useful to understand mechanistic details of a laptop in order to debug it. You seem to be arguing the [ETA: weaker] claim that nobody understands an entire laptop “all at once”, as in, they can understand all the details in their head simultaneously. But such an understanding is almost never possible for any complex system, and yet we still try to approach it. So this objection could show that mechanistic transparency is hard in the limit, but it doesn’t show that mechanistic transparency is uniquely bad in any sense. Perhaps you disagree?
weaker claim?
This seems to be assuming that we have to be able to take any complex trained AGI-as-a-neural-net and determine whether or not it is dangerous. Under that assumption, I agree that the problem is itself very hard, and mechanistic transparency is not uniquely bad relative to other possibilities.
But my point is that because it is so hard to detect whether an arbitrary neural net is dangerous, you should be trying to solve a different problem. This only depends on the claim that mechanistic transparency is hard in an absolute sense, not a relative sense (given the problem it is trying to solve).
Relatedly, from Evan Hubinger:
All of the other stories for preventing catastrophe that I mentioned in the grandparent are tackling a hopefully easier problem than “detect whether an arbitrary neural net is dangerous”.
Oops yes. That’s the weaker claim, that I agree with. The stronger claim is that because we can’t understand something “all at once”, mechanistic transparency is too hard and so we shouldn’t take Daniel’s approach. But the way we understand laptops is also in a mechanistic sense. No one argues that because laptops are too hard to understand all at once, we shouldn’t try to understand them mechanistically.
I didn’t assume that. I objected to the specific example of a laptop as an instance of mechanistic transparency being too hard. Laptops are normally understood well because understanding can be broken into components and built up from abstractions. But our understanding of each component and abstraction is pretty mechanistic—and this understanding is useful.
Furthermore, because laptops did not fall out of the sky one day, but were instead slowly built up over successive years of research and development, they seem like a great example of how Daniel’s mechanistic transparency approach does not rely on us having to understand arbitrary systems. Just as we built up an understanding of laptops, presumably we could do the same with neural networks. This was my interpretation of why he is using Zoom In as an example.
Indeed, but I don’t think this was the crux of my objection.
Okay, I think I see the miscommunication.
The story you have is “the developers build a few small neural net modules that do one thing, mechanistically understand those modules, then use those modules to build newer modules that do ‘bigger’ things, and mechanistically understand those, and keep iterating this until they have an AGI”. Does that sound right to you? If so, I agree that by following such a process the developer team could get mechanistic transparency into the neural net the same way that laptop-making companies have mechanistic transparency into laptops.
The story I took away from this post is “we do end-to-end training with regularization for modularity, and then we get out a neural net with modular structure. We then need to understand this neural net mechanistically to ensure it isn’t dangerous”. This seems much more analogous to needing to mechanistically understand a laptop that “fell out of the sky one day” before we had ever made a laptop.
My critiques are primarily about the second story. My critique of the first story would be that it seems like you’re sacrificing a lot of competitiveness by having to develop the modules one at a time, instead of using end-to-end training.
You could imagine a synthesis of the two stories: train a medium-level smart thing end-to-end, look at what all the modules are doing, and use those modules when training a smarter thing.
Foundations
You’re right that you don’t just want a theory of ideal agents. But I think it’s sufficient to only have a theory of very good agents, and discard the systems that you train that aren’t very good agents. This is more true the more optimistic you are about ML producing very good agents.
Papers
I agree that none of the papers are incredibly convincing on their own. I’d say the most convincing empirical work so far is probably the sequence of posts on ‘circuits’ on Distill, starting with this one, but even that isn’t totally compelling. They’re just meant to provide some evidence that this sort of thing is possible, and to stand against the lack of papers proving that it isn’t (although of course even if true, that would be hard to prove).
Re: the Rashomon paper, you’re right, that implication doesn’t hold. But it is suggestive that there may well be ‘interpretable’ models that are near-optimal.
Re: the regularisation paper, I agree that it doesn’t work that well. But it’s the first paper in this line of work, and I think it’s plausibly illustrative of things that might work.
Background desiderata
For what it’s worth, I really dislike this terminology. Of course saying “I want X” is normative, and of course it’s based on empirical beliefs.
I’m imagining that during training, your ML system doesn’t control actuators which would allow it to pose an existential risk or other catastrophe (e.g. a computer screen watched by a human, the ability to send things over the internet). Basically, I want the zero-shot analysis to be done before the AI system can cause catastrophe, which during this piece I’m conflating with the training phase, although I guess they’re not identical.
I certainly hope that the training process of an advanced AI system leads to safety, but I’m not assuming that in this piece, as per the background beliefs.
It counts if the neural network’s outputs are related to actuators that can plausibly cause existential risk or other catastrophe. As such, I think these forms of interpretability techniques are suspect, but could be fine (e.g. if you could somehow construct a sandbox environment to test your neural network where the network’s sandboxed behaviour was informative about whether the network would cause catastrophe in the real world). That being said, I’m confused by this question, because I don’t think I claimed in the piece that typical interpretability techniques were useful.
I am basically abstracting away from the problem of figuring out when your neural network has access to actuators that can pose existential risk or other catastrophe, and hope somebody else solves this. I’d hope that in the training schemes you describe, you can determine that the agent won’t cause catastrophe before its first exposure to the real world, otherwise such a scheme seems irresponsible for systems that could cause catastrophes.
Here are two claims:
“If I were in charge of the world, I would ensure that no powerful AI system were deployed unless we had mechanistic transparency into that system, because anything short of that is an unacceptable level of risk”
“I think that we should push for mechanistic transparency, because by doing so we will cause developers not to deploy dangerous AI systems, because they will use mechanistic transparency techniques to identify when the AI system is dangerous”
There is an axis on which these two claims differ, where I would say the first one is normative and the second one is empirical. The phrase “perfect is the enemy of good” is also talking about this axis. What would you name that axis?
In any case, probably at this point you know what I mean. I would like to see more argumentation for the second kind of claim, and am trying to say that arguments for the first kind of claim are not likely to sway me.
Re: clarification of desideratum, that makes sense.
Re: the two claims, that’s different from what I thought you meant by the distinction. I would describe both dot points as being normative claims buttressed by empirical claims. To the extent that I see a difference, it’s that the first dot point is perhaps addressing low-probability risks, while the second is addressing medium-to-high-probability risks. I think that pushing for mechanistic transparency would address medium-to-high-probability risks, but don’t argue for that here, since I think the arguments for medium-to-high-probability risk from AI are better made elsewhere.
Hmm, I was more pointing at the distinction where the first claim doesn’t need to argue for the subclaim “we will be able to get people to use mechanistic transparency” (it’s assumed away by “if I were in charge of the world”), while the second claim does have to argue for it.
The way I read this, if the research community enables the developers to determine these things at prohibitive cost, then we mostly haven’t “allowed” them to do it, but if the cost is manageable then we have. So I’d say my desiderata here (and also in my head) include the cost being manageable. If the cost of any such approach were necessarily prohibitive, I wouldn’t be very excited about it.
Background beliefs
I do include those sorts of things as guarantees. I do think it’s possible that adversarial training will provide such guarantees, but I think it’s difficult for the reasons that I’ve mentioned, and that a sufficient adversary will itself need to have a good deal of transparency into the system in order to come up with cases where the system will fail.
Thanks for the detailed comment! As is typical for me, I’ll respond to the easiest and least important part first.
Short NBA players have existed: according to Wikipedia, Muggsy Bogues was 1.60 m tall (or 5 feet 3 inches) and active until 2001. The shortest currently active NBA player is Isaiah Thomas, who is 1.75 m tall (or 5 feet 9 inches). This is apparently basically the median male height in the USA (surprisingly-to-me, both among all Americans and among African-Americans).
I was mostly trying to illustrate the point, but if you want a different example:
or
I greatly appreciate you writing your thoughts up. I have a few questions about your agenda/optimism regarding particular approaches.
Let me know if you’d agree with the following. The mechanistic approach is about understanding the internal structure of a program and how it behaves on arbitrary inputs. Mechanistic transparency is quite different from the more typical meaning of interpretability where we would like to know why an AI did something on a particular input.
We could consider the following algorithms mechanistically transparent:
A small decision tree
The minimax algorithm
An explicit expected utility maximizer with a simple understandable utility function
Quicksort
We could consider the following algorithms interpretable but not necessarily mechanistically transparent:
A large decision tree
k-nearest neighbors on a 2 dimensional input
A human who is asked to show their work on an exam
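As a concrete illustration of the “mechanistically transparent” end of this spectrum, here is a minimal minimax sketch (my own hypothetical example, not code from the discussion). Its behaviour on arbitrary inputs follows from a few lines of explicit logic, which is the sense in which the minimax algorithm above counts as mechanistically transparent:

```python
# A minimal minimax sketch: a candidate for "mechanistic transparency",
# since its behaviour on any input is fully determined by a few readable
# lines. All names here are illustrative assumptions.

def minimax(state, depth, maximizing, score, moves):
    """Return the minimax value of `state`.

    score(state) -> numeric evaluation of a state;
    moves(state) -> list of successor states (empty if terminal).
    """
    successors = moves(state)
    if depth == 0 or not successors:
        return score(state)
    values = [minimax(s, depth - 1, not maximizing, score, moves)
              for s in successors]
    return max(values) if maximizing else min(values)

# Toy game tree: nested lists are internal nodes, numbers are leaf scores.
def moves(state):
    return state if isinstance(state, list) else []

def score(state):
    return state if isinstance(state, (int, float)) else 0

# The maximizer picks the branch whose worst case is best:
# min(3, 5) = 3 beats min(2, 9) = 2, so the value of the root is 3.
value = minimax([[3, 5], [2, 9]], depth=2, maximizing=True,
                score=score, moves=moves)
```

Contrast this with a large trained network: even if we can interpret individual outputs, nothing like this short, input-independent description of the computation is available.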
I have two main questions:
First, it seems like algorithms that are mechanistically transparent mainly derive their transparency from having a simple core mathematical backbone. But as Wei Dai pointed out, “My guess is that if you took a human-level AGI that was the result of something like deep learning optimizing only for capability (and not understandability), and tried to interpret it as pseudocode, you’ll end up with so many modules with so many interactions between them that no human or team of humans could understand it. In other words, you’ll end up with spaghetti code written by a superintelligence (meaning the training process).” I am finding it hard to believe that there will be a simple basin that we can regularize an AGI into. Do you agree? If so, why do you think that the mechanistic approach is more promising?
Second, why is mechanistic transparency important in the first place? What places do you concretely see it being helpful for understanding how systems work, specifically with respect to alignment?
To understand why I’m asking the question better, let’s imagine a human (who is interpretable but not mechanistically transparent) and a mechanistically transparent minimax robot playing a game of chess. In the midgame, I ask the human why they moved their queen into the enemy’s territory.
“Do you see a route to checkmate from here?” I ask. “No. I just wanted to get more aggressive. I am setting up to move my bishop in next, and I will try to see if I can force a defeat from there.”
The robot responds by moving their rook forward, and I ask the robot why they did that. They reply, “I analyzed 918912 moves and countermoves and discovered that this one had the minimum possible loss out of all possible countermoves from my opponent, using this scoring system for the loss.”
Now, I ask, if we wanted to learn what mistakes each algorithm was making, what type of transparency helps more in your opinion?
I agree with your sentence about the mechanistic approach. I think the word “interpretable” has very little specific meaning, but most work is about particular inputs. I agree that your examples divide up into what I would consider mechanistically transparent vs not, depending on exactly how large the decision tree is, but I can’t speak to whether they all count as “interpretable”.
I think it’s plausible that there will be a simple basin that we can regularise an AGI into, because I have some ideas about how to do it, and because the world hasn’t thought very hard about the problem yet (meaning the lack of extant solutions is to some extent explained away). I also think that there exists a relatively simple mathematical backbone to intelligence to be found (but not that all intelligent systems have this backbone), because I think promising progress has been made in mathematising a bunch of relevant concepts (see probability theory, utility theory, AIXI, reflective oracles). But this might be a bias from ‘growing up’ academically in Marcus Hutter’s lab.
You haven’t deployed a system, don’t know the kinds of situations it might encounter, and want reason to believe that it will perform well (e.g. by not trying to kill everyone) in these situations that you can’t simulate. That being said, I have the feeling that this answer isn’t satisfactorily detailed, so maybe you want more detail, or are thinking of a critique I haven’t thought of?
In this situation, the first answer is more likely to reveal some specific high-level mistakes the player might make, and provides affordance for a chess player to give advice for how to improve. The second answer seems like it’s more amenable to mathematical analysis, generalises better across boards, less likely to be confabulated, and provides a better handle for how to directly improve the algorithm (basically, read forward more than one move). So I guess the first answer better reveals chess mistakes, and the second better reveals cognitive mistakes.
That makes sense. More pessimistically, one could imagine that the reason why no one has thought very hard about it is because in practice, it doesn’t really help you that much to have a mechanistic understanding of a neural network in order to do useful work. Though perhaps as AI becomes more ‘agentic’ you think that will cease to be the case?
I had read your comment thread on Realism about Rationality a while back, and I was under the impression that your stance was something like “rationality is as real as liberalism” or something like that. A relatively simple backbone in the same ballpark as probability theory, utility theory etc. seems way more realist than that.
I also have an intuition for why focusing on these mathematical theories might bias us towards thinking that intelligence can be described mathematically, but it’s a difficult intuition to convey, so bear with me.
First, an observation: the reason why the simple theories of intelligence don’t produce intelligence in practice is that direct computations for them are extremely expensive. There are ways to reduce the compute required for them to work, but the “things you do to increase the compute efficiency of intelligence” are arguably the hardest part of building intelligent machines, and the part that makes up the majority of the conceptual space for understanding them. Therefore, understanding real-world intelligent machines requires mostly understanding the tricks they use to be compute-efficient, rather than understanding the mathematical underpinnings.
This intuition is a bit vague, but maybe you saw what I was trying to say?
I care primarily about AI deception at the moment, and I suspect the biggest reason an AI would deceive us is that it received an off-distribution input that caused it to act weird. Input-specific interpretability allows us to detect those cases when they arise. Mechanistic transparency might help, but only if the mathematical description of the AI is amenable to real-world analysis.
Most likely, a mathematical description will be long and complex, and the developers will have to pay a high cost to understand how the description could imply deception (But given what you said above about a simple basin, I think this is probably a crux).
I’ll just respond to the easy part of this for now.
That’s not what I said. Because it takes ages to scroll down to comments and I’m on my phone, I can’t easily link to the relevant comments, but basically I said that rationality is probably as formalisable as electromagnetism, but that theories as precise as that of liberalism can still be reasoned about and built on.
That’s fair. I didn’t actually quite understand what your position was and was trying to clarify.
FWIW I take this work on ‘circuits’ in an image recognition CNN to be a bullish indicator for the possibility of mechanistic transparency.
I think I just think the ‘market’ here is ‘inefficient’? Like, I think this just isn’t a thing that people have really thought of, and those that have, have gained semi-useful insight into neural networks by doing similar things (e.g. figuring out that adding a picture of a baseball to a whale fin will cause a network to misclassify the image as a great white shark). It also seems to me that recognition tasks (as opposed to planning/reasoning tasks) are going to be the hardest to get this kind of mechanistic transparency for, and also the kinds of tasks where input-specific transparency is easiest and ML systems are best.
I think I understand what you mean here, but also think that there can be tricks that reduce computational cost that have some sort of mathematical backbone—it seems to me that this is common in the study of algorithms. Note also that we don’t have to understand all possible real-world intelligent machines, just the ones that we build, making the requirement less stringent.
Asya’s summary for the Alignment Newsletter:
My opinion:
This is great, thanks! A whole new take on what our goal is!
It’s especially exciting to me because occasionally in conversation I’ve said things like “OK so if we do more decision theory research, we can find and understand cases in which having the wrong decision theory can get us killed, and then that in turn can guide AI research—people can keep those cases in mind when designing test suites and stuff” and people have been like “Nah, that isn’t realistic, no one is going to listen to you talk about hypothetical failure modes” or something. (My memory is fuzzy) Now that you’ve written this post, I have a clearer sense of the background path-to-impact that I must have been having in mind when saying things like that, and also a clearer sense of what the objections are.
On that note, would you agree that the example I sketched above is an example of the sort of thing that fits in your project? Or is finding decision-theoretic problem cases not part of agent foundations, or not relevantly similar enough in your mind?
It seems pretty closely linked to the agent foundations side of this perspective, but I’d say “my project” for the duration of my PhD is on the transparency side.
Also, it’s gratifying to hear this post was useful to someone other than me :)
FYI, it would be useful to know if people liked having the workflowy link.
I liked it.
You already touch on this some, but do you imagine this perspective allowing you, at least ideally, to create a “complete” filter, in the sense that the filtering process would be capable of catching all unsafe and unaligned AI? If so, what are the criteria under which you might be able to achieve that, and if not, what predictable gaps do you expect your filter to have?
(I think you’ve already given a partial answer in your post, but given the way you set up this post with talk about the filter, I was curious to understand what you explicitly think about this aspect of it.)
I guess I’m imagining transparency tools that combine to say “OK”, “dangerous”, or “don’t know”, and the question is how often it has to answer “don’t know”. Given that analysis tools typically only work for certain types of systems, and ML training takes many forms, I suppose you’ll need to take some pains to ensure that your system is compatible with existing transparency tools. But I haven’t explicitly thought about this very much, and am just giving a quick answer.
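One hypothetical way the combination of verdicts described above could work (this is my sketch of the conservative case, not something specified in the comment): any tool reporting “dangerous” dominates, any remaining uncertainty yields “don’t know”, and only unanimous approval yields “OK”.

```python
# Hypothetical conservative combination of per-tool transparency verdicts.
# Each tool reports one of "OK", "dangerous", or "don't know".

def combine(verdicts):
    if "dangerous" in verdicts:
        return "dangerous"   # any detected danger blocks deployment
    if "don't know" in verdicts:
        return "don't know"  # unresolved uncertainty also blocks it
    return "OK"              # only unanimous approval passes the filter
```

Under this rule, the practical question the comment raises is exactly how often the combined answer is “don’t know”, since a filter that almost always abstains provides little value.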
Or at least not in a recognisably relevant-to-your-question way.