Note that I’m not asking for systems to be mechanistically transparent to people with backgrounds and training in the relevant field, just that they be mechanistically transparent to their developers. This is still difficult, but as far as I know it’s possible for laptops (although I could be wrong about this, I’m not a laptop expert).
I’d be shocked if there was anyone to whom it was mechanistically transparent how a laptop loads a website, down to the gates in the laptop.
I’d be surprised if there was anyone to whom it was mechanistically transparent how a laptop boots up, down to the gates in the laptop. (Note you’d have to understand the entire BIOS as well as all of the hardware in the laptop.)
Whoa, I’m so confused by that. It seems pretty clear to me that it’s easier to regularise for properties that have nicer, more ‘mathematical’ definitions, and if that’s false then I might just be fundamentally misunderstanding something.
It’s easier in the sense that it’s easier to compute it in Tensorflow and then use gradient descent to make the number smaller / bigger. But if you ignore that factor and ask whether a more mathematical definition will lead to more human-interpretability, then I don’t see a particular reason to expect mathematical definitions to work better.
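To make the "easier to compute and then push down with gradient descent" sense concrete, here is a minimal sketch in plain Python rather than Tensorflow. The quadratic task loss, the squared-weight penalty, and the names `task_loss`, `penalty` and `minimize` are all made up for illustration. It only shows that a property with a clean mathematical definition can be added to the loss and optimised, and says nothing about whether the result is more human-interpretable.

```python
# Toy illustration: regularising for a property that has a simple formula.
# Everything here is hypothetical; a real setup would use an autodiff library.

def task_loss(w):
    # Stand-in for the task objective; minimised at w = 3.
    return (w - 3.0) ** 2

def penalty(w):
    # Stand-in for a "mathematically nice" property we want to make small.
    return w ** 2

def grad_total(w, lam):
    # Hand-written gradient of task_loss(w) + lam * penalty(w).
    return 2.0 * (w - 3.0) + lam * 2.0 * w

def minimize(lam, lr=0.1, steps=200, w0=0.0):
    # Plain gradient descent on the combined objective.
    w = w0
    for _ in range(steps):
        w -= lr * grad_total(w, lam)
    return w

w_plain = minimize(lam=0.0)  # no regulariser
w_reg = minimize(lam=0.5)    # regulariser trades task loss for a smaller penalty
print(w_plain, penalty(w_plain))
print(w_reg, penalty(w_reg))
```

With the regulariser switched on, the optimum moves from w = 3 to w = 3 / (1 + lam), so the penalised quantity shrinks at some cost in task loss.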
This basically seems right to me, and as such I’m researching how to make networks modular and identify their modularity structure. It feels to me like this research is doing OK and is not obviously doomed.
I think my argument was more like “in the world where your modularity research works out perfectly, you get linear scaling, and then it still costs 100x to have a mechanistically-understood AI system relative to a black-box AI system, which seems prohibitively expensive”. And that’s without including a bunch of other difficulties:
Right now we’re working with subhuman AI systems where we already have concepts that we can use to understand AI systems; this will become much more difficult with superhuman AI systems.
All abstractions are leaky; as you build up hierarchies of abstractions for mechanistically understanding a neural net, the problems with your abstractions can cause you to miss potential problems. (As an analogy, when programming without any APIs / external code, you presumably mechanistically understand the code you write; yet bugs are common in such programming.)
With image classifiers we have the benefit of images being an input mechanism we are used to; it will presumably be a lot harder with input mechanisms we aren’t used to.
It is certainly not unimaginable to me that these problems get solved somehow, but to convince me to promote this particular story for AI alignment to attention (at least beyond the threshold of “a smart person I know is excited about it”), you’d need to have some story / hope for how to deal with these problems. (E.g. as you mention in your post, you could imagine dealing with the last one using something like iterated amplification? Maybe?)
Here are some other stories for preventing catastrophes:
Regulations / laws to not build powerful AI
Increasing AI researcher paranoia, so all AI researchers are very careful with powerful AI systems
BoMAI-style boxing (“all the powerful AI systems we build don’t care about anything that would make catastrophe instrumentally useful”)
Impact regularization (“all the AI systems we build don’t want to do something as high-impact as a catastrophe”)
Safety benchmarks (set of tests looking for common problems, updated as we encounter new problems) (“all the potentially dangerous AI systems we could have built failed one of the benchmark tests”)
Any of the AI alignment methods, e.g. value learning or iterated amplification (“we don’t build dangerous AI systems because we build aligned AI systems instead”)
Currently I find all of these stories more plausible than the story “we don’t deploy a dangerous AI system because the developers mechanistically understood the dangerous AI system, detected the danger, and decided not to deploy it”.
I want to emphasize that I think the general research direction is good and will be useful and I want more people to work on it (it makes the first, second, fifth and sixth bullet points above more effective); I only disagree with the story you’ve presented for how it reduces x-risk.
How this perspective could reduce the probability of catastrophes
To be clear: the way I imagine this research direction working is that somebody comes up with a theory of how to build aligned AI, roughly does that, and then uses some kind of transparency to check whether or not they succeeded. A big part of the attraction to me is that it doesn’t really depend on what exact way aligned AI gets built, as long as it’s built using methods roughly similar to modern neural network training. That being said, if it’s as hard as you think it will be, I don’t understand how it could usefully contribute to the dot points you mention.
Taking each of the bullet points I mentioned in turn:
Regulations / laws to not build powerful AI
You could imagine a law “we will not build AI systems that use >X amount of compute unless they are mechanistically transparent”. Then research on mechanistic transparency reduces the cost of such a law, making it more palatable to implement it.
Increasing AI researcher paranoia, so all AI researchers are very careful with powerful AI systems
The most obvious way to do this is to demonstrate that powerful AI systems are dangerous. One very compelling demonstration would be to train an AI system that we expect to be deceptive (that isn’t powerful enough to take over), make it mechanistically transparent, and show that it is deceptive.
Here, the mechanistic transparency would make the demonstration much more compelling (relative to a demonstration where you show deceptive behavior, but there’s the possibility that it’s just a weird bug in that particular scenario).
Safety benchmarks (set of tests looking for common problems, updated as we encounter new problems) (“all the potentially dangerous AI systems we could have built failed one of the benchmark tests”)
Mechanistic transparency opens up the possibility for safety tests of the form “train an AI system on this environment, and then use mechanistic transparency to check if it has learned <prohibited cognition>”. (You could imagine that the environment is small, or the models trained are small, and that’s why the cost of mechanistic transparency isn’t prohibitive.)
Any of the AI alignment methods, e.g. value learning or iterated amplification (“we don’t build dangerous AI systems because we build aligned AI systems instead”)
Informed oversight can be solved via universality or interpretability; worst-case optimization currently relies on “magic” interpretability techniques. Even if full mechanistic transparency is too hard to do, I would expect that insights along the way would be helpful. For example, perhaps in adversarial training, if the adversary shares weights with the agent, the adversary already “knows” what the agent is “thinking”, but it might need to use mechanistic transparency just for the final layer to understand what that part is doing.
You could imagine a law “we will not build AI systems that use >X amount of compute unless they are mechanistically transparent”. Then research on mechanistic transparency reduces the cost of such a law, making it more palatable to implement it.
If mechanistic transparency barely works and/or is super expensive, then presumably this law doesn’t look very good compared to other potential laws that prevent the building of powerful AI, so you’d think that marginal changes in how good we are at mechanistic transparency would do basically nothing (unless you’ve got the hope of ‘crossing the threshold’ to the point where this law becomes the most viable such law).
I think my argument was more like “in the world where your modularity research works out perfectly, you get linear scaling, and then it still costs 100x to have a mechanistically-understood AI system relative to a black-box AI system, which seems prohibitively expensive”.
I guess I don’t understand why linear scaling would imply this—in fact, I’d guess that training should probably be super-linear, since each backward pass takes linear time, but the more neurons you have, the bigger the parameter space, and so the greater number of gradient steps you need to take to reach the optimum, right?
At any rate, I agree that 100x cost is probably somewhat too expensive. If that estimate comes from OpenAI’s efforts to understand image recognition, I think it’s too high, since we presumably learned a bunch about what to look for from their efforts. I also think you’re underweighting the benefits of having a better theory of how effective cognition is structured. Responding to your various bullet points:
Right now we’re working with subhuman AI systems where we already have concepts that we can use to understand AI systems; this will become much more difficult with superhuman AI systems.
I can’t think of any way around the fact that this will likely make the work harder. Ideally it would bring incidental benefits, though (once you understand new super-human concepts you can use them in other systems).
All abstractions are leaky; as you build up hierarchies of abstractions for mechanistically understanding a neural net, the problems with your abstractions can cause you to miss potential problems.
Once you have a model of a module such that if the module worked according to your model things would be fine, you can just train the module to better fit your model. Hopefully by re-training the modules independently, to the extent you have errors they’re uncorrelated and result in reduced performance rather than catastrophic failure.
With image classifiers we have the benefit of images being an input mechanism we are used to; it will presumably be a lot harder with input mechanisms we aren’t used to.
I think this is a minor benefit. In most domains, specialists will understand the meanings of input data to their systems: I can’t think of any counterexamples, but perhaps you can. Then, once you understand the initial modules, you can understand their outputs in terms of their inputs, and by recursion it seems like you should be able to understand the inputs and outputs of all modules.
This paper on scaling laws for training language models seems like it should help us make a rough guess for how training scales. According to the paper, your loss in nats if you’re only limited by cost C scales as C^(-0.05), and if you’re only limited by number of parameters N it scales as N^(-0.08). If we can equate those in the limit, which is not at all obvious to me, that suggests that cost goes as number of parameters to the 1.6 power (since 0.08 / 0.05 = 1.6), and number of parameters itself is polynomial in the number of neurons. So, the comprehension can be a little polynomial in the number of neurons, but it certainly can’t be exponential.
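The arithmetic behind the 1.6 figure can be sketched as follows. This takes the two exponents quoted above at face value and assumes, as the comment itself flags as uncertain, that the two power laws can be equated. The function names are made up for illustration.

```python
import math

# Exponents as quoted in the comment (loss ~ C^(-0.05) and loss ~ N^(-0.08)).
ALPHA_C = 0.05  # compute-limited exponent
ALPHA_N = 0.08  # parameter-limited exponent

def cost_exponent(alpha_c, alpha_n):
    # Setting C^(-alpha_c) = N^(-alpha_n) gives C = N^(alpha_n / alpha_c).
    return alpha_n / alpha_c

def relative_cost(param_ratio, alpha_c=ALPHA_C, alpha_n=ALPHA_N):
    # Factor by which training cost grows when the parameter count
    # grows by param_ratio, under the equating assumption above.
    return param_ratio ** cost_exponent(alpha_c, alpha_n)

exponent = cost_exponent(ALPHA_C, ALPHA_N)
print(exponent)             # roughly 1.6
print(relative_cost(10.0))  # cost multiplier for 10x the parameters
```

Under these assumptions, ten times as many parameters costs a bit under forty times as much to train, which is polynomial rather than exponential growth.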
Yup, that seems like a pretty reasonable estimate to me.
Note that my default model for “what should be the input to estimate difficulty of mechanistic transparency” would be the number of parameters, not number of neurons. If a neuron works over a much larger input (leading to more parameters), wouldn’t that make it harder to mechanistically understand?
I guess I don’t understand why linear scaling would imply this—in fact, I’d guess that training should probably be super-linear, since each backward pass takes linear time, but the more neurons you have, the bigger the parameter space, and so the greater number of gradient steps you need to take to reach the optimum, right?
Yeah, that’s plausible. This does mean the mechanistic transparency cost could scale sublinearly w.r.t. compute cost, though I doubt it (for the other reasons I mentioned).
If that estimate comes from OpenAI’s efforts to understand image recognition, I think it’s too high, since we presumably learned a bunch about what to look for from their efforts.
Nah, I just pulled a number out of nowhere. The estimate based on existing efforts would be way higher. Back of the envelope: it costs ~$50 to train on ImageNet (see here). Meanwhile, there have been probably around 10 person-years spent on understanding one image classifier? At $250k per person-year, that’s $2.5 million on understanding, making it 50,000x more expensive to understand it than to train it.
Things that would move this number down:
Including the researcher time in the cost to train on ImageNet. I think that we will soon (if we haven’t already) enter the regime where researcher cost < compute cost, so that would only change the conclusion by a factor of at most 2.
Using the cost for an unoptimized implementation, which would probably be > $50. I’d expect those optimizations to already be taken for the systems we care about—it’s way more important to get a 2x cost reduction when your training run costs $100 million than when your training run costs under $1000.
Including the cost of hyperparameter tuning. This also seems like something we will be able to keep to no more than a factor of 2, e.g. by using population-based training of hyperparameters.
Including the cost of data collection. This seems important: future data collection will probably be very expensive (even if simulating, there’s the compute cost of the simulation), but idk how to take it into account. Maybe decrease the estimate by a factor of 10?
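The back-of-the-envelope numbers above can be written out as a short calculation. The dollar figures and factors are the ones given in the comment. Treating the three adjustments as independent multiplicative discounts, each taken at its stated maximum, is an assumption made here for illustration, and the unoptimized-implementation point is left out because no factor is given for it.

```python
# Rough estimate from the comment: cost to understand vs. cost to train.
TRAIN_COST_USD = 50.0                # ~$50 to train on ImageNet
PERSON_YEARS = 10.0                  # guessed effort spent understanding one classifier
COST_PER_PERSON_YEAR_USD = 250_000.0

understanding_cost = PERSON_YEARS * COST_PER_PERSON_YEAR_USD
ratio = understanding_cost / TRAIN_COST_USD

# Adjustments that would move the ratio down (assumed multiplicative here).
ADJUSTMENTS = {
    "researcher time included in training cost": 2.0,  # "at most 2"
    "hyperparameter tuning": 2.0,                       # "no more than a factor of 2"
    "data collection": 10.0,                            # "maybe ... a factor of 10"
}

adjusted_ratio = ratio
for factor in ADJUSTMENTS.values():
    adjusted_ratio /= factor

print(understanding_cost)  # 2500000.0
print(ratio)               # 50000.0
print(adjusted_ratio)      # 1250.0
```

Even with every adjustment applied at its maximum, the estimate stays above the 100x figure discussed earlier.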
Once you have a model of a module such that if the module worked according to your model things would be fine, you can just train the module to better fit your model.
You could also just use the model, if it’s fast. It would be interesting to see how well this works. My guess is that abstractions are leaky because there are no good non-leaky abstractions, which would predict that this doesn’t work very well.
I think this is a minor benefit. In most domains, specialists will understand the meanings of input data to their systems
I think this is basically just the same point as “the problem gets harder when the AI system is superhuman”, except the point is that the AI system becomes superhuman way faster on domains that are not native to humans, e.g. DNA, drug structures, protein folding, math intuition, relative to domains that are native to humans, like image classification.
I’d be shocked if there was anyone to whom it was mechanistically transparent how a laptop loads a website, down to the gates in the laptop.
So, I don’t think I’m saying that you have to mechanistically understand how every single gate works—rather, that you should be able to understand intermediate-level sub-systems and how they combine to produce the functionality of the laptop. The understanding of the intermediate-level sub-systems has to be pretty complete, but probably need not be totally complete—in the laptop case, you can just model a uniform random error rate and you’ll be basically right, and I imagine there should be something analogous with neural networks. Of course, you need somebody to be in charge of understanding the neurons in order to build to your understanding of the intermediate-level sub-systems, but it doesn’t seem to me that there needs to be any single person who understands all the neurons entirely—or indeed even any single person who needs to understand all the intermediate-level sub-systems entirely.
I think I should not have used the laptop example; it’s not really communicating what I meant it to communicate. I was trying to convey “mechanistic transparency is hard” rather than “mechanistic transparency requires a single person to understand everything”.
I guess I still don’t understand why you believe mechanistic transparency is hard. The way I want to use the term, as far as I can tell, laptops are acceptably mechanistically transparent to the companies that create them. Do you think I’m wrong?
No, which is why I want to stop using the example.
(The counterfactual I was thinking of was more like “imagine we handed a laptop to 19th-century scientists, can they mechanistically understand it?” But even that isn’t a good analogy, it overstates the difficulty.)
I’d be shocked if there was anyone to whom it was mechanistically transparent how a laptop loads a website, down to the gates in the laptop.
Could you clarify why this is an important counterpoint? It seems obviously useful to understand mechanistic details of a laptop in order to debug it. You seem to be arguing the [ETA: weaker] claim that nobody understands an entire laptop “all at once”, as in, they can understand all the details in their head simultaneously. But such an understanding is almost never possible for any complex system, and yet we still try to approach it. So this objection could show that mechanistic transparency is hard in the limit, but it doesn’t show that mechanistic transparency is uniquely bad in any sense. Perhaps you disagree?
You seem to be arguing the stronger claim that nobody understands an entire laptop “all at once”
weaker claim?
But such an understanding is almost never possible for any complex system, and yet we still try to approach it. So this objection could show that mechanistic transparency is hard in the limit, but it doesn’t show that mechanistic transparency is uniquely bad in any sense.
This seems to be assuming that we have to be able to take any complex trained AGI-as-a-neural-net and determine whether or not it is dangerous. Under that assumption, I agree that the problem is itself very hard, and mechanistic transparency is not uniquely bad relative to other possibilities.
But my point is that because it is so hard to detect whether an arbitrary neural net is dangerous, you should be trying to solve a different problem. This only depends on the claim that mechanistic transparency is hard in an absolute sense, not a relative sense (given the problem it is trying to solve).
Put another way: once you’re playing the game where I can hand you any model and then you have to figure out whether it’s deceptive or not, you’ve already lost. Instead, you want to be in the regime where your training process is constructed so as to steer clear of situations in which your model might become deceptive in the first place.
All of the other stories for preventing catastrophe that I mentioned in the grandparent are tackling a hopefully easier problem than “detect whether an arbitrary neural net is dangerous”.
Oops, yes. That’s the weaker claim, which I agree with. The stronger claim is that because we can’t understand something “all at once”, mechanistic transparency is too hard and so we shouldn’t take Daniel’s approach. But the way we understand laptops is also in a mechanistic sense. No one argues that because laptops are too hard to understand all at once, we shouldn’t try to understand them mechanistically.
This seems to be assuming that we have to be able to take any complex trained AGI-as-a-neural-net and determine whether or not it is dangerous. Under that assumption, I agree that the problem is itself very hard, and mechanistic transparency is not uniquely bad relative to other possibilities.
I didn’t assume that. I objected to the specific example of a laptop as an instance of mechanistic transparency being too hard. Laptops are normally understood well because understanding can be broken into components and built up from abstractions. But our understanding of each component and abstraction is pretty mechanistic, and this understanding is useful.
Furthermore, because laptops did not fall out of the sky one day, but were instead built up slowly over successive years of research and development, they seem like a great example of how Daniel’s mechanistic transparency approach does not rely on us having to understand arbitrary systems. Just as we built up an understanding of laptops, presumably we could do the same with neural networks. This was my interpretation of why he is using Zoom In as an example.
All of the other stories for preventing catastrophe that I mentioned in the grandparent are tackling a hopefully easier problem than “detect whether an arbitrary neural net is dangerous”.
Indeed, but I don’t think this was the crux of my objection.
The story you have is “the developers build a few small neural net modules that do one thing, mechanistically understand those modules, then use those modules to build newer modules that do ‘bigger’ things, and mechanistically understand those, and keep iterating this until they have an AGI”. Does that sound right to you? If so, I agree that by following such a process the developer team could get mechanistic transparency into the neural net the same way that laptop-making companies have mechanistic transparency into laptops.
The story I took away from this post is “we do end-to-end training with regularization for modularity, and then we get out a neural net with modular structure. We then need to understand this neural net mechanistically to ensure it isn’t dangerous”. This seems much more analogous to needing to mechanistically understand a laptop that “fell out of the sky one day” before we had ever made a laptop.
My critiques are primarily about the second story. My critique of the first story would be that it seems like you’re sacrificing a lot of competitiveness by having to develop the modules one at a time, instead of using end-to-end training.
You could imagine a synthesis of the two stories: train a medium-level smart thing end-to-end, look at what all the modules are doing, and use those modules when training a smarter thing.
How this perspective could reduce the probability of catastrophes
To be clear: the way I imagine this research direction working is that somebody comes up with a theory of how to build aligned AI, roughly does that, and then uses some kind of transparency to check whether or not they succeeded. A big part of the attraction to me is that it doesn’t really depend on what exact way aligned AI gets built, as long as it’s built using methods roughly similar to modern neural network training. That being said, if it’s as hard as you think it will be, I don’t understand how it could usefully contribute to the dot points you mention.
Taking each of the bullet points I mentioned in turn:
You could imagine a law “we will not build AI systems that use >X amount of compute unless they are mechanistically transparent”. Then research on mechanistic transparency reduces the cost of such a law, making it more palatable to implement it.
The most obvious way to do this is to demonstrate that powerful AI systems are dangerous. One very compelling demonstration would be to train an AI system that we expect to be deceptive (that isn’t powerful enough to take over), make it mechanistically transparent, and show that it is deceptive.
Here, the mechanistic transparency would make the demonstration much more compelling (relative to a demonstration where you show deceptive behavior, but there’s the possibility that it’s just a weird bug in that particular scenario).
Mechanistic transparency opens up the possibility for safety tests of the form “train an AI system on this environment, and then use mechanistic transparency to check if it has learned <prohibited cognition>”. (You could imagine that the environment is small, or the models trained are small, and that’s why the cost of mechanistic transparency isn’t prohibitive.)
Informed oversight can be solved via universality or interpretability; worst-case optimization currently relies on “magic” interpretability techniques. Even if full mechanistic transparency is too hard to do, I would expect that insights along the way would be helpful. For example, perhaps in adversarial training, if the adversary shares weights with the agent, the adversary already “knows” what the agent is “thinking”, but it might need to use mechanistic transparency just for the final layer to understand what that part is doing.
If mechanistic transparency barely works and/or is super expensive, then presumably this law doesn’t look very good compared to other potential laws that prevent the building of powerful AI, so you’d think that marginal changes in how good we are at mechanistic transparency would do basically nothing (unless you’ve got the hope of ‘crossing the threshold’ to the point where this law becomes the most viable such law).
The other bullet points make sense though.
The costs of mechanistic transparency
I guess I don’t understand why linear scaling would imply this—in fact, I’d guess that training should probably be super-linear, since each backward pass takes linear time, but the more neurons you have, the bigger the parameter space, and so the greater number of gradient steps you need to take to reach the optimum, right?
At any rate, I agree that 100x cost is probably somewhat too expensive. If that estimate comes from OpenAI’s efforts to understand image recognition, I think it’s too high, since we presumably learned a bunch about what to look for from their efforts. I also think you’re underweighting the benefits of having a better theory of how effective cognition is structured. Responding to your various bullet points:
I can’t think of any way around the fact that this will likely make the work harder. Ideally it would bring incidental benefits, though (once you understand new super-human concepts you can use them in other systems).
Once you have a model of a module such that if the module worked according to your model things would be fine, you can just train the module to better fit your model. Hopefully by re-training the modules independently, to the extent you have errors they’re uncorrelated and result in reduced performance rather than catastrophic failure.
I think this is a minor benefit. In most domains, specialists will understand the meanings of input data to their systems: I can’t think of any counterexamples, but perhaps you can. Then, once you understand the initial modules, you can understand their outputs in terms of their inputs, and by recursion it seems like you should be able to understand the inputs and outputs of all modules.
This paper on scaling laws for training language models seems like it should help us make a rough guess for how training scales. According to the paper, your loss in nats if you’re only limited by cost C scales as C^(-0.05), and if you’re only limited by number of parameters N it scales as N^(-0.08). If we can equate those in the limit, which is not at all obvious to me, that suggests that cost goes as number of parameters to the 1.6 power (since 0.08 / 0.05 = 1.6), and number of parameters itself is polynomial in the number of neurons. So, the comprehension can be a little polynomial in the number of neurons, but it certainly can’t be exponential.
Yup, that seems like a pretty reasonable estimate to me.
Note that my default model for “what should be the input to estimate difficulty of mechanistic transparency” would be the number of parameters, not number of neurons. If a neuron works over a much larger input (leading to more parameters), wouldn’t that make it harder to mechanistically understand?
Yeah, that’s plausible. This does mean the mechanistic transparency cost could scale sublinearly w.r.t. compute cost, though I doubt it (for the other reasons I mentioned).
Nah, I just pulled a number out of nowhere. The estimate based on existing efforts would be way higher. Back of the envelope: it costs ~$50 to train on ImageNet (see here). Meanwhile, there have been probably around 10 person-years spent on understanding one image classifier? At $250k per person-year, that’s $2.5 million on understanding, making it 50,000x more expensive to understand it than to train it.
Things that would move this number down:
Including the researcher time in the cost to train on ImageNet. I think that we will soon (if we haven’t already) enter the regime where researcher cost < compute cost, so that would only change the conclusion by a factor of at most 2.
Using the cost for an unoptimized implementation, which would probably be > $50. I’d expect those optimizations to already be taken for the systems we care about—it’s way more important to get a 2x cost reduction when your training run costs $100 million than when your training run costs under $1000.
Including the cost of hyperparameter tuning. This also seems like something we can hold to no more than a factor of 2, e.g. by using population-based training of hyperparameters.
Including the cost of data collection. This seems important: future data collection will probably be very expensive (even if simulating, there’s the compute cost of the simulation), but idk how to take it into account. Maybe decrease the estimate by a factor of 10?
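The back-of-the-envelope numbers above can be written out as a short sketch. The dollar figures are the ones from the comment. Treating the adjustments as independent factors that multiply together is an assumption made here for illustration, and the unoptimized-implementation item is left out because no factor was given for it.

```python
# Figures taken from the comment above.
TRAIN_COST_USD = 50                 # rough cost to train on ImageNet
PERSON_YEARS = 10                   # guessed effort spent understanding one classifier
COST_PER_PERSON_YEAR_USD = 250_000

understanding_cost = PERSON_YEARS * COST_PER_PERSON_YEAR_USD
baseline_ratio = understanding_cost / TRAIN_COST_USD

# Upper bounds on each adjustment, as stated in the list above.
# Assumption: the factors are independent and combine multiplicatively.
ADJUSTMENTS = {
    "researcher time in training cost": 2,
    "hyperparameter tuning": 2,
    "data collection": 10,
}


def adjusted_ratio(ratio: float, factors) -> float:
    """Divide the understanding-to-training cost ratio by each factor in turn."""
    for factor in factors:
        ratio /= factor
    return ratio


adjusted = adjusted_ratio(baseline_ratio, ADJUSTMENTS.values())

if __name__ == "__main__":
    print(f"understanding cost: ${understanding_cost:,}")
    print(f"baseline ratio:     {baseline_ratio:,.0f}x")
    print(f"adjusted ratio:     {adjusted:,.0f}x")
```

Even with every adjustment taken at its maximum, the ratio comes out around 1,250x under these assumptions, which is still well above the 100x figure discussed earlier.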
You could also just use the model, if it’s fast. It would be interesting to see how well this works. My guess is that abstractions are leaky because there are no good non-leaky abstractions, which would predict that this doesn’t work very well.
I think this is basically just the same point as “the problem gets harder when the AI system is superhuman”, except the point is that the AI system becomes superhuman way faster on domains that are not native to humans, e.g. DNA, drug structures, protein folding, math intuition, relative to domains that are native to humans, like image classification.
Do we mechanistically understand laptops?
So, I don’t think I’m saying that you have to mechanistically understand how every single gate works—rather, that you should be able to understand intermediate-level sub-systems and how they combine to produce the functionality of the laptop. The understanding of the intermediate-level sub-systems has to be pretty complete, but probably need not be totally complete—in the laptop case, you can just model a uniform random error rate and you’ll be basically right, and I imagine there should be something analogous with neural networks. Of course, you need somebody to be in charge of understanding the neurons in order to build to your understanding of the intermediate-level sub-systems, but it doesn’t seem to me that there needs to be any single person who understands all the neurons entirely—or indeed even any single person who needs to understand all the intermediate-level sub-systems entirely.
I think I should not have used the laptop example, it’s not really communicating what I meant it to communicate. I was trying to convey “mechanistic transparency is hard” rather than “mechanistic transparency requires a single person to understand everything”.
I guess I still don’t understand why you believe mechanistic transparency is hard. The way I want to use the term, as far as I can tell, laptops are acceptably mechanistically transparent to the companies that create them. Do you think I’m wrong?
No, which is why I want to stop using the example.
(The counterfactual I was thinking of was more like “imagine we handed a laptop to 19th-century scientists, can they mechanistically understand it?” But even that isn’t a good analogy, it overstates the difficulty.)
Could you clarify why this is an important counterpoint? It seems obviously useful to understand mechanistic details of a laptop in order to debug it. You seem to be arguing the [ETA: weaker] claim that nobody understands an entire laptop “all at once”, as in, they can understand all the details in their head simultaneously. But such an understanding is almost never possible for any complex system, and yet we still try to approach it. So this objection could show that mechanistic transparency is hard in the limit, but it doesn’t show that mechanistic transparency is uniquely bad in any sense. Perhaps you disagree?
weaker claim?
This seems to be assuming that we have to be able to take any complex trained AGI-as-a-neural-net and determine whether or not it is dangerous. Under that assumption, I agree that the problem is itself very hard, and mechanistic transparency is not uniquely bad relative to other possibilities.
But my point is that because it is so hard to detect whether an arbitrary neural net is dangerous, you should be trying to solve a different problem. This only depends on the claim that mechanistic transparency is hard in an absolute sense, not a relative sense (given the problem it is trying to solve).
Relatedly, from Evan Hubinger:
All of the other stories for preventing catastrophe that I mentioned in the grandparent are tackling a hopefully easier problem than “detect whether an arbitrary neural net is dangerous”.
Oops yes. That’s the weaker claim, which I agree with. The stronger claim is that because we can’t understand something “all at once”, mechanistic transparency is too hard and so we shouldn’t take Daniel’s approach. But the way we understand laptops is also in a mechanistic sense. No one argues that because laptops are too hard to understand all at once, we shouldn’t try to understand them mechanistically.
I didn’t assume that. I objected to the specific example of a laptop as an instance of mechanistic transparency being too hard. Laptops are normally understood well because understanding can be broken into components and built up from abstractions. But our understanding of each component and abstraction is pretty mechanistic, and this understanding is useful.
Furthermore, because laptops did not fall out of the sky one day, but were instead slowly built up over successive years of research and development, they seem like a great example of how Daniel’s mechanistic transparency approach does not rely on us having to understand arbitrary systems. Just as we built up an understanding of laptops, presumably we could do the same with neural networks. This was my interpretation of why he is using Zoom In as an example.
Indeed, but I don’t think this was the crux of my objection.
Okay, I think I see the miscommunication.
The story you have is “the developers build a few small neural net modules that do one thing, mechanistically understand those modules, then use those modules to build newer modules that do ‘bigger’ things, and mechanistically understand those, and keep iterating this until they have an AGI”. Does that sound right to you? If so, I agree that by following such a process the developer team could get mechanistic transparency into the neural net the same way that laptop-making companies have mechanistic transparency into laptops.
The story I took away from this post is “we do end-to-end training with regularization for modularity, and then we get out a neural net with modular structure. We then need to understand this neural net mechanistically to ensure it isn’t dangerous”. This seems much more analogous to needing to mechanistically understand a laptop that “fell out of the sky one day” before we had ever made a laptop.
My critiques are primarily about the second story. My critique of the first story would be that it seems like you’re sacrificing a lot of competitiveness by having to develop the modules one at a time, instead of using end-to-end training.
You could imagine a synthesis of the two stories: train a medium-level smart thing end-to-end, look at what all the modules are doing, and use those modules when training a smarter thing.