I think my argument was more like “in the world where your modularity research works out perfectly, you get linear scaling, and then it still costs 100x to have a mechanistically-understood AI system relative to a black-box AI system, which seems prohibitively expensive”.
I guess I don’t understand why linear scaling would imply this. In fact, I’d guess that training should probably be super-linear: each backward pass takes linear time, but the more neurons you have, the bigger the parameter space, and so the more gradient steps you need to take to reach the optimum, right?
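To spell that intuition out (a rough decomposition only; the exponent α for how the number of steps grows is an assumption, not a measured value):

```latex
% Rough decomposition of training cost as a function of parameter count N.
% \alpha > 0 is an assumed exponent for how many gradient steps are needed.
\text{cost per step} \sim N, \qquad
\text{number of steps} \sim N^{\alpha}, \qquad
\text{total training cost} \sim N^{1+\alpha},
\quad \text{which is super-linear whenever } \alpha > 0.
```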
At any rate, I agree that 100x cost is probably somewhat too expensive. If that estimate comes from OpenAI’s efforts to understand image recognition, I think it’s too high, since we presumably learned a bunch about what to look for from their efforts. I also think you’re underweighting the benefits of having a better theory of how effective cognition is structured. Responding to your various bullet points:
Right now we’re working with subhuman AI systems where we already have concepts that we can use to understand AI systems; this will become much more difficult with superhuman AI systems.
I can’t think of any way around the fact that this will likely make the work harder. Ideally it would bring incidental benefits, though (once you understand new super-human concepts you can use them in other systems).
All abstractions are leaky; as you build up hierarchies of abstractions for mechanistically understanding a neural net, the problems with your abstractions can cause you to miss potential problems.
Once you have a model of a module such that if the module worked according to your model things would be fine, you can just train the module to better fit your model. Hopefully by re-training the modules independently, to the extent you have errors they’re uncorrelated and result in reduced performance rather than catastrophic failure.
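As a toy sketch of what “train the module to better fit your model” could look like: give the module an auxiliary loss that penalizes disagreement with the human-understandable model of it. Everything here (`module`, `abstract_model`, the weight `lam`) is hypothetical and just for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical setup: `module` is one trainable piece of a larger network, and
# `abstract_model` is our human-understandable model of what that module should compute.
module = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))

def abstract_model(x):
    # Stand-in for the mechanistic story we have of the module
    # (here: "it roughly computes a fixed linear projection of its input").
    return x @ torch.ones(64, 32) / 64

def transparency_regularized_loss(x, task_loss, lam=0.1):
    # The task loss keeps the module useful; the penalty term trains it to
    # better fit our model of it, as suggested above.
    out = module(x)
    fit_penalty = ((out - abstract_model(x)) ** 2).mean()
    return task_loss(out) + lam * fit_penalty

# Usage sketch: one gradient step on a random batch with a dummy task loss.
opt = torch.optim.Adam(module.parameters(), lr=1e-3)
x = torch.randn(16, 64)
opt.zero_grad()
loss = transparency_regularized_loss(x, task_loss=lambda out: out.pow(2).mean())
loss.backward()
opt.step()
```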
With image classifiers we have the benefit of images being an input mechanism we are used to; it will presumably be a lot harder with input mechanisms we aren’t used to.
I think this is a minor benefit. In most domains, specialists will understand the meanings of input data to their systems: I can’t think of any counterexamples, but perhaps you can. Then, once you understand the initial modules, you can understand their outputs in terms of their inputs, and by recursion it seems like you should be able to understand the inputs and outputs of all modules.
This paper on scaling laws for training language models seems like it should help us make a rough guess for how training scales. According to the paper, your loss in nats if you’re only limited by cost C scales as C^(-0.05), and if you’re only limited by number of parameters N it scales as N^(-0.08). If we can equate those in the limit, which is not at all obvious to me, that suggests that cost goes as number of parameters to the 1.6 power, and number of parameters is itself polynomial in the number of neurons. So, the cost of comprehension can be a low-degree polynomial in the number of neurons, but it certainly can’t be exponential.
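Writing out the arithmetic behind the 1.6 (under the not-at-all-obvious assumption, flagged above, that the two power laws can be equated in the limit):

```latex
% Equate the compute-limited and parameter-limited scaling laws in the limit:
L \sim C^{-0.05} \quad\text{and}\quad L \sim N^{-0.08}
\;\Rightarrow\; C^{-0.05} \sim N^{-0.08}
\;\Rightarrow\; C \sim N^{0.08/0.05} = N^{1.6}.
```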
Yup, that seems like a pretty reasonable estimate to me.
Note that my default model for “what should be the input to estimate difficulty of mechanistic transparency” would be the number of parameters, not number of neurons. If a neuron works over a much larger input (leading to more parameters), wouldn’t that make it harder to mechanistically understand?
I guess I don’t understand why linear scaling would imply this. In fact, I’d guess that training should probably be super-linear: each backward pass takes linear time, but the more neurons you have, the bigger the parameter space, and so the more gradient steps you need to take to reach the optimum, right?
Yeah, that’s plausible. This does mean the mechanistic transparency cost could scale sublinearly w.r.t. compute cost, though I doubt it (for the other reasons I mentioned).
If that estimate comes from OpenAI’s efforts to understand image recognition, I think it’s too high, since we presumably learned a bunch about what to look for from their efforts.
Nah, I just pulled a number out of nowhere. The estimate based on existing efforts would be way higher. Back of the envelope: it costs ~$50 to train on ImageNet (see here). Meanwhile, there have been probably around 10 person-years spent on understanding one image classifier? At $250k per person-year, that’s $2.5 million on understanding, making it 50,000x more expensive to understand it than to train it.
Things that would move this number down (I sketch the combined arithmetic right after this list):
Including the researcher time in the cost to train on ImageNet. I think that we will soon (if we haven’t already) enter the regime where researcher cost < compute cost, so that would only change the conclusion by a factor of at most 2.
Using the cost for an unoptimized implementation, which would probably be > $50. I’d expect those optimizations to already have been made for the systems we care about: it’s way more important to get a 2x cost reduction when your training run costs $100 million than when it costs under $1000.
Including the cost of hyperparameter tuning. This also seems like something we can keep to a factor of at most 2, e.g. by using population-based training of hyperparameters.
Including the cost of data collection. This seems important: future data collection will probably be very expensive (even if you simulate, there’s the compute cost of the simulation), but idk how to take it into account. Maybe decrease the estimate by a factor of 10?
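Putting the back-of-the-envelope estimate and the adjustments above in one place (every figure here is a rough guess from this thread, not a measurement):

```python
# Back-of-the-envelope ratio of "cost to mechanistically understand" to "cost to train",
# using the rough figures from the discussion above (all guesses, not measurements).

train_cost = 50                 # ~$ to train an ImageNet classifier
person_years = 10               # rough guess at effort spent understanding one image classifier
cost_per_person_year = 250_000  # $

understanding_cost = person_years * cost_per_person_year   # $2.5 million
ratio = understanding_cost / train_cost
print(f"base estimate: understanding is ~{ratio:,.0f}x more expensive than training")  # ~50,000x

# Adjustments that would move the number down (each at most the factor argued above):
researcher_time_factor = 2      # counting researcher time in the training cost
hyperparameter_factor = 2       # counting hyperparameter tuning (e.g. population-based training)
data_collection_factor = 10     # counting data collection (very uncertain)

# Most generous case: apply all the adjustments at once.
adjusted = ratio / (researcher_time_factor * hyperparameter_factor * data_collection_factor)
print(f"after adjustments: ~{adjusted:,.0f}x")  # ~1,250x
```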
Once you have a model of a module such that if the module worked according to your model things would be fine, you can just train the module to better fit your model.
You could also just use the model, if it’s fast. It would be interesting to see how well this works. My guess is that abstractions are leaky because there are no good non-leaky abstractions, which would predict that this doesn’t work very well.
I think this is a minor benefit. In most domains, specialists will understand the meanings of input data to their systems
I think this is basically just the same point as “the problem gets harder when the AI system is superhuman”, except that the AI system becomes superhuman much faster in domains that are not native to humans (e.g. DNA, drug structures, protein folding, math intuition) than in domains that are native to humans, like image classification.