EDIT: Nuance of course being impossible, this no doubt comes off as rude—and is in turn a reaction to an internet-distorted version of what you actually wrote. Oh well, grain of salt and all that.
> The way you get safety by design is understanding what’s going on inside the neural networks.
This is equivocation. There are some properties of what’s going on inside a NN that are crucial to reasoning about its safety properties, and many, many more that are irrelevant.
I’m actually strongly reminded of a recent comment about LK-99, where someone remarked that a good way to ramp up production of superconductors would be to understand how superconductors work, because then we could design one that’s easier to mass-produce.
Except:
- What we normally think of as “understanding how superconductors work” is not a sure thing; it’s hard, and sometimes we never find satisfactory models.
- Even if we understand how superconductors work, designing new ones with economically useful properties is an independent problem that’s also hard and that it’s possible to fail at for decades.
- There are many other ways to make progress in discovering superconductors and ramping up their production. These ways are sometimes purely phenomenological, and sometimes rely on building some understanding of the superconductor that’s a different type of model than what we typically mean by “understanding how superconductors work.”
It might sound good to say “we’ll understand how NNs work, and then use that to design safe ones,” but I think the problems are analogous. What we normally think of as “understand how NNs work,” especially in the context of mech interp, is a very specific genre of understanding—it’s not omniscience, it’s the ability to give certain sorts of mechanistic explanations for canonical explananda. And then using that understanding to design safe AI is an independent problem not solved just by solving the first one. Meanwhile, there are other ways to reason about the safety of AI (e.g. statistical arguments about the plausibility of gradient hacking) that use “understanding,” but not of the mech interp sort.
Yes, blue-sky research is good. But we can simultaneously use our brains about what sorts of explanations we think are promising to find. Understanding doesn’t just go into a big bucket labeled “Understanding” from which we draw to make things happen. If I’m in charge of scaling up superconductor production, and I say we should do less micro-level explanation and more phenomenology, telling me about the value of blue-sky research is the “wrong type of reasoning.”