That’s a good point: it definitely pushes in the direction of making the model’s internals harder to adversarially attack. I do wonder how accessible “encrypted” is here versus just “actually robust” (which is what I’m hoping for in this approach). The intuition here is that you want your model to be able to identify that a rogue thought like “kill people” is not a thing to act on, and that looks like being robust.
I haven’t thought about this a lot, but “encrypted” could just mean “just beyond the capabilities of the Surgeon to identify”. So the gradient could be moving in a direction away from “easily identifiable early deceptive circuits” instead of “deception”, and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?
I think the surgeon can always be made ~arbitrarily powerful, and the trick is making it not too powerful/trivially powerful (in ways that e.g. preclude the model from performing well despite the surgeon’s interference).
So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?
Given that we want the surgeon to be of bounded size (if we’re using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn’t seem obvious to me.
That’s a good point: it definitely pushes in the direction of making the model’s internals harder to adversarially attack. I do wonder how accessible “encrypted” is here versus just “actually robust” (which is what I’m hoping for in this approach). The intuition here is that you want your model to be able to identify that a rogue thought like “kill people” is not a thing to act on, and that looks like being robust.
I haven’t thought about this a lot, but “encrypted” could just mean “just beyond the capabilities of the Surgeon to identify”. So the gradient could be moving in a direction away from “easily identifiable early deceptive circuits” instead of “deception”, and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?
That’s definitely a thing that can happen.
I think the surgeon can always be made ~arbitrarily powerful, and the trick is making it not too powerful/trivially powerful (in ways that e.g. preclude the model from performing well despite the surgeon’s interference).
So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?
Given that we want the surgeon to be of bounded size (if we’re using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn’t seem obvious to me.