I think the surgeon can always be made ~arbitrarily powerful; the trick is keeping it from being too powerful or trivially powerful (e.g. in ways that preclude the model from performing well despite the surgeon’s interference).
So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?
Given that we want the surgeon to be of bounded size (if we’re using a neural net implementation, which seems likely to me), can it still be arbitrarily powerful? That doesn’t seem obvious to me.
That’s definitely a thing that can happen.