I think your characterization of my position is a little off. I’m specifically making a heuristic argument against a certain kind of utility-function patching, whereas you seem to be emphasizing complexity in your version.
I think my claim is something like “hacking the utility function by modifying it in various ways seems similar to AI boxing, in that you face the problem of trying to anticipate how something smarter than you will approach what you think is an obstacle.”
I agree that a really good understanding could provide a solution.
However, I also suspect that any effective solution (and many ineffective solutions) which works by warping the utility function (adding penalties, etc.) will be “interpretable” as an epistemic state (i.e., as a change to the beliefs rather than to the utility function). And I suspect the good solutions correspond to beliefs which accurately describe critical aspects of the problem! E.g., there should just be a state of belief which a rational agent can be in which makes it behave corrigibly. I realize this claim has not been borne out by evidence thus far, however.
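To make the “penalty as epistemic state” reading concrete, here is a toy construction of my own (not from the discussion above, and assuming positive utilities with penalties no larger than an action’s expected utility): a penalty subtracted from an action’s expected utility behaves identically to a belief that the action carries some probability of leading to a zero-utility “disaster” outcome.

```python
# Toy sketch: a utility-function penalty re-read as a belief change.
# All names here ("disaster", the specific numbers) are illustrative.

# Outcome utilities (kept positive so the construction below is valid).
U = {"o1": 10.0, "o2": 4.0, "disaster": 0.0}

# Original beliefs: outcome distribution for each action.
p = {
    "safe":  {"o1": 0.2, "o2": 0.8},
    "risky": {"o1": 0.9, "o2": 0.1},
}

# A designer's patch: penalize the "risky" action by 3 utility units.
penalty = {"safe": 0.0, "risky": 3.0}

def expected_utility(dist):
    return sum(prob * U[o] for o, prob in dist.items())

def patched_eu(action):
    # The "warped utility function" agent: subtract the penalty directly.
    return expected_utility(p[action]) - penalty[action]

def epistemic_eu(action):
    # The same penalty expressed as a belief: with probability q the
    # action leads to the disaster outcome instead of its usual results.
    eu = expected_utility(p[action])
    q = penalty[action] / eu          # requires 0 <= penalty <= eu
    shifted = {o: (1 - q) * prob for o, prob in p[action].items()}
    shifted["disaster"] = q
    return expected_utility(shifted)

# The two agents assign identical expected utilities to every action,
# so they behave identically.
for a in p:
    assert abs(patched_eu(a) - epistemic_eu(a)) < 1e-9
```

The construction is only a special case, but it shows the flavor of the claim: the interesting question is whether the belief you recover this way accurately describes something real about the problem, or is just an arbitrary re-parameterization.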
There seem to be different ways you can modify the objective. Take the solution to the easy problem of wireheading: I think we’re comfortable saying there’s a solution because the AI is obviously grading the future before it happens. No matter how smart the AI is, grading the future before it happens is the obviously-better design. So, we say the problem is solved. On the other extreme is AI boxing, where you put a bunch of traffic cones in the way of a distant oncoming car and say, “there’s no way anyone could drive around this”!
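A minimal sketch of the contrast I have in mind, assuming the standard framing of the easy problem of wireheading (all the concrete names and numbers here are my own invention): an agent that grades predicted futures with its current utility-over-world-states never prefers tampering with its reward signal, while an agent that maximizes the predicted value of its reward register does.

```python
# Toy model: each action leads deterministically to a predicted future,
# described by a world-state feature and the reward register's value.
futures = {
    "make_paperclips": {"paperclips": 10, "reward_register": 10},
    "hack_register":   {"paperclips": 0,  "reward_register": 10**6},
}

def utility(future):
    # The agent's current utility function: it cares about the world
    # itself, not about the register that reports on the world.
    return future["paperclips"]

# Grading the future before it happens, with the current utility function:
planner_choice = max(futures, key=lambda a: utility(futures[a]))

# Grading by the predicted post-hoc reward signal instead:
wirehead_choice = max(futures, key=lambda a: futures[a]["reward_register"])

print(planner_choice)   # "make_paperclips"
print(wirehead_choice)  # "hack_register"
```

The point of the example is that the fix doesn’t depend on outsmarting the agent: however capable it is, grading predicted futures with the current utility function simply never makes register-hacking look attractive. That is the kind of solution that generalizes, unlike the traffic cones.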
So IIUC, you’re advocating trying to operate on beliefs rather than utility functions? But I don’t understand why.