I think my claim is something like “hacking the utility function by modifying it in various ways seems similar to AI boxing, in that you face the problem of trying to anticipate how something smarter than you will approach what you think is an obstacle.”
There seem to be different ways you can modify the objective. Take the solution to the easy problem of wireheading: I think we're comfortable saying there's a solution because the AI is obviously grading the future before it happens. No matter how smart the AI is, grading the future before it happens is an obviously-better way to grade, so we say the problem is solved. On the other extreme is AI boxing, where you put a bunch of traffic cones in the way of a distant oncoming car and say, "there's no way anyone could drive around this!"