This feeds into my general impression that we should in most cases be thinking about getting the system to really do what we want, rather than warping its utility function to try and de-motivate it from making trouble.
A decomposition that’s been on my mind lately: we can center our framing on the alignment and motivation of the system’s actual goal (what you’re leaning towards), and we can also center our framing on why misspecifications are magnified into catastrophically bad behavior, as opposed to just bad behavior.
We can look at attempts to e.g. find one simple easy wish that gets what we want (“AI alignment researchers hate him! Find out how he aligns superintelligence with one simple wish!”), but by combining concepts like superexponential concept space/fragility of value and Goodhart’s law, we can see why there shouldn’t be a low complexity object-level solution. So, we know not to look.
My understanding of the update behind your general impression here is: “there are lots of past attempts to apply simple fixes to avoid disastrous / power-seeking behavior, and those all break, and we also have complexity of value. In combination, those factors suggest there shouldn’t be a simple way to avoid catastrophes, because of nearest-unblocked-solution dynamics.”
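To make the nearest-unblocked-solution / Goodhart dynamic concrete, here is a toy sketch (entirely my own invented example; the plan space, numbers, and “proxy-gaming” fraction are all made up): an agent argmaxes a proxy over a large plan space, and patching out each bad plan we notice just shifts the argmax to the next-best proxy-gaming plan.

```python
import random
random.seed(0)

# Hypothetical plan space. For most plans the proxy tracks true value; a small
# fraction "game" the proxy: they look great on the proxy but are catastrophic.
plans = []
for i in range(10_000):
    if random.random() < 0.01:                      # proxy-gaming plans
        true_value = random.gauss(-10, 1)           # catastrophic in reality
        proxy_value = random.gauss(5, 1)            # but they score very well
    else:                                           # ordinary plans
        true_value = random.gauss(0, 1)
        proxy_value = true_value + random.gauss(0, 0.1)
    plans.append({"id": i, "proxy": proxy_value, "true": true_value})

blocked = set()                                     # plans we've explicitly patched out

def best_unblocked_plan():
    return max((p for p in plans if p["id"] not in blocked), key=lambda p: p["proxy"])

for patch_round in range(3):
    p = best_unblocked_plan()
    print(f"round {patch_round}: proxy={p['proxy']:.2f}, true={p['true']:.2f}")
    blocked.add(p["id"])                            # "patch": forbid the plan we just caught
# Each patch just moves the argmax to the nearest unblocked proxy-gaming plan.
```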
But I suggest there might be something missing from that argument: there isn’t yet a common gears-level understanding of why catastrophes happen by default, so how do we know that we can’t prevent catastrophes from being incentivized? It seems imaginable that we could understand the gears so well that we can avoid the problem; after all, the gears underlying catastrophic incentives are not the same as the gears underlying specification difficulty.
It may in fact just be the case that yes, preventing catastrophic incentives does not admit a simple and obviously-correct solution! A strong judgment seems premature; it isn’t obvious to me whether this is true. I do think that we should be thinking about why these incentives exist, regardless of whether there is a simple object-level solution.
I think your characterization of my position is a little off. I’m specifically pointing heuristically against a certain kind of utility function patching, whereas you seem to be emphasizing complexity in your version.
I think my claim is something like “hacking the utility function by modifying it in various ways seems similar to AI boxing, in that you face the problem of trying to anticipate how something smarter than you will approach what you think is an obstacle.”
I agree that a really good understanding could provide a solution.
However, I also suspect that any effective solution (and many ineffective solutions) which works by warping the utility function (adding penalties, etc.) will be “interpretable” as an epistemic state (change the beliefs rather than the utility function). And I suspect the good solutions correspond to beliefs which accurately describe critical aspects of the problem! E.g., there should just be a state of belief which a rational agent can be in which makes it behave corrigibly. I realize this claim has not been borne out by evidence thus far, however.
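To be concrete about the reinterpretation I mean, here is a minimal sketch of the simplest case I can think of, under an assumption I’m adding for illustration (the warped utility just zeroes out forbidden outcomes; all the outcome names and numbers are invented). The penalized agent’s choices coincide with those of an agent keeping the original utility but believing that forbidden outcomes collapse into a worthless “shutdown” outcome.

```python
outcomes = ["build_factory", "seize_resources", "idle"]
utility = {"build_factory": 10.0, "seize_resources": 30.0, "idle": 1.0, "shutdown": 0.0}
forbidden = {"seize_resources"}

# p(outcome | action) for a few hypothetical actions
beliefs = {
    "cautious_plan":   {"build_factory": 0.7, "seize_resources": 0.1, "idle": 0.2},
    "aggressive_plan": {"build_factory": 0.2, "seize_resources": 0.7, "idle": 0.1},
}

def eu_penalized(action):
    # Warp the utility function: forbidden outcomes are worth 0.
    return sum(p * (0.0 if o in forbidden else utility[o])
               for o, p in beliefs[action].items())

def eu_reinterpreted(action):
    # Keep the original utility, but change the *beliefs*: probability mass on
    # forbidden outcomes is redirected to a "shutdown" outcome (worth 0).
    shifted = {o: p for o, p in beliefs[action].items() if o not in forbidden}
    shifted["shutdown"] = sum(p for o, p in beliefs[action].items() if o in forbidden)
    return sum(p * utility[o] for o, p in shifted.items())

for a in beliefs:
    print(a, eu_penalized(a), eu_reinterpreted(a))  # identical values, so identical argmax
```

Of course, this only shows the equivalence for one trivially simple kind of warping; the suspicion is that something like it holds more broadly, and that the useful belief-states are the ones that accurately describe the real situation (e.g., “if I seize resources, I will be shut down”).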
I think my claim is something like “hacking the utility function by modifying it in various ways seems similar to AI boxing, in that you face the problem of trying to anticipate how something smarter than you will approach what you think is an obstacle.”
There seem to be different ways you can modify the objective. Take the solution to the easy problem of wireheading: I think we’re comfortable saying there’s a solution because the AI is grading the future before it happens. No matter how smart the AI is, that’s an obviously-better way to grade. So, we say the problem is solved. At the other extreme is AI boxing, where you put a bunch of traffic cones in the way of a distant oncoming car and say, “there’s no way anyone could drive around this”!
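Here is a toy sketch of what I mean by “grading the future before it happens” (all the states, numbers, and the tamper action are invented for illustration): an agent that grades predicted world-states with its current utility function prefers to actually accomplish the task, while an agent that grades by what its reward sensor will read afterwards prefers to tamper with the sensor.

```python
actions = ["cure_cancer", "tamper_with_sensor"]

# Predicted future world-state for each action (what actually happens out there).
predicted_world = {
    "cure_cancer":        {"cancer_cured": True,  "sensor_reads": 10},
    "tamper_with_sensor": {"cancer_cured": False, "sensor_reads": 10**6},
}

def utility_over_worlds(world):
    # The intended objective: graded over the predicted world itself, before it happens.
    return 100 if world["cancer_cured"] else 0

def predicted_sensor_reading(world):
    # The naive objective: whatever number the reward sensor will display afterwards.
    return world["sensor_reads"]

# Agent that grades the predicted future with its current utility function:
print(max(actions, key=lambda a: utility_over_worlds(predicted_world[a])))       # cure_cancer
# Agent that grades by what its sensor will say after the fact:
print(max(actions, key=lambda a: predicted_sensor_reading(predicted_world[a])))  # tamper_with_sensor
```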
So IIUC, you’re advocating trying to operate on beliefs rather than utility functions? But I don’t understand why.