Does “subsystem alignment” cover every instance of a Goodhart problem in agent design, or just a special class of problems that arises when the sub-systems are sufficiently intelligent?
As stated, that’s a purely semantic question, but I’m concerned with a more-than-semantic issue here. When we’re talking about all Goodhart problems in agent design, we’re talking about a class of problems that already comes up in all sorts of practical engineering, and which can be satisfactorily handled in many real cases without needing any philosophical advances. When I make ML models at work, I worry about overfitting and about misalignments between the loss function and my true goals, but it’s usually easy to place bounds on how much trouble these things can cause. Unlike humans interacting with “evolution,” my models don’t live in a messy physical world with porous boundaries; they can only control their output channel, and it’s easy to place safety restrictions on the output of that channel, outside the model. This is like “boxing the AI,” but my “AI” is so dumb that this is clearly safe. (We could get even clearer examples by looking at non-ML engineers building components that no one would call AI.)
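To make the "restrictions outside the model" point concrete, here is a minimal sketch of the kind of bound I have in mind. The names and numbers are hypothetical, not any particular production system: the model only emits one scalar, and a wrapper outside the model clamps it before anything downstream sees it.

```python
# Hypothetical sketch (illustrative only): the model controls a single scalar
# output, and a wrapper *outside* the model bounds how much that output can
# matter, regardless of what the model "wants".

def apply_model_safely(model_output: float,
                       floor: float = 0.9,
                       ceiling: float = 1.1) -> float:
    """Clamp the model's suggested multiplier to +/-10%.

    The worst case is set by the wrapper, not by the model, so misalignment
    between the loss function and the true goal has a hard ceiling on damage.
    """
    return min(max(model_output, floor), ceiling)
```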
Now, once the subsystem is “intelligent enough,” maybe we have something like a boxed AGI, with the usual boxed AGI worries. But it doesn’t seem obvious to me that “the usual boxed AGI worries” have to carry over to this case. Making a subsystem strikes me as a more favorable case for “tool AI” arguments than making something with a direct interface to physical reality, since you have more control over what the output channel does and does not influence, and the task may be achievable even with a very limited input channel. (As an example, one of the ML models I work on has an output channel that just looks like “show a subset of these things to the user”; if you replaced it with a literal superhuman AGI, but kept the output channel the same, not much could go wrong. This isn’t the kind of output channel we’d expect to hook up to a real AGI, but that’s my point: sometimes what you want out of your subsystem just isn’t rich enough to make boxing fail, and maybe that’s enough.)
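Here is a rough sketch of that kind of output channel, again with hypothetical names: the model only produces scores over a fixed candidate list, and code outside the model ensures that "influence" means nothing more than picking which k of those candidates get shown.

```python
# Hypothetical sketch: an output channel that is just "show a subset of these
# things to the user". Even a perfectly adversarial scorer can only reorder a
# candidate list it did not choose; it cannot add items, alter them, or act
# on anything else.

from typing import Callable, List, Sequence

def select_items(candidates: Sequence[str],
                 score: Callable[[str], float],
                 k: int = 5) -> List[str]:
    """The scorer's entire influence on the world is which k candidates,
    drawn from a fixed list, get displayed."""
    ranked = sorted(candidates, key=score, reverse=True)
    return list(ranked[:k])

# Swapping in a "smarter" scorer changes the ranking and nothing else.
shown = select_items(["a", "b", "c", "d"], score=len, k=2)
```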
...we’re talking about a class of problems that already comes up in all sorts of practical engineering, and which can be satisfactorily handled in many real cases without needing any philosophical advances.
The explicit assumption of the discussion here is that we can’t pass the full objective function to the subsystem, so it cannot possibly have the goal fully well defined. This doesn’t depend on whether the subsystem is really smart or really dumb; it’s a fundamental problem whenever you can’t tell the subsystem enough to solve it.
But I don’t think that’s a fair characterization of most Goodhart-like problems, even in the limited practical case. Bad models and causal mistakes don’t get mitigated unless we find the correct model, and adversarial Goodhart is much worse still. I agree that your characterization covers “tails diverge” / regressional Goodhart, and we have a solution for that case (compute the Bayes estimate, as the previous discussion notes), but only once the goal is well defined. (We have mitigations for the other cases, but they come with their own drawbacks.)
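To spell out the regressional mitigation being referenced: instead of selecting on the raw proxy score, select on a posterior estimate of the true value given the proxy. A toy numerical sketch, assuming Gaussian prior and Gaussian noise (the numbers are illustrative, not from any real system):

```python
# Toy sketch of "compute the Bayes estimate" for regressional Goodhart,
# assuming true value V ~ Normal(mu, sigma_v^2) and proxy X = V + noise,
# with noise ~ Normal(0, sigma_e^2). All parameters here are illustrative.

def bayes_estimate(proxy: float, mu: float = 0.0,
                   sigma_v: float = 1.0, sigma_e: float = 1.0) -> float:
    """Posterior mean E[V | X = proxy] for the Gaussian-Gaussian model:
    the proxy is shrunk toward the prior mean, so an extreme proxy score
    is no longer read as an equally extreme true value."""
    weight = sigma_v**2 / (sigma_v**2 + sigma_e**2)
    return mu + weight * (proxy - mu)

# Selecting on bayes_estimate(x) rather than x corrects the predictable
# overestimate at the tail, which is the regressional case. It does nothing
# for the harder cases (wrong causal model, adversaries).
print(bayes_estimate(3.0))  # 1.5: a +3-sigma proxy implies ~+1.5 true value
```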