Instead of “Goodharting”, I like the potential names “Positive Alignment” and “Negative Alignment.”
“Positive Alignment” means that the motivated party changes their actions in ways the incentive creator likes. “Negative Alignment” means the opposite.
Whenever incentives are offered to people or agents, there will likely be cases of both Positive Alignment and Negative Alignment. What matters is whether the net effect is positive or negative.
“Goodharting” is fairly vague and typically refers only to the “Negative Alignment” portion.
I’d expect this to make some discussions clearer. “Will this new incentive be goodharted?” → “Will this incentive lead to Net-Negative Alignment?”
Other Name Options
Claude 3.7 recommended other naming ideas like:
Intentional vs Perverse Responses
Convergent vs Divergent Optimization
True-Goal vs Proxy-Goal Alignment
Productive vs Counterproductive Compliance