Ah, yeah, that’s true, there’s not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.
It’s a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it’s just the opportunity to have done less. For example, maybe I’m trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don’t mind getting it, but then I’m potentially left with regret that I put in too much effort.
Most examples I can think of that look like potential anti-Goodharting seem the same: I don’t mind that I overshot the target, but I do mind that I wasn’t as efficient as I could have been.
That test / class example isn’t even a case, because the test is instrumental to the goal; it’s not a metric. Your U in this case is “time spent studying,” which you accurately see will be uncorrelated with “graduating” if all students (or all counterfactual “you”s) attempt to optimize it.