habryka comments on What’s Up With Confusingly Pervasive Goal Directedness?

habryka 23 Jan 2022 19:15 UTC
2 points
When you say “costly to replace”, this is with respect to what cost function? Do you have in mind the system’s original training objective, or something else?
If you have an original cost function F(x) and an approval cost A(x), you can minimize F(x) + c * A(x), increasing the weight c until it pays enough attention to A(x). For an appropriate choice of c, this is (approximately) equivalent to asking “Find the most approved policy such that F(x) is below some threshold”. More generally, varying c will trace out the Pareto boundary between F and A.
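To make the weighted-sum idea concrete, here is a minimal sketch, assuming a small finite set of candidate policies that are each summarized by a made-up (F, A) cost pair. The candidate values, the weights, and the function names are all hypothetical and chosen only for illustration. Sweeping c from small to large picks out progressively more approved policies at the price of a higher F.

```python
from typing import List, Sequence, Tuple

Costs = Tuple[float, float]  # (F(x), A(x)) for one candidate policy x


def scalarized_argmin(candidates: Sequence[Costs], c: float) -> Costs:
    """Return the candidate minimizing F(x) + c * A(x)."""
    return min(candidates, key=lambda fa: fa[0] + c * fa[1])


def trace_frontier(candidates: Sequence[Costs], weights: Sequence[float]) -> List[Costs]:
    """Collect the distinct minimizers found as the weight c is swept."""
    frontier: List[Costs] = []
    for c in weights:
        best = scalarized_argmin(candidates, c)
        if best not in frontier:
            frontier.append(best)
    return frontier


# Hypothetical (F, A) pairs; (6.0, 6.0) is dominated by (2.0, 5.0) and (4.0, 2.0).
candidates = [(1.0, 9.0), (2.0, 5.0), (4.0, 2.0), (7.0, 1.0), (6.0, 6.0)]
weights = [0.0, 0.5, 1.0, 4.0, 10.0]

frontier = trace_frontier(candidates, weights)
for f_cost, a_cost in frontier:
    print(f"F = {f_cost}, A = {a_cost}")
```

In this toy example the dominated candidate is never selected for any c. One caveat on the "(approximately)": a weighted sum only recovers the points on the convex part of the boundary, so a Pareto-optimal policy sitting in a concave region could be missed by this kind of sweep.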
I was talking about “costly” in terms of computational resources. Of course, if I have a system that gets the right answer in 1 in 100,000,000 cases, and I have a way to efficiently tell when it gets the right answer, then I can get it to almost always give me the right answer by just running it a billion times. But that will also take a billion times longer.
In practice, I expect that in most situations where you have the combination of “In one in a billion cases I get the right answer and it costs me $1 to compute an answer” and “I can tell when it gets the right answer”, you won’t get to a point where you can compute a right answer for anything close to $1.
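The arithmetic behind the rerun-and-verify argument can be sketched as follows, assuming each run succeeds independently with the same probability p and that checking an answer is free. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def prob_at_least_one_success(p: float, n: int) -> float:
    """P(at least one of n independent runs succeeds) = 1 - (1 - p)^n.

    Uses log1p/expm1 so that a tiny p does not get lost to rounding.
    """
    return -math.expm1(n * math.log1p(-p))


def runs_for_confidence(p: float, target: float) -> int:
    """Smallest n such that the chance of at least one success is >= target."""
    return math.ceil(math.log1p(-target) / math.log1p(-p))


def expected_cost(p: float, cost_per_run: float) -> float:
    """Expected spend until the first verified success (geometric distribution)."""
    return cost_per_run / p


# Numbers from the comment: success in 1 of 100,000,000 runs, a billion runs.
print(prob_at_least_one_success(1e-8, 10**9))
print(runs_for_confidence(1e-8, 0.99))
# One-in-a-billion success at $1 per run.
print(expected_cost(1e-9, 1.0))
```

Under these assumptions the expected spend per verified right answer is the per-run cost divided by p, so brute-force resampling scales the cost by roughly 1/p rather than leaving it anywhere near the per-run price.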