In the rocket example, procedures A and B can both be optimized either by random sampling or by local search. A is optimizing some hand-coded rocket specifications, while B is optimizing a complicated human approval model.
The problem with A is that it relies on human hand-coding. If we put in the wrong specifications and the output is extremely optimized, there are two possible cases: either we recognize that the rocket wouldn’t work and don’t approve it, or we think it looks good but are probably wrong, and the rocket doesn’t work.
On the upside, if we successfully hand-code what a rocket should be, the procedure will output working rockets.
The problem with B is that it’s simply the wrong thing to optimize if you want a working rocket. And because it’s modeling the environment and trying to find an output that makes the environment-model do something specific, you’ll get bad agent-like behavior.
Let’s go back and take a closer look at case A. Suppose you have the wrong rocket specifications, but they’re “pretty close” in some sense. Maybe the most spec-friendly rocket doesn’t function, but the top 0.01% of designs by the program are mostly in the top 1% of rockets ranked by your approval.
The programmed goal is proxy #1. Then you look through some of the top 0.01% designs (sampled either randomly or through local search) for something you think will fly. Your approval is proxy #2. Your goal is the rocket working well.
What you’re really hoping for in designing this system is that even if proxy #1 and proxy #2 are both misaligned, their overlap or product is more aligned—more likely to produce an actual working rocket—than either alone.
This makes sense, especially under the model of proxies as “true value + noise,” but to the extent that model is violated, this may not work out.
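To make the “true value + noise” picture concrete, here’s a minimal simulation sketch. Everything in it is invented for illustration (Gaussian true values, independent unit-variance proxy errors, a top-0.01% shortlist), but under those assumptions the two-stage pipeline does beat maximizing either proxy alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_trial(n=100_000):
    true_value = rng.normal(size=n)           # how well each design actually flies
    proxy1 = true_value + rng.normal(size=n)  # hand-coded spec score, noisy
    proxy2 = true_value + rng.normal(size=n)  # human approval, independent errors
    # Stage 1: the program keeps its top 0.01% of designs by the spec (proxy #1).
    shortlist = np.argsort(proxy1)[-max(1, n // 10_000):]
    # Stage 2: the human approves the best-looking design on the shortlist (proxy #2).
    combined = shortlist[np.argmax(proxy2[shortlist])]
    return (true_value[combined],
            true_value[np.argmax(proxy1)],    # optimizing proxy #1 alone
            true_value[np.argmax(proxy2)])    # optimizing proxy #2 alone

combined, spec_alone, approval_alone = np.mean([one_trial() for _ in range(100)], axis=0)
print(f"mean true value -- combined: {combined:.2f}, "
      f"spec alone: {spec_alone:.2f}, approval alone: {approval_alone:.2f}")
```

The combined pick wins here because the two proxies’ errors are independent, so a design that scores well on both is more likely to be genuinely good; make the errors correlated or heavy-tailed and the advantage shrinks or disappears, which is exactly the “model is violated” case.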
This is another way of seeing what’s wrong with case B. Case B just purely optimizes proxy #2, when the whole point of having human approval is to try to combine human approval with some different proxy to get better results.
As for local search vs. random sampling, this is a question about the landscape of your optimized proxy and how it compares to the true value; neither method is going to be better literally 100% of the time.
If we imagine local optimization like water flowing downhill in the U.S., then given a random starting point, the water is much more likely to end up at the mouth of the Mississippi River than in Death Valley, even though Death Valley is below sea level. The Mississippi just has a broad network of similar states that lead into it via local optimization, whereas Death Valley is a “surprising” optimum. Under random sampling, by contrast, equal areas are equally likely to be found, whether at the mouth of the Mississippi or in Death Valley.
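Here’s a toy sketch of that picture. The landscape is made up (a broad parabolic basin plus one very narrow, deeper notch), and “local search” here is greedy hill climbing, but it shows the asymmetry: local search from a random start almost always ends in the broad basin, while best-of-N random sampling returns the narrow global optimum once N is large enough to hit it:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Toy 1-D landscape, lower is better: a broad parabolic basin bottoming
    out at x = 3, plus a very narrow but deeper notch at x = 9.95
    (the 'Death Valley' optimum)."""
    return (x - 3.0) ** 2 / 50.0 - 2.0 * np.exp(-((x - 9.95) ** 2) / 0.001)

def hill_climb(x, step=0.05, iters=3000):
    """Greedy local search: accept small random moves whenever they decrease f."""
    for _ in range(iters):
        cand = np.clip(x + rng.uniform(-step, step), 0.0, 10.0)
        if f(cand) < f(x):
            x = cand
    return x

# Local search from 1,000 random starting points.
ends = np.array([hill_climb(x0) for x0 in rng.uniform(0, 10, size=1000)])
print("fraction ending in the narrow notch:", np.mean(ends > 9.8))    # roughly 0.01-0.02
print("fraction ending in the broad basin: ", np.mean(np.abs(ends - 3) < 0.3))

# Random sampling: keep the single best of 100,000 uniform draws.
samples = rng.uniform(0, 10, size=100_000)
print("best random sample is at x =", samples[np.argmin(f(samples))])  # lands near 9.95
```

Local search only finds the notch when it happens to start in the notch’s tiny catchment, while the best-of-N sampler returns it because it is the global optimum; flip the depths and the comparison flips too.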
Applying this to rockets, I would actually expect local search to produce much safer results in case B. Working rockets probably have broad basins of similar almost-working rockets that feed into them in configuration-space, whereas the rocket that spells out a message to the experimenter is quite a bit more fragile to perturbations.
(Even if rockets are so complicated and finicky that we expect almost-working rockets to be rarer than convincing messages to the experimenter, we still might think that the gradient landscape makes gradient descent relatively better.)
In case A, I would expect much less difference between locally optimizing proxy #1 and sampling until it was satisfied. The difference for human approval came because we specifically didn’t want to find the unstable, surprising maxima of human approval. And maybe the same is true of our hand-coded rocket specifications, but I would expect this to be less important.