While at a recent CFAR workshop with Scott, Peter Schmidt-Nielsen and I wrote some code to run experiments of the form that Scott is talking about here. If anyone is interested, the code can be found here, though I’ll also try to summarize our results below.
Our methodology was as follows:
1. Generate a real utility function V : [0,1]^10 → [0,1] by randomly initializing a feed-forward neural network with 3 hidden layers of 10 neurons each and tanh activations, then training it using 5000 steps of gradient descent with a learning rate of 0.1 on a set of 1023 uniformly sampled data points. The reason we pre-train the network on random data is that we found that randomly initialized networks tended to be very similar to each other and very smooth, making them very easy for the proxy network to learn, whereas networks trained on random data were significantly more variable.
2. Generate a proxy utility function U : [0,1]^10 → [0,1] by training a randomly initialized neural network with the same architecture as the real network on 50 uniformly sampled points from the real utility, using 1000 steps of gradient descent with a learning rate of 0.1.
3. Fix μ to be uniform sampling.
4. Let μ̂ be uniform sampling followed by 50 steps of gradient descent on the proxy network with a learning rate of 0.1.
5. Sample 1,000,000 points from μ, then optimize those same points according to μ̂. Create buckets of radius 0.01 utilons for all proxy utility values, and compute the real utility values for the points in each bucket from the μ set and the μ̂ set.
6. Repeat steps 1–5 ten times, then average the final real utility values per bucket and plot them. Furthermore, compute the “Goodhart error” as the real utility of the μ̂-optimized points minus the real utility of the μ-sampled points, plotted against their proxy utility values. (A rough sketch of this pipeline is given below the list.)
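For concreteness, here is a minimal PyTorch sketch of steps 1–5. This is not the actual code linked above; the sigmoid output layer, the clamping of optimized points to [0,1]^10, and the reduced point count are my assumptions to keep the sketch short and runnable:

```python
# Minimal sketch of the pipeline above (PyTorch). Hyperparameters follow
# the post; the output activation and the clamping are assumptions.
import torch
import torch.nn as nn

DIM = 10

def make_net():
    # 3 hidden layers of 10 tanh units (step 1); sigmoid keeps outputs in [0, 1].
    return nn.Sequential(
        nn.Linear(DIM, 10), nn.Tanh(),
        nn.Linear(10, 10), nn.Tanh(),
        nn.Linear(10, 10), nn.Tanh(),
        nn.Linear(10, 1), nn.Sigmoid(),
    )

def fit(net, xs, ys, steps, lr=0.1):
    # Full-batch gradient descent on mean squared error.
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((net(xs).squeeze(-1) - ys) ** 2).mean().backward()
        opt.step()
    return net

# Step 1: the "real" utility V, pre-trained on random data to roughen it.
real = fit(make_net(), torch.rand(1023, DIM), torch.rand(1023), steps=5000)

# Step 2: the proxy utility U, trained on 50 samples from V.
xs = torch.rand(50, DIM)
with torch.no_grad():
    ys = real(xs).squeeze(-1)
proxy = fit(make_net(), xs, ys, steps=1000)

# Step 4: mu-hat follows the proxy's gradient from a uniform starting point.
def optimize_points(points, steps=50, lr=0.1):
    points = points.clone().requires_grad_(True)
    for _ in range(steps):
        grad, = torch.autograd.grad(proxy(points).sum(), points)
        with torch.no_grad():
            points += lr * grad       # step uphill on U
            points.clamp_(0.0, 1.0)   # stay inside the unit hypercube
    return points.detach()

# Step 5: bucket both point sets by proxy utility; average real utility per bucket.
base = torch.rand(100_000, DIM)       # 1,000,000 in the real experiment
opt_pts = optimize_points(base)
with torch.no_grad():
    for pts in (base, opt_pts):       # the mu set, then the mu-hat set
        bucket = (proxy(pts).squeeze(-1) / 0.01).long()
        v = real(pts).squeeze(-1)
        means = {i.item(): v[bucket == i].mean().item() for i in bucket.unique()}
```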
The plot generated by this process is given below:
As can be seen from the plot, the Goodhart error is fairly consistently negative, implying that the gradient-descent-optimized points perform worse on the real utility, conditional on the proxy utility.
However, using an alternative μ̂, we were able to reverse the effect. That is, we ran the same experiment, but instead of optimizing the proxy utility to be close to the real utility on the sampled points, we optimized the gradient of the proxy utility to be close to the gradient of the real utility on the sampled points. This resulted in the following graph:
As can be seen from the plot, the Goodhart error flipped and became positive in this case, implying that training the proxy on gradients did significantly better than training it on point values.
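A sketch of this gradient-matching variant, reusing the setup from the sketch above (the MSE loss on input gradients is an assumption; only the training target changes relative to step 2):

```python
# Train the proxy by matching its input-gradients to the real utility's,
# rather than matching its values (a guess at the loss form, not the
# original code).
def grad_of(net, xs):
    # d(net)/d(input) at each sample point; create_graph=True keeps the
    # graph so we can backprop through this gradient into net's parameters.
    xs = xs.clone().requires_grad_(True)
    g, = torch.autograd.grad(net(xs).sum(), xs, create_graph=True)
    return g

def fit_gradients(net, xs, target, steps=1000, lr=0.1):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    target_g = grad_of(target, xs).detach()   # fixed gradients of V
    for _ in range(steps):
        opt.zero_grad()
        ((grad_of(net, xs) - target_g) ** 2).mean().backward()
        opt.step()
    return net

proxy = fit_gradients(make_net(), xs, real)   # replaces step 2 above
```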
Finally, we also did a couple of checks to ensure the correctness of our methodology.
First, one concern was that our method of bucketing could be biased. To determine the degree of “bucket error”, we computed the average proxy utility for each bucket from the μ and μ̂ datasets and took the difference. This should be identically zero, since the buckets are generated based on proxy utility, while any deviation from zero would imply a systematic bias in the buckets. We did find a significant bucket error for large bucket sizes, but for our final bucket size of 0.01, we found a bucket error in the range of 0–0.01, which should be negligible.
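A sketch of this check, again reusing the setup above (the helper names are mine, not from the original code):

```python
# "Bucket error": within each proxy-utility bucket, the mean proxy value
# from the mu set and the mu-hat set should agree.
def mean_proxy_by_bucket(pts, width=0.01):
    with torch.no_grad():
        u = proxy(pts).squeeze(-1)
    bucket = (u / width).long()
    return {i.item(): u[bucket == i].mean().item() for i in bucket.unique()}

def bucket_error(pts_mu, pts_mu_hat, width=0.01):
    a = mean_proxy_by_bucket(pts_mu, width)
    b = mean_proxy_by_bucket(pts_mu_hat, width)
    # Only buckets populated by both distributions are comparable.
    return {i: a[i] - b[i] for i in a.keys() & b.keys()}
```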
Second, another check of our methodology was to generate μ̂ simply by sampling 100 random points and selecting the one with the highest proxy utility value. This should give exactly the same results as μ, since bucketing conditions on the proxy utility value, and indeed that is what we got.
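This selection-based μ̂ might look like the following (a sketch; the batched layout is an assumption):

```python
# Best-of-100 selection: for each output point, draw 100 uniform candidates
# and keep the one with the highest proxy utility.
def best_of_n(n_points, n=100):
    cand = torch.rand(n_points, n, DIM)
    with torch.no_grad():
        vals = proxy(cand).squeeze(-1)           # (n_points, n) proxy values
    return cand[torch.arange(n_points), vals.argmax(dim=1)]
```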
Note: I briefly tried a similar approach, albeit with polynomial functions with random coefficients rather than ANNs, and in R rather than Python, but couldn’t figure out how to say anything useful with it.
If this is of any interest, it is available here: https://gist.github.com/davidmanheim/5231e4a82d5ffc607e953cdfdd3e3939 (I also built simulations for bog-standard Goodhart)
I am unclear how much of my feeling that this approach is fairly useless reflects my lack of continued pursuit of building such models and figuring out what can be said, or my diversion to other, more fruitful work, rather than a fundamental difficulty of saying anything clear based on these types of simulations. I’d like to claim it’s the latter, but I’ll clearly note that this is heavily motivated reasoning.