I think it depends almost entirely on the shape of V and W.
In order to do gradient descent, you need a function which is continuous and differentiable. So W can’t be noise in the traditional regression sense (independent and identically distributed for each individual observation), because that’s not going to be differentiable.
If W has lots of narrow, spiky local maxima with broad bases, then gradient descent is likely to find those local maxima, while random sampling rarely hits them. In this case, fake wins are likely to outnumber real wins in the gradient descent group, but not the random sampling group.
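To put a toy number on "narrow peak, broad base" (the Lorentzian bump shape and its width below are arbitrary choices of mine, not anything from the experiment): a uniform random sample almost never lands near the top of such a spike, but it lands somewhere the slope points uphill toward the spike far more often.

```python
import numpy as np

# Toy 1-D spike: narrow peak, broad (heavy-tailed) base.  The Lorentzian shape
# and the width 0.02 are arbitrary illustrative choices.
def bump(x, width=0.02):
    return 1.0 / (1.0 + (x / width) ** 2)

x = np.linspace(-1.0, 1.0, 200_001)
w = bump(x)
dw = np.gradient(w, x)

# Chance a uniform random sample lands near the top of the spike, versus the chance
# it lands where the slope is big enough that gradient ascent would start climbing.
print("near the top (W > 0.9):       ", np.mean(w > 0.9))
print("in the basin (|dW/dx| > 0.01):", np.mean(np.abs(dw) > 0.01))
```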
More generally, if U = V + W, then dU/dx = dV/dx + dW/dx. If V’s gradient is typically bigger than W’s gradient, gradient descent will mostly pay attention to V; the reverse is true if W’s gradient is typically bigger.
But even if W’s gradient typically exceeds V’s gradient, U’s gradient will still correlate with V’s, assuming dV/dx and dW/dx are uncorrelated. (cov(dU, dV) = cov(dV+dW, dV) = cov(dV, dV) + cov(dW, dV) = cov(dV, dV).)
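A quick numerical sanity check of that covariance point (the normal distributions and the 1-vs-10 scales below are made-up stand-ins for "samples of dV/dx and dW/dx at random points"):

```python
import numpy as np

# Even if W's gradient is 10x bigger than V's on average, the gradient of U = V + W
# still correlates (weakly but positively) with V's gradient, provided the two are
# uncorrelated.
rng = np.random.default_rng(0)
dV = rng.normal(0, 1, size=1_000_000)
dW = rng.normal(0, 10, size=1_000_000)
dU = dV + dW

print(np.corrcoef(dU, dV)[0, 1])  # ~ 1/sqrt(1 + 10^2) ≈ 0.0995
```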
So I’d expect that if you change your experiment so that instead of looking at the results in some band, you take the best n results from each group, the best n results of the gradient descent group will be better on average.

Another intuition pump: let’s consider the spiky W scenario again. If V is constant everywhere, gradient descent will basically find us the nearest local maximum in W, which essentially adds random movement. But if V is a plane with a constant slope, and the random initialization is near two different local maxima in W, gradient descent will be biased towards the local maximum in W which is higher up on the plane of V. The very best points will tend to be those that are both on top of a spike in W and high up on the plane of V.
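Here’s a minimal simulation of that intuition pump, under assumptions I’m making up for illustration: a 1-D input, V a line with slope 0.01, W a sum of narrow Gaussian spikes, and the same budget of 2000 starting points for both groups (gradient ascent on U here is just gradient descent on −U). The prediction is that the best-n-by-proxy points from the gradient group end up with higher V on average than the best n from the random-sampling group.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy landscape (all numbers below are illustrative choices, not from the experiment):
# V is a line with constant slope; W is a sum of narrow Gaussian spikes.
slope = 0.01                        # dV/dx everywhere
centres = rng.uniform(0, 100, 30)   # spike locations
sigma, height = 0.3, 1.0            # spike width and height

def V(x):
    return slope * x

def W(x):
    d = x[:, None] - centres[None, :]
    return height * np.exp(-d ** 2 / (2 * sigma ** 2)).sum(axis=1)

def U(x):                           # the proxy we actually optimize and select on
    return V(x) + W(x)

def dU_dx(x):
    d = x[:, None] - centres[None, :]
    dW = (-d / sigma ** 2 * height * np.exp(-d ** 2 / (2 * sigma ** 2))).sum(axis=1)
    return slope + dW

# Same budget of starting points for both groups.
N, n = 2000, 20
starts = rng.uniform(0, 100, N)

random_pts = starts.copy()          # group 1: just the random samples themselves

ga_pts = starts.copy()              # group 2: gradient ascent on the proxy U
for _ in range(300):
    ga_pts = ga_pts + 0.1 * dU_dx(ga_pts)

for name, pts in [("random sampling", random_pts), ("gradient ascent", ga_pts)]:
    top = pts[np.argsort(U(pts))[-n:]]            # best n by the *proxy*
    print(f"{name:16s}  mean V of the top {n}: {V(top).mean():.3f}")
```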
I think this is a more general point which applies regardless of the optimization algorithm you’re using: If your proxy consists of something you’re trying to maximize plus unrelated noise that’s roughly constant in magnitude, you’re still best off maximizing the heck out of that proxy, because the very highest value of the proxy will tend to be a point where the noise is high and the thing you’re trying to maximize is also high.
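The same point in miniature, with no gradients at all (standard-normal V and 3x-larger noise are arbitrary choices): the harder you select on the proxy, the more V you get, as long as the noise’s scale doesn’t grow as you push the proxy up.

```python
import numpy as np

# Selecting harder on a proxy U = V + noise, where the noise is unrelated to V and
# constant in scale.  Mean V of the selected candidate keeps rising with k.
rng = np.random.default_rng(0)
trials = 5000

for k in [10, 100, 1000]:                        # how many candidates we select over
    V = rng.normal(size=(trials, k))
    noise = 3.0 * rng.normal(size=(trials, k))   # noise bigger than the signal
    picked = np.argmax(V + noise, axis=1)        # take the proxy-maximizing candidate
    print(f"k={k:5d}  mean V of the proxy-argmax: {V[np.arange(trials), picked].mean():.3f}")
```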
“Constant unrelated noise” is an important assumption. For example, if you’re dealing with a machine learning model, noise is likely to be higher for inputs off of the training distribution, so the top n points might be points far off the training distribution chosen mainly on the basis of noise. (Goodhart’s Law arguably reduces to the problem of distributional shift.) I guess then the question is what the analogous region of input space is for approval. Where does the correspondence between approval and human value tend to break down?
(Note: although W itself can’t be i.i.d., W’s gradient could be replaced with noise that is. I think this corresponds to perturbed gradient descent, which apparently helps performance on V too.)
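In code, I mean something like the sketch below: the per-step "noise" is drawn fresh and i.i.d., instead of being the gradient of some fixed bumpy W. (grad_V, the step size, the noise scale, and the step count are all placeholders, not tuned values.)

```python
import numpy as np

# Gradient ascent on V where the noise contribution to each step is i.i.d. across
# steps, rather than the gradient of a fixed W.
def noisy_gradient_ascent(x0, grad_V, steps=1000, lr=0.01, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + lr * (grad_V(x) + rng.normal(0.0, noise_scale, size=x.shape))
    return x
```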