Thanks for writing! I agree that the factors this post describes make some types of gradient hacking extremely difficult, but I don’t see how they make the following approach to gradient hacking extremely difficult.
> Suppose that an agent has some trait which gradient descent is trying to push in direction x because the x-ness of that trait contributes to the agent’s high score; and that the agent wants to use gradient hacking to prevent this. Consider three possible strategies that the agent might try to implement, upon noticing that the x-component of the trait has increased [...] [One potential strategy is] Deterministically increasing the extent to which it fails as the x-component increases.
(from here)
This approach to gradient hacking seems plausibly resistant to the factors this post describes, for the following reason. With the above approach, the gradient hacker only worsens performance by a small amount. At the same time, it plausibly improves performance in other ways, since the planning abilities that enable gradient hacking may also produce good performance on tasks that demand planning. So, overall, modifying or reducing the influence of the gradient hacker plausibly worsens performance. In other words, gradient descent might not modify away a gradient hacker, because gradient hacking is convergently incentivized behavior that worsens performance only by a small amount (and plausibly not at all on net).
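To make that more concrete, here is a minimal toy sketch of the gradient structure I have in mind. (Everything in it is invented for illustration: real networks wouldn’t have clean scalar knobs like `x` and `c`, and the functional forms and constants are arbitrary.)

```python
# Toy scalars (all made up for illustration):
#   x = the x-component of the trait, which the task loss rewards
#   c = "planning capacity" that (i) helps on planning-heavy tasks and
#       (ii) implements the hacker's deterministic extra failure as x grows

A = 1.0   # marginal task benefit of increasing x
B = 5.0   # marginal benefit of planning capacity (broadly useful)
K = 1.0   # strength of the hacker's "fail more as x increases" penalty

def loss(x, c):
    # Task term rewards x, planning term rewards c, and the hacker term adds
    # a small deterministic penalty that grows with x while c is present.
    return -A * x - B * c + K * c * x

def grad(x, c):
    dL_dx = -A + K * c   # the hacker's penalty can cancel the pressure on x
    dL_dc = -B + K * x   # planning benefit vs. the small cost of hacking
    return dL_dx, dL_dc

print(grad(x=0.5, c=1.0))   # -> (0.0, -4.5)
# dL/dx ~ 0: the deterministic extra failure roughly cancels gradient descent's
# push on x. dL/dc < 0: shrinking the circuitry that implements the hacker would
# worsen the loss on net, because the same circuitry buys the planning benefit
# (B), which outweighs the small cost of hacking (K * x).
```

The point of the sketch is just that the net gradient on whatever circuitry implements the hacker can point toward keeping it, even while the hacker’s penalty term is actively worsening the loss by a small amount.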
(Maybe gradient descent would then train the model to have a heuristic of not doing gradient hacking, while keeping the other benefits of improved planning abilities? But I feel pretty clueless about whether gradient hacking would be encoded in a way that allows such a heuristic to be inserted.)
(I read kind of quickly so may have missed something.)
I agree with parts of that. I’d also add the following (or I’d be curious why they’re not important effects):
- Slower takeoff → warning shots → improved governance (e.g. through most/all major actors getting clear[er] evidence of risks) → less pressure to rush
- (As OP argued) Shorter timelines → China has less of a chance to have leading AI companies → less pressure to rush
More broadly though, maybe we should be using more fine-grained concepts than “shorter timelines” and “slower takeoffs”:
- The salient effects of “shorter timelines” seem pretty dependent on what the baseline is.
    - The point about China seems important if the baseline is 30 years, and not so much if the baseline is 10 years.
- The salient effects of “slowing takeoff” seem pretty dependent on what part of the curve is being slowed. Slowing it down right before there’s large risk seems much more valuable than (just) slowing it down earlier in the curve, as the last few years’ investments in LLMs did.