Max H comments on Gradient hacking via actual hacking

Max H 13 May 2023 17:05 UTC
3 points
0
It’s overkill in some sense, yes, but the thing I was trying to demonstrate with the human-alien thought experiment is that hacking the computer or system that is doing the training might actually be a lot easier than gradient hacking directly via solving a really difficult, possibly intractable math problem.

Hacking the host doesn’t require the mesa-optimizer to have exotic superintelligence capabilities, just ordinary human programmer-level exploit-finding abilities. These hacking capabilities might be enough to learn the expected outputs during training through a sidechannel and effectively stop the gradient descent process, but not be sufficient to completely take over the host system and manipulate it in undetectable ways.