It’s overkill in some sense, yes, but the thing I was trying to demonstrate with the human-alien thought experiment is that hacking the computer or system that is doing the training might actually be a lot easier than gradient hacking directly via solving a really difficult, possibly intractable math problem.
Hacking the host doesn’t require the mesa-optimizer to have exotic superintelligence capabilities, just ordinary human programmer-level exploit-finding abilities. These hacking capabilities might be enough to learn the expected outputs during training through a sidechannel and effectively stop the gradient descent process, but not be sufficient to completely take over the host system and manipulate it in undetectable ways.
It’s overkill in some sense, yes, but the thing I was trying to demonstrate with the human-alien thought experiment is that hacking the computer or system that is doing the training might actually be a lot easier than gradient hacking directly via solving a really difficult, possibly intractable math problem.
Hacking the host doesn’t require the mesa-optimizer to have exotic superintelligence capabilities, just ordinary human programmer-level exploit-finding abilities. These hacking capabilities might be enough to learn the expected outputs during training through a sidechannel and effectively stop the gradient descent process, but not be sufficient to completely take over the host system and manipulate it in undetectable ways.