you can’t rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it’s acquired strategic awareness.)
I don’t buy this.
At a sufficiently granular scale, the development of deception and strategic-awareness capabilities will be smooth and continuous.
Even in a case where an AGI shoots up to superintelligence over a couple of minutes and immediately decides to hide its capabilities, we could detect that by, e.g., spinning off a version of the agent every 1000 gradient steps and running it through a testing regime. As long as we test frequently enough, we can see gradual increases in capability, which could prompt the monitoring system to increase the testing frequency. And we could have bright lines at which we stop training altogether. (For instance, when previously exhibited capabilities start to disappear, or when the system makes some initial fumbling attempts at deception.)
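For concreteness, here is a minimal Python sketch of that kind of checkpoint-and-test loop. It is not a real implementation: `train_steps`, `snapshot`, `run_eval_suite`, and `detect_deception_attempts` are hypothetical placeholders for whatever training and evaluation harness is actually in use, and the thresholds are made-up illustrative numbers.

```python
# Hypothetical sketch of the checkpoint-and-test regime described above;
# none of these helper names refer to an existing library.

REGRESSION_MARGIN = 0.05   # how much a previously exhibited capability may drop
GROWTH_THRESHOLD = 0.10    # capability jump that triggers more frequent testing
MIN_INTERVAL = 100         # never test less often than every 100 gradient steps


def monitored_training(model, total_steps, interval=1000):
    """Train in chunks of `interval` gradient steps, evaluating a frozen
    checkpoint after each chunk and halting at the bright lines."""
    prev_scores = None
    steps_done = 0

    while steps_done < total_steps:
        train_steps(model, interval)          # placeholder: run `interval` SGD steps
        steps_done += interval

        checkpoint = snapshot(model)          # placeholder: spin off a frozen copy
        scores = run_eval_suite(checkpoint)   # placeholder: behavioral capability battery

        if prev_scores is not None:
            # Bright line 1: previously exhibited capabilities start to disappear
            # (possible sandbagging), so stop training altogether.
            if any(scores[c] < prev_scores[c] - REGRESSION_MARGIN for c in prev_scores):
                return "halted: capability regression"

            # Bright line 2: initial fumbling attempts at deception on the probes.
            if detect_deception_attempts(checkpoint):   # placeholder probe
                return "halted: deception attempt"

            # Capabilities climbing quickly, so test more often.
            if max(scores[c] - prev_scores[c] for c in prev_scores) > GROWTH_THRESHOLD:
                interval = max(MIN_INTERVAL, interval // 2)

        prev_scores = scores

    return "completed"
```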
I don’t necessarily expect anyone to implement a system like this, but it seems like a way to use behavioral inspection to determine those facts, so long as the system is improving through SGD.