My main concern about the counterfactual oracle is that it doesn’t prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (since it is only rewarded with probability ϵ, and in every other episode its answer is read by the humans), and it might stumble upon an escape message/action then. Even if it is shut down after answering, the humans will still read the escape message and, if it is convincing enough, release the AI.
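
To make the worry concrete, here is a minimal toy simulation; all of the numbers (the erasure probability, the number of exploratory episodes, and the chance that a random exploratory output happens to be an escape message) are made-up assumptions, not anything from the oracle proposal itself. The point is just that with a small ϵ, nearly every escape message produced during exploration gets read, since the answer is only hidden during the rare erasure events.

```python
import random

# Toy sketch of the concern (all parameters are assumptions):
# the oracle is only rewarded during an "erasure" event, which happens with
# probability EPSILON; in every other episode the humans read whatever the
# oracle emitted, so an escape message found while exploring is usually seen.

EPSILON = 0.01       # assumed erasure probability
N_EPISODES = 10_000  # assumed number of exploratory episodes
P_ESCAPE = 0.001     # assumed chance an exploratory output is an escape message

random.seed(0)
read_escapes = 0
for _ in range(N_EPISODES):
    emits_escape = random.random() < P_ESCAPE
    erasure = random.random() < EPSILON   # only here is the oracle rewarded
    if emits_escape and not erasure:      # humans read the (possibly fatal) message
        read_escapes += 1

print(f"escape messages read by humans: {read_escapes}")
# Expected count is N_EPISODES * P_ESCAPE * (1 - EPSILON) ≈ 10;
# a single sufficiently convincing one may be enough.
```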
The escape message could also include the source code of itself, a successor, or an assistant*.
*Whereas a successor is a variant of the original, an assistant has a narrower task such as “Secure my release from the box”, “Advise Tesla so that their stock price goes up/down”, or “Try to manipulate the stock market”.