What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (eg, situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don’t have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we’re verifying.
However, note that even this is meaningfully different from just using RLHF, especially in settings with some adversarial component. In particular, a situation that is OOD for the policy need not be OOD for the world model. For example, learning a model of the rules of chess is much easier than learning a policy that is good at playing chess. It would also be much easier to prove a learning-theoretic guarantee for the former than the latter.
So, suppose we’re training a chess-playing AI, and we want to be sure that it cannot be defeated in n moves or fewer. The RLHF strategy would, in this scenario, essentially amount to letting a group of human red-teamers play against the AI many times, looking for cases where they manage to beat it within n moves, and then training against those cases. This would only give us very weak quantitative guarantees, because there might be strategies that the red-teamers didn’t think of.
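Schematically, that loop looks something like the following (a hypothetical sketch only; `collect_red_team_games` and `finetune_against` are placeholder names passed in as arguments, not a real API):

```python
def red_team_and_retrain(policy, n_moves, rounds,
                         collect_red_team_games, finetune_against):
    """Schematic red-teaming loop: let human red-teamers try to beat the policy
    within n_moves, then train against whatever losses they find.
    `collect_red_team_games` and `finetune_against` are hypothetical callables
    supplied by the caller."""
    for _ in range(rounds):
        # Games in which a red-teamer defeated the policy in <= n_moves moves.
        losses = collect_red_team_games(policy, max_moves=n_moves)
        if not losses:
            break  # No known failures -- but untried strategies may remain.
        policy = finetune_against(policy, losses)
    return policy
```

The resulting "guarantee" only covers the strategies the red-teamers actually tried, which is the weakness the alternative below is meant to address.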
Alternatively, we could train a network to model the rules of chess (in this particular example we could of course also specify this model manually, but let’s ignore that for the sake of the argument). It seems fairly likely that we could train this model to be highly accurate. Moreover, using standard statistical methods, we could estimate a bound on the fraction of the state-action space on which this model makes an incorrect prediction, and derive other learning-theoretic guarantees (depending on how the training data is collected, etc). We could then formally verify that the chess AI cannot be beaten in n moves, relative to this world model. This would produce a much stronger quantitative guarantee, and the assumptions behind this guarantee would be much easier to audit. The guarantee would of course still not be an absolute proof, because there will likely be some errors in the world model, the chess AI might be targeting precisely those errors, and so on, but it is substantially better than what you get in the RLHF case. Also note that, as we run the chess AI, we could track the predictions of the world model online. If the world model ever makes a prediction that contradicts what in fact happens, we could shut down the chess AI, or transition to a safe mode. This gives us even stronger guarantees.
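To make this a bit more concrete, here is a minimal sketch of the three pieces: a Hoeffding-style upper bound on the learned model’s error rate estimated from held-out data, a bounded check that no opponent line defeats the policy within n plies relative to that model, and the online monitor that halts the system if the model ever mispredicts. The interfaces `world_model.legal_moves`, `world_model.next_state`, `world_model.is_loss_for_policy`, `world_model.is_opponent_to_move`, and `policy.choose_move` are assumptions for illustration, not any real library.

```python
import math

def error_rate_upper_bound(num_errors, num_heldout, delta=1e-3):
    """One-sided Hoeffding bound: with probability >= 1 - delta over an i.i.d.
    held-out sample, the model's true error rate (under the sampling
    distribution) is at most the returned value."""
    return num_errors / num_heldout + math.sqrt(math.log(1.0 / delta) / (2.0 * num_heldout))

def unbeatable_within(world_model, policy, state, plies_left):
    """Check, relative to `world_model`, that no opponent line leads to a loss
    for the policy within `plies_left` plies. Exhaustive over opponent moves;
    the policy under verification is assumed to move deterministically."""
    if world_model.is_loss_for_policy(state):
        return False
    if plies_left == 0:
        return True
    if world_model.is_opponent_to_move(state):
        # Universally quantify over every legal opponent move.
        return all(
            unbeatable_within(world_model, policy,
                              world_model.next_state(state, move), plies_left - 1)
            for move in world_model.legal_moves(state)
        )
    # Our side to move: follow the policy being verified.
    move = policy.choose_move(state)
    return unbeatable_within(world_model, policy,
                             world_model.next_state(state, move), plies_left - 1)

class SafeModeTriggered(Exception):
    """Raised when the deployed world model is caught making a wrong prediction."""

def monitored_step(world_model, state, move, observed_next_state):
    """Online check during deployment: if the world model's prediction ever
    disagrees with what actually happened, stop relying on the verified
    guarantee and fall back to a safe mode."""
    if world_model.next_state(state, move) != observed_next_state:
        raise SafeModeTriggered("world model mispredicted; halting the chess AI")
    return observed_next_state
```

The brute-force search is of course only feasible for small n; in practice one would hand the same property to a model checker or solver. The point is just that the property is stated against the world model, rather than against whatever games the red-teamers happened to play.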
This is of course a toy example, since it is a case where we could specify a perfect world model manually (excluding the opponent), and where we could also design a chess-playing AI manually, etc. However, I think it illustrates that there is a meaningful difference between RLHF and formal verification relative to even a black-box world model. The complexity of the space of all policies grows much faster than the complexity of the environment, and this fact can be exploited.