may make it hard to maintain an accurate simulation, so it is not obvious the model would choose to do this
Why would the model choose to maintain an accurate simulation in the first place? An RLHF model may be reinforced not to do this, best-of-n sampling may select against it, and a human operator may notice that maintaining an accurate simulation is hard and change the prompt. It seems about as hard to guard against as p-hacking.
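To make the best-of-n point concrete, here is a minimal sketch of the selection pressure I have in mind; `generate` and `reward_model` are hypothetical stand-ins, not anything from the post:

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Sample n completions and keep the one the reward model scores highest.

    Illustrative only: any completion pattern the reward model penalizes
    (e.g. one that keeps tracking an explicit, accurate simulation) gets
    systematically selected against, even though the underlying predictive
    model itself is never retrained.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return max(zip(scores, candidates), key=lambda sc: sc[0])[1]
```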
In the first part of this sequence, we clarify that we are focusing on the case where the model is a predictive model of the world. The fourth part, on making inner alignment as easy as possible, outlines some reasons why we think this kind of predictive model is a possible (even likely) outcome of the training process. Of course, it is also possible that the model is not precisely a predictive model, but is close enough to one that the content of “Conditioning Predictive Models” remains relevant.