The possibility is that Alice might always be able tell that she’s in a simulation no matter what we condition on. I think this is pretty much precluded by the assumption that the generative model is a good model of the world, but if that fails then it’s possible Alice can tell she’s in a simulation no matter what we do. So a lot rides on the statement that the generative model remains a good model of the world regardless of what we condition on.
Paul’s RSA-2048 counterexample is an example of a way our generative model can fail to be good enough no matter how hard we try. The core idea is that there exist things that are extremely computationally expensive to fake and very cheap to check the validity of, so faking them convincingly will be extremely hard.
The other horn is if the generative model reasons about the world abstractly, so that it just gives us a good guess about what the output of the AI would be if it really was in the real world (and got to see some large hash collision).
But now it seems likely that creating this generative model would require solving several tricky alignment problems so that it generalizes its abstractions to novel situations in ways we’d approve of.
I don’t think that’s an example of the model noticing it’s in a simulation. There’s nothing about simulations versus the real world that makes RSA instances more or less likely to pop up.
Rather, that’s a case where the model just has a defecting condition and we don’t hit it in the simulation. This is what I was getting at with “other challenge” #2.
Computationally expensive things are less likely to show up in your simulation than the real world, because you only have so much compute to run your simulation. You can’t convincingly fake the AI having access to a supercomputer.
Paul’s RSA-2048 counterexample is an example of a way our generative model can fail to be good enough no matter how hard we try. The core idea is that there exist things that are extremely computationally expensive to fake and very cheap to check the validity of, so faking them convincingly will be extremely hard.
I think this is only one horn of a dilemma.
The other horn is if the generative model reasons about the world abstractly, so that it just gives us a good guess about what the output of the AI would be if it really was in the real world (and got to see some large hash collision).
But now it seems likely that creating this generative model would require solving several tricky alignment problems so that it generalizes its abstractions to novel situations in ways we’d approve of.
I don’t think that’s an example of the model noticing it’s in a simulation. There’s nothing about simulations versus the real world that makes RSA instances more or less likely to pop up.
Rather, that’s a case where the model just has a defecting condition and we don’t hit it in the simulation. This is what I was getting at with “other challenge” #2.
Computationally expensive things are less likely to show up in your simulation than the real world, because you only have so much compute to run your simulation. You can’t convincingly fake the AI having access to a supercomputer.