Basically any slow-takeover planning can happen from within a simulation; it just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world’s data, right?),
I actually think I disagree with the assumption that it will be trained on all of the world's data, for reasons related to capabilities. I personally expect the amount of real-world (IRL) data to be much, much lower, plausibly only 0.1% of what the model actually sees or knows.
One of my cruxes is that, given that its inputs are very likely to be mostly synthetic data, there is no reason for the AI to elevate the hypothesis that it's in a sim different from our world.
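To sketch why I think the data itself shouldn't elevate that hypothesis, here is a toy Bayesian framing (the indistinguishability assumption is doing all the work, and the "sim vs. real" labels are just my shorthand):

$$
\frac{P(\text{sim} \mid D)}{P(\text{real} \mid D)} \;=\; \frac{P(D \mid \text{sim})}{P(D \mid \text{real})} \cdot \frac{P(\text{sim})}{P(\text{real})}
$$

If the synthetic corpus is generated so that, from the model's perspective, $P(D \mid \text{sim}) \approx P(D \mid \text{real})$, then the likelihood ratio is roughly 1 and the posterior odds never move away from the prior. On that assumption, the training data gives the model no evidential reason to promote the "I'm in a sim distinct from the real world" hypothesis.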
My other crux is that I usually imagine models being trained on values data before they become very capable, and given that alignment generalizes farther than capabilities, I don't buy the assumption that misaligned goals will naturally emerge in a neural network trained on synthetic data.
If a model trained on synthetic data is expected to have good performance out of distribution (on real-world problems), then I think it would also be expected to have high performance at assessing whether it's in a simulation. Narrowing its domain of expertise sounds distinctly harder than using mostly synthetic data.
If it's a model limited to, e.g., the world of true mathematical theorems from synthetic data, then perhaps this would narrow its capabilities enough. I don't know what happens if such a model starts investigating theorems about decision theory, statistical inference, and machine learning. At some point, self-identification seems likely. I am not sure how to test the effectiveness of synthetic data on models that achieve self-awareness.