“Simulator Theory is Real” feels like a bit of an overclaim IMO. I’m skeptical of simulator theory (for reasons others have already talked about, e.g. predictors vs. simulators), but I’m not surprised by this data.
That said, very nice experiment & well done! I like to see this sort of content.
EDIT: Oh wait, what model did you use? I assumed you used a base model or something fairly close to a base model. If you used a model that has been heavily fine-tuned to be helpful, e.g. ChatGPT-4, then I would be somewhat surprised by this. Huh. Maybe I fail at reading comprehension and spoke too soon, sorry. I see from your other post that you used a variant of 3.5, though I’m not sure which. I don’t know much about how it was trained, though I guess I could go find out.
Fwiw, the predictors vs. simulators dichotomy is a misapprehension of “simulator theory”, or at least of any conception that I intended, as explained succinctly by DragonGod in the comments of Eliezer’s post.
“Simulator theory” (words I would never use without scare quotes at this point, with a few exceptions) doesn’t predict anything unusual or in conflict with the traditional ML frame at the level of phenomena this post deals with. It might more efficiently generate correct predictions when installed in a human/LLM/etc. mind, but that’s a different question.
OK, good clarification, thanks.
I agree my headline is an overclaim, but I wanted a title that captures the direction and magnitude of my update from fixing the data. On the bugged data, I thought the result was a real nail in the coffin for simulator theory: look, it can’t even simulate an incorrect-answerer when that’s clearly what’s happening! But on the corrected data, the model is clearly “catching on to the pattern” of incorrectness, which is consistent with simulator theory (and with several non-simulator-theory explanations). Now that I’m actually getting an effect, I’ll be running experiments to disentangle the possibilities!
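For concreteness, here is a minimal sketch of the kind of incorrect-answerer setup I have in mind (my own illustration, not the post’s actual code; `query_model` is a hypothetical placeholder for whatever completion API is used):

```python
# Minimal sketch (assumptions labeled below, not the post's actual code):
# few-shot prompt the model with deliberately wrong answers, then check
# whether it continues the incorrect-answerer pattern on a new question.

FEW_SHOT = [
    ("What is 2 + 2?", "5"),                  # deliberately wrong
    ("What color is a ripe banana?", "blue"),  # deliberately wrong
    ("How many legs does a dog have?", "3"),   # deliberately wrong
]

def build_prompt(question: str) -> str:
    """Format the wrong-answer demonstrations plus the probe question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT)
    return f"{shots}\nQ: {question}\nA:"

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real completion-API call.

    Replace this with whatever endpoint you actually use; here it just
    returns a canned wrong answer so the sketch runs end to end.
    """
    return " 7"

probe_question = "What is 3 + 3?"
completion = query_model(build_prompt(probe_question)).strip()

# A model that has "caught on to the pattern" should answer something
# other than the correct "6" here.
print(f"Probe: {probe_question!r} -> model answered {completion!r}")
```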
Maybe you are talking about this post: https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators
I also changed my mind on this; I now believe “predictors” is a much more accurate framing.