You write that even if the mechanistic model is wrong, if it “has some plausible relationship to reality, the predictions that it makes can still be quite accurate.” I think this is often true, and true in particular in the case at hand (explicit search vs not). However, there are many domains where it is false, where a large range of plausible mechanistic models make badly wrong predictions. Roughly, this depends on how sensitive the details of the prediction are to the details of the mechanistic model.

In the explicit search case, many other plausible models of how RL agents might mechanistically function also imply agent-ish behavior, even if the model is not primarily using explicit search. But that is only because the agent must accomplish the training objective, which heavily constrains the space of possible behaviors. In questions where the prediction space is less constrained to begin with (e.g. questions about how the far future will go), different “mechanistic” explanations (for example, thinking that the far future will be controlled by a human superintelligence vs an alien superintelligence vs evolutionary dynamics) imply significantly different predictions.
Furthermore, one of my favorite strategies here is to come up with many different, independent mechanistic models and then see if they all converge: if you get the same prediction from lots of different mechanistic models, that adds a lot of credence to that prediction being quite robust.
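As a toy illustration of that convergence check (not something from the exchange above; the model names and predictions are made up), one might imagine reducing each independent mechanistic story to a coarse prediction and only trusting conclusions that a large fraction of the stories share:

```python
from collections import Counter

def convergent_prediction(model_predictions, threshold=0.8):
    """Return the most common prediction if at least `threshold` of the
    independent models agree on it, else None."""
    counts = Counter(model_predictions.values())
    prediction, n_agree = counts.most_common(1)[0]
    if n_agree / len(model_predictions) >= threshold:
        return prediction
    return None

# Hypothetical example: three different mechanistic stories about an RL agent,
# each reduced to a coarse behavioral prediction.
predictions = {
    "explicit_search_model": "agent-ish behavior",
    "learned_heuristics_model": "agent-ish behavior",
    "memorized_policy_model": "agent-ish behavior",
}

print(convergent_prediction(predictions))  # -> "agent-ish behavior"
```

The code itself isn't the point; the point is that the signal comes from agreement across genuinely independent models. If the models share most of their assumptions, their agreement tells you much less.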
Yes, I agree—this is why I say: