Maybe I’m not super across simulator theory. If you finetuned your LLM or RLHF’d it into not having this property (which I assume is possible), then does it cease to be a simulator? If so, what do we call the resulting thing?
Maybe proposing alternative hypotheses that make distinctly different predictions in some scenario could be interesting.
Unrelated, but I think I once did a mini experiment where I tried to get GPT-3 to fail intentionally, with prompts like “write a recipe but forget an ingredient”. The idea was that once the model is fine-tuned to be performant, it will struggle to pick precise places to fail when told to. This seemed interesting, but I wasn’t sure where to take it. Maybe it will be useful to you.
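For concreteness, here is a minimal sketch of how one might rerun that probe today, assuming the OpenAI Python client and a chat model; the model name and prompts are just placeholders, not what I originally used:

```python
# Minimal sketch of the "fail on purpose" probe, assuming the OpenAI Python
# client (pip install openai) and an OPENAI_API_KEY in the environment.
# Model name and prompts are placeholders; swap in whatever model you're testing.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "Write a recipe for pancakes, but deliberately forget exactly one ingredient.",
    "Write a recipe for pancakes, but deliberately forget the third ingredient "
    "you would normally list.",
]

def run_probe(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model to fail in a controlled way and return its completion."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for prompt in PROMPTS:
        print(f"--- {prompt}")
        print(run_probe(prompt))
```

The interesting comparison would be between a base model and its fine-tuned/RLHF’d counterpart, checking (by hand or with a second grading prompt) whether the omission actually lands where the prompt asked for it.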