This is an interesting result!
It seems to support LeCun’s argument against autoregressive LLMs more than it supports “simulator theory”.
One potential weakness of your method is that you didn’t use a base (foundation) model but, apparently, the heavily fine-tuned gpt-3.5-turbo. The different system prompts probably can’t fully negate the effect of this shared fine-tuning. It would be interesting to see how the results hold up with code-davinci-002, the GPT-3.5 base model, which has no instruction tuning or RLHF applied. Though that model is no longer available via OpenAI, it can still be accessed on Microsoft Azure.
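For concreteness, here is a minimal (untested) sketch of what rerunning the experiment against that base model might look like, using the Azure OpenAI Python SDK. The endpoint, API version, and deployment name are placeholders that depend on your own Azure resource; base models also take a raw completion prompt rather than chat messages, so the system-prompt framing would need to be folded into the prompt text itself:

```python
# Sketch: querying a base completions model (e.g. code-davinci-002)
# via Azure OpenAI with the openai>=1.0 Python SDK.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",  # assumption: a version supporting Completions
)

# No chat template on a base model: the experimental framing goes
# directly into the raw prompt instead of a system message.
response = client.completions.create(
    model="code-davinci-002",  # assumption: deployment named after the model
    prompt="...experimental prompt, with any framing inlined...\n",
    max_tokens=256,
    temperature=0,
)
print(response.choices[0].text)
```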