And I’m not thinking of something as shallow as “prompt GPT-4 in a certain way”, but rather of “working with the labs to actually optimize models for passing the test” (but of course not releasing them), which could be interesting for LLM science.
The low(er) quality paper you mentioned actually identified the main reason for GPT-4 failing Turing tests: linguistic style, not intelligence. But this problem with writing style is not surprising. The authors used the publicly available ChatGPT-4, a fine-tuned model which

1. has an “engaging assistant which is eager to please” conversation style, which the model can’t always properly shake off even when instructed otherwise, and
2. likely suffers from a strongly degraded general ability to produce varied writing styles, in contrast to the base model.

Although no GPT-4 base model is publicly available, ChatGPT-3.5 at least is known to be much worse at writing fiction than the GPT-3.5 base model. Either the supervised instruction tuning or the RLHF tuning messes with the model’s ability to imitate style.
So rather than doing complex fine-tuning for the Turing test, it seems advisable to instead use a base model with appropriate prompt scaffolding (“The following is a transcript of a Turing test conversation where both participants were human” etc.) to create the chat environment. I think the strongest base model currently available to the public is Llama 3.1, which was released recently.
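To make that concrete, here is a minimal sketch of the scaffolding using the Hugging Face transformers API. The checkpoint name, the speaker-tag format, and the sampling settings are my own assumptions for illustration, not anything the paper prescribes; the point is only that the test is framed as plain text completion of a “human vs. human” transcript rather than as an assistant answering questions.

```python
# Minimal sketch (assumptions: checkpoint ID, "A:"/"B:" speaker tags, sampling params).
# The base model continues a transcript framed as human-vs-human, so its most
# likely continuation is an ordinary human reply, not assistant-speak.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B"  # hypothetical choice of base (non-instruct) checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def next_reply(history: list[tuple[str, str]]) -> str:
    """Continue the transcript with one more line from participant B."""
    # Scaffolding: assert up front that both participants are human.
    prompt = ("The following is a transcript of a Turing test conversation "
              "where both participants were human.\n\n")
    for speaker, line in history:
        prompt += f"{speaker}: {line}\n"
    prompt += "B:"  # ask the model to complete B's next utterance

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,   # sampling rather than greedy decoding, which reads as robotic
        temperature=0.9,
    )
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    # A base model will happily keep writing both sides of the dialogue;
    # cut the completion off at the next speaker tag.
    return completion.split("\nA:")[0].strip()

print(next_reply([("A", "hey, long queue today huh"), ("B", "yeah, brutal")]))
```

Note that there is no system prompt and no chat template anywhere in this setup; the only steering comes from the transcript framing itself, which is exactly what avoids the fine-tuned assistant style the paper ran into.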