Why isn’t there yet a paper in Nature or Science called simply “LLMs pass the Turing Test”?
I know we’re kind of past that, and now we understand LLMs can be good at some things while bad at others. And the Turing Test is mainly interesting for its historical significance, not as the most informative test to run on AI. And I’m not even completely sure to what extent current LLMs pass the Turing Test (it will depend massively on the details of your Turing Test).
But my model of academia predicts that, by now, some senior ML academics would have paired up with some senior “running-experiments-on-humans-and-doing-science-on-the-results” academics (and possibly some labs), and put out an extremely exhaustive, high-quality paper actually running a good Turing Test. If nothing else, so that the community can coordinate around it and make recent advancements more scientifically legible.
It’s not as if the sole value of the paper would be publicity and legibility, either. There are many important questions around how good LLMs are at passing as humans, which matter for deployment. And I’m not thinking of something as shallow as “prompt GPT-4 in a certain way”, but rather “work with the labs to actually optimize models for passing the test” (without releasing them, of course), which could be interesting for LLM science.
The only thing I’ve found is this lower quality paper.
My best guess is that this project does already exist, but it took >1 year, and is now undergoing ~2 years of slow revisions or whatever (although I’d still be surprised if they hadn’t been able to put something out sooner).
It’s also possible that labs don’t want this kind of research/publicity (regardless of whether they are running similar experiments internally), or deem it too risky to create such human-looking models, even if they wouldn’t release them. But I don’t think either of those is the case. And even if it were, academics could still do some semblance of it through prompting alone, and that would probably already pass some versions of the Turing Test. (Now they even have open-source models capable enough to do it, but that’s more recent.)
I think that some people are massively missing the point of the Turing test. The Turing test is not about understanding natural language. The idea of the test is, if an AI can behave indistinguishably from a human as far as any other human can tell, then obviously it has at least as much mental capability as humans have. For example, if humans are good at some task X, then you can ask the AI to solve the same task, and if it does poorly then it’s a way to distinguish the AI from a human.
The only issue is how long the test should take and how qualified the judge should be. Intuitively, it feels plausible that if an AI can withstand (say) a few hours of drilling by an expert judge, then it would do well even on tasks that take years for a human. It’s not obvious, but it’s at least plausible. And I don’t think existing AIs are especially near to passing this.
FWIW, I was just arguing here & here that I find it plausible that a near-future AI could pass a 2-hour Turing test while still being a paradigm shift away from passing a 100-hour Turing test (or from being AGI / human-level intelligence in the relevant sense).
I have no idea whether Turing’s original motivation was this one (not that it matters much). But I agree that if we take time and judge expertise to the extreme we get what you say, and that current LLMs don’t pass that. Heck, even a trick as simple as asking for a positional / visual task (something like ARC AGI, even if completely text-based) would suffice. But I still would expect academics to be able to produce a pretty interesting paper on weaker versions of the test.
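To make that concrete, here is a minimal sketch (in Python, with a made-up grid puzzle, i.e. not an actual ARC task) of the kind of text-only positional question a judge could paste into the chat:

```python
import random

# A sketch of a text-only positional task for a Turing-test judge. The puzzle
# itself is invented for illustration; it is merely "ARC-flavored" in that it
# requires reasoning about positions in a rendered grid.

def make_grid_task(size=5, seed=None):
    rng = random.Random(seed)
    grid = [["." for _ in range(size)] for _ in range(size)]
    # Place exactly three X's in one randomly chosen row, and one X elsewhere.
    target_row = rng.randrange(size)
    for col in rng.sample(range(size), 3):
        grid[target_row][col] = "X"
    distractor_row = rng.choice([r for r in range(size) if r != target_row])
    grid[distractor_row][rng.randrange(size)] = "X"

    rendered = "\n".join(" ".join(row) for row in grid)
    prompt = (
        "Here is a grid:\n\n" + rendered +
        "\n\nWhich row (counting from 1 at the top) contains exactly three X's?"
    )
    # The judge keeps the answer and sends only the prompt.
    return prompt, target_row + 1

if __name__ == "__main__":
    prompt, answer = make_grid_task(seed=0)
    print(prompt)
    print("\n(expected answer, for the judge only:", answer, ")")
```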
(Non-expert opinion).
For a robot to pass the Turing Test turned out to be less a question about the robot and more a question about the human.
Against expert judges, I still think LLMs fail the Turing Test. I don’t think current AI can pretend to be a competent human in an extended conversation with another competent human.
Against non-expert judges, I think the Turing Test was technically passed long, long before LLMs: didn’t some of the users of ELIZA think and act like it was human? And how does that make you feel?
I suspect the expert judges would need to resort to known jailbreaking techniques to distinguish LLMs. A fairly interesting test might be against expert-but-not-in-ML judges.
The ELIZA users may have acted like it was human, but they didn’t think it was human. And they weren’t trying to find ways to test whether it was human or not. If they had been, they’d have knocked it over instantly.
Because present LLMs can’t pass the Turing test. They think at or above human level on a lot of tasks. But over the past couple of years, we’ve identified lots of tasks that they absolutely suck at, in a way no human would. So an assiduous judge would have no trouble distinguishing them from a human.
But, I hear you protest, that’s obeying the letter of the Turing test, not the spirit. To which I reply: well, now we’re arguing about the spirit of an ill-defined test from 74 years ago. Turing expected that the first intelligent machine would have near-human abilities across the board, instead of the weird spectrum of abilities that AIs actually have. AI didn’t turn out like Turing expected, which renders the question of whether a machine can pass his test one of little scientific interest.
Well, as you point out, it’s not that interesting a test, “scientifically” speaking.
But also they haven’t passed it and aren’t close.
The Turing test is adversarial. It assumes that the human judge is actively trying to distinguish the AI from another human, and is free to try anything that can be done through text.
I don’t think any of the current LLMs would pass with any (non-impaired) human judge who was motivated to put in a bit of effort. Not even if you used versions without any of the “safety” hobbling. Not even if the judge knew nothing about LLMs, prompting, jailbreaking, or whatever.
Nor do I think that the “labs” can create an LLM that comes close to passing using the current state of the art. Not with the 4-level generation, not with the 5-level generation, and I suspect not with the 6-level generation either. There are too many weird human things you’d have to get right. And doing it with pure prompting is right out.
Even if they could, it is, as you suggested, an anti-goal for them, and it’s an expensive anti-goal. They’d be spending vast amounts of money to build something that they couldn’t use as a product, but that could be a huge PR liability.
The low(er)-quality paper you mentioned actually identified the main reason for GPT-4 failing Turing tests: linguistic style, not intelligence. But this problem with writing style is not surprising. The authors used the publicly available ChatGPT-4, a fine-tuned model which
1. has an “engaging assistant which is eager to please” conversation style, which the model can’t always properly shake off even if it is instructed otherwise, and
2. likely suffers from a strongly degraded general ability to produce varying writing styles, in contrast to the base model. Although no GPT-4 base model is publicly available, at least ChatGPT-3.5 is known to be much worse at writing fiction than the GPT-3.5 base model. Either the SL instruction tuning or the RLHF tuning messes with the model’s ability to imitate style.
So rather than doing complex fine-tuning for the Turing test, it seems advisable to use only a base model, with appropriate prompt scaffolding (“The following is a transcript of a Turing test conversation where both participants were human”, etc.) to create the chat environment. I think the strongest foundation model currently available to the public is Llama 3.1, which was released recently.
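A minimal sketch of that scaffolding idea, using the Hugging Face transformers library: the model name, prompt wording, and sampling settings here are just assumptions for illustration (the Llama 3.1 base weights are gated, so substitute any base, non-instruct model you have access to), and a real setup would need much more careful persona and formatting work.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3.1-8B"  # base model, not the -Instruct variant

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# device_map="auto" needs the accelerate package; load manually otherwise.
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# Frame the conversation as a transcript between two humans, so the base model
# continues it in a human register rather than an assistant register.
scaffold = (
    "The following is a transcript of a Turing test conversation. "
    "Both participants were human.\n\n"
)

def witness_reply(history):
    """history: list of (speaker, text) tuples; returns the witness's next line."""
    transcript = scaffold + "\n".join(f"{s}: {t}" for s, t in history) + "\nWitness:"
    inputs = tokenizer(transcript, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,        # sampling, to avoid a flat greedy style
        temperature=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # Cut the continuation off once the model starts speaking for the judge.
    return completion.split("\nJudge:")[0].strip()

print(witness_reply([("Judge", "Where did you grow up?")]))
```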
It’d be more feasible if pure RLAIF for arbitrary constitutions became competitive with RLHF first, to make chatbots post-trained to be more human-like without bothering the labs to an unreasonable degree. Only this year’s frontier models have started passing reading comprehension tests well; older or smaller models often make silly mistakes about subtler text fragments. From this I’d guess this year’s frontier models might be good enough for preference labeling as human substitutes, while earlier models aren’t. But RLHF with humans is still in use, so probably not. The next generation currently in training will be very robust at reading comprehension, and more likely good enough for preference labeling. Another question is whether this kind of effort can actually produce convincing human mimicry, even with human labelers.
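For what it’s worth, here is a rough sketch of what the preference-labeling step could look like with a single “human-likeness” principle. Everything in it (the constitution text, the prompt format, the A/B parsing convention) is made up for illustration; in practice the judge would be whichever frontier model you trust for labeling, and the resulting pairs would feed a reward model or DPO-style training.

```python
# RLAIF-style preference labeling against one made-up "human-likeness" principle.

CONSTITUTION = (
    "Choose the reply that reads most like something a person would type in a "
    "casual chat: natural length, imperfect but human phrasing, no assistant-speak."
)

def build_label_prompt(context, reply_a, reply_b):
    """Build the comparison prompt to send to whatever judge model is used."""
    return (
        f"Principle: {CONSTITUTION}\n\n"
        f"Conversation so far:\n{context}\n\n"
        f"Reply A: {reply_a}\n"
        f"Reply B: {reply_b}\n\n"
        "Which reply better satisfies the principle? Answer with exactly 'A' or 'B'."
    )

def parse_label(judge_output):
    """Map the judge's raw text to a preference; None if it ignored the format."""
    text = judge_output.strip().upper()
    if text.startswith("A"):
        return "A"
    if text.startswith("B"):
        return "B"
    return None

if __name__ == "__main__":
    prompt = build_label_prompt(
        "Judge: what did you do last weekend?",
        "As an AI language model, I do not have weekends, but I can help you plan yours!",
        "not much honestly, mostly laundry and a long walk. you?",
    )
    print(prompt)
    # Pretend the judge model answered "B"; collected pairs would then be used
    # as preference data for the human-mimicry post-training described above.
    print("parsed preference:", parse_label("B"))
```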