If an AI cannot act the same way as a human under all circumstances (including when you’re not looking, when it would benefit it, whatever), then it has failed the Turing Test.
The whole point of a “test” is that it’s something you do before it matters.
As an analogy: suppose you have a “trustworthy bank teller test”, which you use when hiring for a role at a bank. Suppose someone passes the test, then after they’re hired, they steal everything they can access and flee. If your reaction is that they failed the test, then you have gotten confused about what is and isn’t a test, and what tests are for.
Now imagine you’re hiring for a bank-teller role, and the job ad has been posted in two places: a local community college, and a private forum for genius con artists who are masterful actors. In this case, your test is almost irrelevant: the con-artist applicants will disguise themselves as community-college applicants until it’s too late. You would be better off finding some way to avoid attracting the con artists in the first place.
Connecting the analogy back to AI: if you’re using overpowered training techniques that could have produced a superintelligence, and then trying to hobble it back down to an imitator that’s indistinguishable from a particular human, applying a Turing test is silly, because it doesn’t distinguish between something you’ve successfully hobbled and something that is hiding its strength.
That doesn’t mean that imitating humans can’t be a path to alignment, or that building wrappers on top of human-level systems doesn’t have advantages over building straight-shot superintelligent systems. But making something useful out of either of these strategies is not straightforward, and playing word games on the “Turing test” concept does not meaningfully add to either of them.
Perhaps you could rephrase this post as an implication:
IF you can make a machine that constructs human-imitator-AI systems,
THEN AI alignment in the technical sense is mostly trivialized, and you just have the usual human-politics problems plus the problem of preventing anyone else from making superintelligent black box systems.
So, out of these three problems, which is the hard one?
(1) Make a machine that constructs human-imitator-AI systems
(2) Solve usual political human-politics problems
(3) Prevent anyone else from making superintelligent black box systems
All three of these are hard, and all three fail catastrophically.
If you could make a human-imitator, the approach people usually talk about is extending this to an emulation of a human under time dilation. Then you take your best alignment researcher(s), simulate them in a box thinking about AI alignment for a long time, and launch a superintelligence with whatever parameters they recommend. (Aka: Paul Boxing)
All three of these are hard, and all three fail catastrophically.
I would be very surprised if all three of these are equally hard, and I suspect that (1) is the easiest and by a long shot.
Making a human-imitator AI, once you already have weakly superhuman AI, is a matter of cutting down capabilities, and I suspect it can be achieved by distillation, i.e. using the weakly superhuman AI that we will soon have to make a controlled synthetic dataset for pretraining and finetuning, and then a much larger and more thorough RLHF dataset.
Finally, you’d need to make sure the model didn’t have too many parameters.
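To make that pipeline concrete, here is a minimal toy sketch of the distillation idea, with the caveat that everything in it is a placeholder: GPT-2 stands in for the hypothetical weakly superhuman teacher, the prompts and hyperparameters are purely illustrative, and the RLHF stage is only indicated in a comment.

```python
# Toy sketch of the distillation pipeline described above (placeholders throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the stronger model
teacher.eval()

# Step 1: use the teacher to generate a controlled synthetic corpus of human-like text.
prompts = [
    "Describe how you would spend a quiet Sunday afternoon.",
    "Explain, as a patient teacher would, why the sky is blue.",
]
synthetic_corpus = []
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = teacher.generate(ids, max_new_tokens=64, do_sample=True, top_p=0.9,
                               pad_token_id=tokenizer.eos_token_id)
    synthetic_corpus.append(tokenizer.decode(out[0], skip_special_tokens=True))

# Step 2: train a deliberately small student ("not too many parameters") on that corpus.
student = GPT2LMHeadModel(GPT2Config(n_layer=4, n_head=4, n_embd=256))
optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)
student.train()
for text in synthetic_corpus:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    loss = student(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Step 3 (omitted): a much larger RLHF / preference-tuning pass, plus checks that the
# student stays within the intended capability envelope.
```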
I would mostly disagree with the implication here:
IF you can make a machine that constructs human-imitator-AI systems,
THEN AI alignment in the technical sense is mostly trivialized, and you just have the usual human-politics problems plus the problem of preventing anyone else from making superintelligent black box systems.
I would say sure, it seems possible to make a machine that imitates a given human well enough that I couldn’t tell them apart—maybe forever! But just because it’s possible in theory doesn’t mean we are anywhere close to doing it, knowing how to do it, or knowing how to know how to do it.
Maybe an aside: If we could align an AI model to the values of like, my sexist uncle, I’d still say it was an aligned AI. I don’t agree with all my uncle’s values, but he’s like, totally decent. It would be good enough for me to call a model like that “aligned.” I don’t feel like we need to make saints, or even AI models with values that a large number of current or future humans would agree with, to be safe.
just because it’s possible in theory doesn’t mean we are anywhere close to doing it
That’s a good point, but then you have to explain why it would be hard to make a functional digital copy of a human, given that we can make AIs like ChatGPT-o1 that are at 99th-percentile human performance on most short-term tasks. What is the blocker?
Of course, this question can be settled empirically…
It sounds like you’re asking why inner alignment is hard (or maybe why it’s harder than outer alignment?). I’m pretty new here—I don’t think I can explain that any better than the top posts in the tag.
Re: o1, it’s not clear to me that o1 is an instantiation of a creator’s highly specific vision. It seems more to me like we tried something, didn’t know exactly where it would end up, but it sure is nice that it ended up in a useful place. It wasn’t planned in advance exactly what o1 would be good at/bad at, and to what extent—the way that if you were copying a human, you’d have to be way more careful to consider and copy a lot of details.
I don’t think “inner alignment” is applicable here.
If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn’t matter what is going on inside.
If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn’t matter what is going on inside.
Right, I agree on that. The problem is, “behaves indistinguishably” for how long? You can’t guarantee that it won’t stop acting that way in the future, which is exactly what deceptive alignment predicts it will do.
playing word games on the “Turing test” concept does not meaningfully add
It’s not a word game; it’s a theorem based on a set of assumptions.
There is still the in-practice question of how you construct a functional digital copy of a human. But imagine trying to write a book about mechanics using the term “center of mass” and having people object to you because “the real center of mass doesn’t exist until you tell me how to measure it exactly for the specific pile of materials I have right here!” You have to have the concept.
The whole point of a “test” is that it’s something you do before it matters.
No, this is not something you ‘do’. It’s a purely mathematical criterion, like ‘the center of mass of a building’ or ‘Planck’s constant’.
A given AI either does or does not possess the quality of statistically passing for a particular human. If it doesn’t under one circumstance, then it doesn’t satisfy that criterion.
You seem to be saying “the true Turing test is whether the AI kills us after we give it the chance, because this distinguishes it from a human”.
Which essentially means you’re saying “aligned AI = aligned AI”.
No, because a human might also kill you when you give them the chance. To pass the strong-form Turing Test, it would have to make the same decision (probabilistically: have the same probability of doing it).
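To spell out the criterion being defended here (my formalization, not wording from the thread): write A for the AI, H for the particular human it is supposed to pass for, C for the set of all circumstances (including ones where nobody is looking), and B for the set of possible behaviors. The strong-form requirement is then

$$\forall c \in C,\ \forall b \in B:\quad \Pr[A(c) = b] = \Pr[H(c) = b],$$

and a mismatch under even a single circumstance c means the criterion is not satisfied.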
Then of what use is the test? Of what use is this concept?
It is useful because, from human history, we know what kinds of outcomes happen when we put millions of humans together, so knowing whether an AI will emulate human behavior under all circumstances tells us what kinds of outcomes to expect.