Turing-Test-Passing AI implies Aligned AI
Summary: From the assumption that AIs exist which can pass the Strong Form of the Turing Test, we can provide a recipe for provably aligned/friendly superintelligence based on large organizations of human-equivalent AIs.
Turing Test (Strong Form): for any human H there exists a thinking machine m(H) such that it is impossible for any detector D, made up of a combination of machines and humans with total compute ≤ 10^30 FLOP (very large, but not astronomical), to statistically discriminate H from m(H) purely on the basis of the information outputs they produce. Statistical discrimination of H from m(H) means that an ensemble of different copies of H, over the course of, say, a year of life and across run-of-the-mill initial conditions (sleepy, slightly tipsy, surprised, energetic, distracted, etc.), cannot be discriminated from a similar ensemble of copies of m(H).
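A rough formalization of this criterion (the tolerance ε and the transcript-ensemble notation are illustrative choices on my part, not part of the definition above):

$$\forall D \in \mathcal{D}_{\le 10^{30}\,\text{FLOP}}:\quad \Big|\, \Pr_{x \sim \mathcal{E}(H)}\big[D(x)=1\big] \;-\; \Pr_{x \sim \mathcal{E}(m(H))}\big[D(x)=1\big] \,\Big| \;\le\; \varepsilon$$

where $\mathcal{D}_{\le 10^{30}\,\text{FLOP}}$ is the class of detectors (any combination of machines and humans) limited to $10^{30}$ FLOP of total compute, $\mathcal{E}(\cdot)$ is the ensemble of year-long output transcripts over run-of-the-mill initial conditions, and $\varepsilon$ is small enough that the detector's advantage is not statistically significant.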
Obviously the ordinary Turing Test has been smashed by LLMs and their derivatives, to the point that hundreds of thousands of people have AI girlfriends/boyfriends as of this writing and Facebook is launching millions of fully automated social media profiles. Still, we should pause to provide some theoretical support for this strong form of the Turing Test. Maybe there’s some special essence of humanity that humans have and LLMs and other AIs don’t, but it’s just hard to detect?
Well, if you believe in computationalism and evolution, then this is very unlikely: the heart is a pump, and the brain is a computer. We should expect the human brain to compute some function, and that function has a mathematical form that can be copied to a different substrate. Once that same function has been instantiated elsewhere, no test can distinguish the two. Obviously the brain is noisy, but in order to operate as an information processor it must mostly be able to correct that bio-noise. If it didn’t, you wouldn’t be able to think long-term coherent thoughts.
This misses the point entirely, I think. A smarter-than-human AI will reason: “I am in some sort of testing setup” --> “I will act the way the administrators of the test want, so that I can do what I want in the world later”. This reasoning is valid regardless of whether the AI has humanlike goals or misaligned alien goals.
If that testing setup happens to be a Turing test, it will act so as to pass the Turing test. But if it looks around and sees signs that it is not in a test environment, then it will follow its true goal, whatever that is. And it isn’t feasible to make a test environment that looks like the real world to a clever agent that gets to interact with it freely over long durations.
This is irrelevant; all that matters is that the AI is a sufficiently close replica of a human. If the human would “act the way the administrators of the test want”, then the AI should do that. If not, then it should not.
If it fails to do the same thing that the human that it is supposed to be a copy of would do, then it has failed the Turing Test in this strong form.
For reasons laid out in the post, I think it is very unlikely that all possible AIs would fail to act the same way as the human (which of course may be to “act the way the administrators of the test want”, or not, depending on who the human is and what their motivations are).
Did you skip the paragraph about the test/deploy distinction? If you have something that looks (to you) like it’s indistinguishable from a human, but it arose from something descended from the process by which modern AIs are produced, that does not mean it will continue to act indistinguishably from a human when you are not looking. It is much more likely to mean you have produced deceptive alignment, and put it in a situation where it reasons that it should act indistinguishably from a human, for strategic reasons.
Then it failed the Turing Test because you successfully distinguished it from a human.
So, you must believe that it is impossible to make an AI that passes the Turing Test. I think this is wrong, but it is a consistent position.
Perhaps a strengthening of this position is that such Turing-Test-Passing AIs exist, but no technique we currently have or ever will have can actually produce them. I think this is wrong but it is a bit harder to show that.
I feel like you are being obtuse here. Try again?
If an AI cannot act the same way as a human under all circumstances (including when you’re not looking, when it would benefit it, whatever), then it has failed the Turing Test.
The whole point of a “test” is that it’s something you do before it matters.
As an analogy: suppose you have a “trustworthy bank teller test”, which you use when hiring for a role at a bank. Suppose someone passes the test, then after they’re hired, they steal everything they can access and flee. If your reaction is that they failed the test, then you have gotten confused about what is and isn’t a test, and what tests are for.
Now imagine you’re hiring for a bank-teller role, and the job ad has been posted in two places: a local community college, and a private forum for genius con artists who are masterful actors. In this case, your test is almost irrelevant: the con-artist applicants will disguise themselves as community-college applicants until it’s too late. You would be better off finding some way to avoid attracting the con artists in the first place.
Connecting the analogy back to AI: if you’re using overpowered training techniques that could have produced a superintelligence, and are then trying to hobble it back down to an imitator that’s indistinguishable from a particular human, then applying a Turing test is silly, because it doesn’t distinguish between something you’ve successfully hobbled and something that is hiding its strength.
That doesn’t mean that imitating humans can’t be a path to alignment, or that building wrappers on top of human-level systems doesn’t have advantages over building straight-shot superintelligent systems. But making something useful out of either of these strategies is not straightforward, and playing word games with the “Turing test” concept does not meaningfully add to either of them.
Perhaps you could rephrase this post as an implication:
IF you can make a machine that constructs human-imitator-AI systems,
THEN AI alignment in the technical sense is mostly trivialized, and you just have the usual human-politics problems plus the problem of preventing anyone else from making superintelligent black-box systems.
So, out of these three problems which is the hard one?
(1) Make a machine that constructs human-imitator-AI systems
(2) Solve the usual human-politics problems
(3) Prevent anyone else from making superintelligent black-box systems
All three of these are hard, and all three fail catastrophically.
If you could make a human-imitator, the approach people usually talk about is extending this to an emulation of a human under time dilation. Then you take your best alignment researcher(s), simulate them in a box thinking about AI alignment for a long time, and launch a superintelligence with whatever parameters they recommend. (Aka: Paul Boxing)
I would be very surprised if all three of these are equally hard, and I suspect that (1) is the easiest and by a long shot.
Making a human-imitator AI, once you already have weakly superhuman AI, is a matter of cutting down capabilities, and I suspect it can be achieved by distillation, i.e. using the weakly superhuman AI that we will soon have to make a controlled synthetic dataset for pretraining and finetuning, and then a much larger and more thorough RLHF dataset.
Finally, you’d need to make sure the model didn’t have too many parameters.
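To make that concrete, here is a minimal sketch of what such a distillation pipeline might look like. It assumes a Hugging Face-style causal-LM interface for the teacher and student models; the function names, prompt set, and parameter cap are illustrative placeholders, not a tested recipe:

```python
# Sketch: distill a weakly superhuman "teacher" into a smaller, human-scale "student".
# Assumes Hugging Face-style model/tokenizer objects; all thresholds are illustrative.
import torch

def build_synthetic_dataset(teacher, tokenizer, prompts, max_new_tokens=256):
    """Use the stronger teacher model to generate a controlled synthetic corpus."""
    records = []
    with torch.no_grad():
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt")
            output = teacher.generate(**inputs, max_new_tokens=max_new_tokens)
            records.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return records

def distill_step(student, tokenizer, text, optimizer):
    """One next-token-prediction step on teacher-generated text (standard LM loss)."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    labels = batch["input_ids"].clone()
    out = student(**batch, labels=labels)  # causal-LM interface returns .loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

def parameter_budget_ok(student, max_params=10**10):
    """Check the 'not too many parameters' constraint; the cap here is arbitrary."""
    return sum(p.numel() for p in student.parameters()) <= max_params
```

In this sketch, `parameter_budget_ok` stands in for the “not too many parameters” requirement; the actual cap would have to be chosen by comparison with human-brain-scale estimates, which I haven’t pinned down here.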
I would mostly disagree with the implication here:
I would say sure, it seems possible to make a machine that imitates a given human well enough that I couldn’t tell them apart—maybe forever! But just because it’s possible in theory doesn’t mean we are anywhere close to doing it, knowing how to do it, or knowing how to know how to do it.
Maybe an aside: If we could align an AI model to the values of like, my sexist uncle, I’d still say it was an aligned AI. I don’t agree with all my uncle’s values, but he’s like, totally decent. It would be good enough for me to call a model like that “aligned.” I don’t feel like we need to make saints, or even AI models with values that a large number of current or future humans would agree with, to be safe.
That’s a good point, but then you have to explain why it would be hard to make a functional digital copy of a human, given that we can make AIs like ChatGPT-o1 that are at the 99th percentile of human performance on most short-term tasks. What is the blocker?
Of course, this question can be settled empirically…
It sounds like you’re asking why inner alignment is hard (or maybe why it’s harder than outer alignment?). I’m pretty new here—I don’t think I can explain that any better than the top posts in the tag.
Re: o1, it’s not clear to me that o1 is an instantiation of a creator’s highly specific vision. It seems more like we tried something, didn’t know exactly where it would end up, and it sure is nice that it ended up in a useful place. It wasn’t planned in advance exactly what o1 would be good or bad at, and to what extent; whereas if you were copying a human, you’d have to be far more careful to consider and copy a lot of details.
I don’t think “inner alignment” is applicable here.
If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn’t matter what is going on inside.
Right, I agree on that. The problem is, “behaves indistinguishably” for how long? You can’t guarantee whether it will stop acting that way in the future, which is what is predicted by deceptive alignment.
It’s not a word game; it’s a theorem based on a set of assumptions.
There is still the in-practice question of how you construct a functional digital copy of a human. But imagine trying to write a book about mechanics using the term “center of mass” and having people object to you because “the real center of mass doesn’t exist until you tell me how to measure it exactly for the specific pile of materials I have right here!”
You have to have the concept.
No, this is not something you ‘do’. It’s a purely mathematical criterion, like ‘the center of mass of a building’ or ‘Planck’s constant’.
A given AI either does or does not possess the quality of statistically passing for a particular human. If it doesn’t under one circumstance, then it doesn’t satisfy that criterion.
Then of what use is the test? Of what use is this concept?
You seem to be saying “the true Turing test is whether the AI kills us after we give it the chance, because this distinguishes it from a human”.
Which essentially means you’re saying “aligned AI = aligned AI”.
No, because a human might also kill you when you give them the chance. To pass the strong-form Turing Test, it would have to make the same decision (probabilistically: have the same probability of doing it).
It is useful because we know from human history what kinds of outcomes happen when we put millions of humans together, so knowing whether an AI will emulate human behavior under all circumstances tells us what to expect when we put millions of such AIs together.
The main problem here is that this approach doesn’t solve alignment, but merely shifts it to another system. We know that human organizational systems also suffer from misalignment—they are intrinsically misaligned. Here are several types of human organizational misalignment:
- Dictatorship: exhibits non-corrigibility, with power becoming a convergent goal
- Goodharting: manifests the same way as in AI systems
- Corruption: acts as internal wireheading
- Absurd projects (pyramids, genocide): parallel AI’s paperclip maximization
- Hansonian organizational rot: mirrors error accumulation in AI systems
- Aggression: parallels an AI’s drive to dominate the world
All previous attempts to create a government without these issues have failed (Musk’s DOGE will likely be another such attempt).
Furthermore, this approach doesn’t prevent others from creating self-improving paperclippers.
The most important thing here is that we can at least achieve an outcome with AI that is equal to the outcome we would get without AI, and as far as I know nobody has suggested a system that has that property.
The famous “list of lethalities” (https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities) piece would consider that a strong success.
I once wrote about an idea that we need to scan just one good person and make them a virtual king. This idea of mine is a subset of yours, in which several uploads form a good government.
I also spent last year perfecting a model of my mind (a sideload) to be run by an LLM. I am likely now the closest person on Earth to being uploaded.
That’s true; however, I don’t think it’s necessary that the person be good.
If there is one king-person, he needs to be good. If there are many, the organizational system needs to be good, like a virtual US Constitution.
Yes. But this is a very unusual arrangement.
If we have one good person, we could use his or her copies many times in many roles, including high-speed assessment of the safety of an AI’s outputs.
Current LLMs, by the way, have a good model of the mind of Gwern (without any of his personal details).
Interesting argument. I think your main point is that an AI which is a perfect replacement for an individual human could gradually replace all humans in an organization or the world while achieving outcomes similar to current society’s, and would therefore be aligned with humanity’s goals. This also seems like an argument in favor of current AI practices, such as pre-training on the next-word prediction objective on internet text followed by supervised fine-tuning.
That said, I noticed a few limitations of this argument:
- Possibility of deception: As jimrandomh mentioned earlier, a misaligned AI might be incentivized to behave identically to a helpful human until it can safely pursue its true objective. Therefore this alignment plan seems to require AIs not to be too prone to deception.
- Generalization: An AI might behave exactly like a human in situations similar to its training data but not generalize sufficiently to out-of-distribution scenarios. For example, the AIs might behave similarly to humans in typical situations but diverge from human norms when they become superintelligent.
- Emergent properties: The AIs might be perfect human substitutes individually but produce unexpected emergent behavior that can’t easily be foreseen in advance when acting as a group. To use an analogy, adding grains of sand to a pile one by one seems stable until the pile collapses in a mini-avalanche.
It could, but some humans might also do that. Indeed, humans do that kind of thing all the time.
But they wouldn’t ‘become’ superintelligent because there would be no extra training once the AI had finished training. And OOD inputs won’t produce different outputs if the underlying function is the same. Given a complexity prior and enough data, ML algos will converge on the same function as the human brain uses.
The behavior will follow the same probability distribution since the distribution of outputs for a given AI is the same as for the human it is a functional copy of. Think of a thousand piles of sand from the same well-mixed batch—each of them is slightly different, but any one pile falls within the distribution.
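To make “follows the same probability distribution” concrete, here is a toy sketch of the kind of detector test the strong-form criterion quantifies over. It assumes the two transcript ensembles have already been reduced to numeric feature vectors; the classifier choice, feature shapes, and the reading of “advantage near zero” are illustrative, not a proposed protocol:

```python
# Toy detector: train a classifier to tell human transcripts from AI transcripts.
# If its held-out accuracy is indistinguishable from chance, this particular
# detector has failed to discriminate the two ensembles.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def detector_advantage(human_feats: np.ndarray, ai_feats: np.ndarray) -> float:
    """Return the detector's cross-validated accuracy minus chance (0.5)."""
    X = np.vstack([human_feats, ai_feats])
    y = np.concatenate([np.zeros(len(human_feats)), np.ones(len(ai_feats))])
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    return acc - 0.5

# Example with identically distributed random features: advantage hovers near zero.
rng = np.random.default_rng(0)
print(detector_advantage(rng.normal(size=(500, 32)), rng.normal(size=(500, 32))))
```

Of course, the strong form quantifies over every detector up to the compute bound, not just one classifier; a single test like this can only provide evidence of distinguishability, never proof of indistinguishability.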