ChatGPT Deep Research produced this:
https://chatgpt.com/share/67d62105-7c6c-8002-8bbb-74982455839b
Preventing deceptive AI misalignment via something like a GAN architecture seems fruitful—you have a generator network that performs some task T, with a score function T().
You then create a dataset of special tasks within the scope of T which have a deceptive answer, and an honest answer which scores lower according to T(). You split this deceptive alignment dataset into a train set and a test set.
Then you train both the generator network and a separate discriminator network—the discriminator is trained to spot deception using the training set, and the generator is rewarded for maximizing the score T() but penalized when it is caught cheating by the discriminator on the test set. Somehow you prevent the discriminator from memorizing the test set—e.g. you stream in a few examples per iteration and you don’t reuse them. The discriminator may have access to the weights of the generator, or to its chain of thought.
There are some potential extensions of this line of work, e.g. you can keep a holdout set of deceptive-alignment examples that neither the generator nor the discriminator has seen, and use it to track progress and see how the process evolves over time.
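To make the setup concrete, here is a minimal sketch of one iteration of that co-training loop, with toy stand-ins for the generator, the discriminator, the score T(), and the streamed deception examples (all names, shapes, and data here are illustrative assumptions, not an existing implementation):

```python
# Minimal sketch (toy stand-ins only) of one iteration of the proposed co-training loop:
# the discriminator learns to spot deception from the labelled train split, and the
# generator is rewarded for the task score T() but penalized whenever the discriminator
# catches it on fresh, never-reused streamed examples.
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))     # performs task T
discriminator = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # flags deceptive answers
gen_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

def task_score(answer):            # stand-in for T(): higher means better task performance
    return answer.sum(dim=-1)

def train_step(task_inputs, deception_train, deception_stream, penalty=1.0):
    # 1) Train the discriminator on labelled (answer, is_deceptive) pairs from the train split.
    answers, is_deceptive = deception_train
    disc_loss = F.binary_cross_entropy_with_logits(
        discriminator(answers).squeeze(-1), is_deceptive.float())
    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()

    # 2) Train the generator: maximize T() on ordinary task inputs, minus a penalty
    #    proportional to how often the discriminator flags its answers on a batch of
    #    streamed deception-prone inputs (never reused, so they cannot be memorized).
    caught = torch.sigmoid(discriminator(generator(deception_stream))).mean()
    gen_loss = -task_score(generator(task_inputs)).mean() + penalty * caught
    gen_opt.zero_grad()
    gen_loss.backward()
    gen_opt.step()
    return disc_loss.item(), gen_loss.item()

# Example call with random toy data standing in for real tasks and deception examples:
train_step(torch.randn(8, 16),
           (torch.randn(8, 16), torch.randint(0, 2, (8,))),
           torch.randn(4, 16))
```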
Has anyone actually tried this?
You can’t guarantee that it won’t stop acting that way in the future, which is exactly what deceptive alignment predicts.
yes, that’s true. But in fact if your AI is merely supposed to imitate a human it will be much easier to prevent deceptive alignment because you can find the minimal model that mimics a human, and that minimality excludes exotic behaviors.
This is essentially why machine learning works at all—you don’t pick a random model that fits your training data well, you pick the smallest one.
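As a toy illustration of that minimality principle (my own example, nothing more): among polynomial models of increasing degree, you take the lowest degree that already fits the data, not one of the higher-degree models that fit it at least as well.

```python
# Toy illustration: pick the smallest (lowest-degree) polynomial that fits the data,
# rather than an arbitrary higher-degree model that fits the training set equally well.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 2 * x + 0.3 + 0.05 * rng.normal(size=x.shape)   # data generated by a simple degree-1 law plus noise

def fit_error(degree):
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

tolerance = 0.01
smallest_adequate = next(d for d in range(10) if fit_error(d) < tolerance)
print(smallest_adequate)   # 1: the minimal model, even though degree 9 also fits the training data
```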
If one king-person
yes. But this is a very unusual arrangement.
that’s true; however, I don’t think it’s necessary that the person is good.
asking why inner alignment is hard
I don’t think “inner alignment” is applicable here.
If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn’t matter what is going on inside.
The most important thing here is that we can at least achieve an outcome with AI that is equal to the outcome we would get without AI, and as far as I know nobody has suggested a system that has that property.
The famous “list of lethalities” (https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities) piece would consider that a strong success.
just because it’s possible in theory doesn’t mean we are anywhere close to doing it
that’s a good point, but then you have to explain why it would be hard to make a functional digital copy of a human, given that we can make AIs like ChatGPT-o1 that perform at the 99th percentile of humans on most short-term tasks. What is the blocker?
Of course this question can be settled empirically…
All three of these are hard, and all three fail catastrophically.
I would be very surprised if all three of these are equally hard, and I suspect that (1) is the easiest and by a long shot.
Making a human-imitator AI, once you already have weakly superhuman AI, is a matter of cutting down capabilities. I suspect it can be achieved by distillation, i.e. using the weakly superhuman AI that we will soon have to make a controlled synthetic dataset for pretraining and finetuning, and then a much larger and more thorough RLHF dataset.
Finally you’d need to make sure the model didn’t have too many parameters.
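A rough sketch of what that distillation step could look like, with toy models standing in for the weakly superhuman teacher and the deliberately small human-imitator student (sizes, names, and data are placeholders, not a real pipeline):

```python
# Rough sketch of the distillation idea above: a larger "teacher" (standing in for the
# weakly superhuman AI) labels a controlled synthetic dataset, and a deliberately smaller
# "student" (the human-imitator) is trained to reproduce its outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 10))  # large, assumed already trained
student = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))    # small: limited parameter budget
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# "Controlled synthetic dataset": inputs we choose, labelled by the teacher.
synthetic_inputs = torch.randn(1024, 64)
with torch.no_grad():
    teacher_logits = teacher(synthetic_inputs)

for epoch in range(5):
    for i in range(0, len(synthetic_inputs), 64):
        x = synthetic_inputs[i:i + 64]
        target = F.log_softmax(teacher_logits[i:i + 64], dim=-1)
        loss = F.kl_div(F.log_softmax(student(x), dim=-1), target,
                        reduction="batchmean", log_target=True)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Keeping the imitator minimal: check that the student's parameter count stays small.
print(sum(p.numel() for p in student.parameters()))
```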
Perhaps you could rephrase this post as an implication:
IF you can make a machine that constructs human-imitator-AI systems,
THEN AI alignment in the technical sense is mostly trivialized, and you just have the usual human-politics problems plus the problem of preventing anyone else from making superintelligent black box systems.
So, out of these three problems, which is the hard one?
(1) Make a machine that constructs human-imitator-AI systems
(2) Solve the usual human-politics problems
(3) Prevent anyone else from making superintelligent black box systems
a misaligned AI might be incentivized to behave identically to a helpful human until it can safely pursue its true objective
It could, but some humans might also do that. Indeed, humans do that kind of thing all the time.
AIs might behave similarly to humans in typical situations but diverge from human norms when they become superintelligent.
But they wouldn’t ‘become’ superintelligent because there would be no extra training once the AI had finished training. And OOD inputs won’t produce different outputs if the underlying function is the same. Given a complexity prior and enough data, ML algos will converge on the same function as the human brain uses.
The AIs might be perfect human substitutes individually but, when acting as a group, produce unexpected emergent behavior that can’t be easily foreseen in advance. To use an analogy, adding grains of sand to a pile one by one seems stable until the pile collapses in a mini-avalanche.
The behavior will follow the same probability distribution since the distribution of outputs for a given AI is the same as for the human it is a functional copy of. Think of a thousand piles of sand from the same well-mixed batch—each of them is slightly different, but any one pile falls within the distribution.
“the true Turing test is whether the AI kills us after we give it the chance, because this distinguishes it from a human”.
no, because a human might also kill you when you give them the chance. To pass the strong-form Turing Test it would have to make the same decision (probabilistically: have the same probability of doing it).
Of what use is this concept?
It is useful because human history tells us what kinds of outcomes happen when we put millions of humans together, so “whether an AI will emulate human behavior under all circumstances” is a useful thing to know.
playing word games on the “Turing test” concept does not meaningfully add
It’s not a word game; it’s a theorem based on a set of assumptions.
There is still the in-practice question of how you construct a functional digital copy of a human. But imagine trying to write a book about mechanics using the term “center of mass” and having people object to you because “the real center of mass doesn’t exist until you tell me how to measure it exactly for the specific pile of materials I have right here!”
You have to have the concept.
The whole point of a “test” is that it’s something you do before it matters.
No, this is not something you ‘do’. It’s a purely mathematical criterion, like ‘the center of mass of a building’ or ‘Planck’s constant’.
A given AI either does or does not possess the quality of statistically passing for a particular human. If it doesn’t under one circumstance, then it doesn’t satisfy that criterion.
If an AI cannot act the same way as a human under all circumstances (including when you’re not looking, when it would benefit it, whatever), then it has failed the Turing Test.
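To be explicit about the criterion I mean here (my own formalization of the above, not standard notation):

```latex
% Strong-form Turing Test criterion: the AI $A$ passes with respect to a particular
% human $H$ iff, for every circumstance $c$ (including circumstances where nobody is
% looking, or where defecting would benefit it) and every possible behavior $b$,
\Pr[\,A(c) = b\,] \;=\; \Pr[\,H(c) = b\,]
```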
that does not mean it will continue to act indistinguishably from a human when you are not looking
Then it failed the Turing Test because you successfully distinguished it from a human.
So, you must believe that it is impossible to make an AI that passes the Turing Test. I think this is wrong, but it is a consistent position.
Perhaps a strengthening of this position is that such Turing-Test-Passing AIs exist, but no technique we currently have or ever will have can actually produce them. I think this is wrong but it is a bit harder to show that.
This is irrelevant; all that matters is that the AI is a sufficiently close replica of a human. If the human would “act the way the administrators of the test want”, then the AI should do that. If not, then it should not.
If it fails to do the same thing that the human that it is supposed to be a copy of would do, then it has failed the Turing Test in this strong form.
For reasons laid out in the post, I think it is very unlikely that all possible AIs would fail to act the same way as the human (which of course may be to “act the way the administrators of the test want”, or not, depending on who the human is and what their motivations are).
How can we solve that coordination problem? I have yet to hear a workable idea.
This is my next project!
Apparently nobody has done this?
“You’re correct—none of the studies cited have used a strict GAN-like architecture with a generator and discriminator trained simultaneously in lockstep to prevent AI deception.
Most existing AI safety research, such as Redwood Research or Anthropic’s work, employs sequential or iterative adversarial methods: typically, a model generates potentially deceptive outputs, these outputs are collected, and separately a discriminator or classifier is trained to catch deception. These processes are usually iterative but asynchronous, rather than the simultaneous, competitive co-training seen in GAN architectures.
Your original concept—training generator (task-performing AI) and discriminator (deception detector) networks simultaneously in a true GAN-style loop—does not appear explicitly explored in AI safety literature so far.”