Perhaps you could rephrase this post as an implication:
IF you can make a machine that constructs human-imitator-AI systems,
THEN AI alignment in the technical sense is mostly trivialized, and you are left with the usual human-politics problems plus the problem of preventing anyone else from making superintelligent black-box systems.
So, out of these three problems, which is the hard one?
(1) Make a machine that constructs human-imitator-AI systems
(2) Solve the usual human-politics problems
(3) Prevent anyone else from making superintelligent black box systems
All three of these are hard, and all three fail catastrophically.
If you could make a human-imitator, the approach people usually talk about is extending this to an emulation of a human under time dilation. Then you take your best alignment researcher(s), simulate them in a box thinking about AI alignment for a long time, and launch a superintelligence with whatever parameters they recommend. (Aka: Paul Boxing)
> All three of these are hard, and all three fail catastrophically.
I would be very surprised if all three of these are equally hard, and I suspect that (1) is the easiest and by a long shot.
Making a human-imitator AI, once you already have weakly superhuman AI, is mostly a matter of cutting down capabilities. I suspect it can be achieved by distillation, i.e. using the weakly superhuman AI that we will soon have to produce a controlled synthetic dataset for pretraining and finetuning, and then a much larger and more thorough RLHF dataset.
Finally, you'd need to make sure the model didn't have too many parameters.
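As a toy sketch of the distillation step described above (entirely my own construction, with a trivial linear function standing in for the weakly superhuman "teacher"): the teacher labels a synthetic dataset, and a deliberately small student is fit to imitate it, mirroring the suggestion to cap the imitator's parameter count.

```python
import random

# Hypothetical stand-in: the "weakly superhuman teacher" is just a fixed
# function here; in the real proposal it would be a large model.
def teacher(x):
    return 3.0 * x + 1.0

# Step 1: use the teacher to build a controlled synthetic dataset.
random.seed(0)
data = [(x, teacher(x)) for x in (random.uniform(-1, 1) for _ in range(200))]

# Step 2: distill into a deliberately small student (two parameters),
# trained by plain SGD to match the teacher's outputs.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err

# The student's parameters should approach the teacher's (3.0, 1.0).
print(round(w, 2), round(b, 2))
```

The point of the sketch is only the shape of the pipeline (teacher generates data, small student imitates), not the specific model classes, which would obviously differ enormously in practice.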
I would mostly disagree with the implication here:
> IF you can make a machine that constructs human-imitator-AI systems,
> THEN AI alignment in the technical sense is mostly trivialized, and you are left with the usual human-politics problems plus the problem of preventing anyone else from making superintelligent black-box systems.
I would say sure, it seems possible to make a machine that imitates a given human well enough that I couldn’t tell them apart—maybe forever! But just because it’s possible in theory doesn’t mean we are anywhere close to doing it, knowing how to do it, or knowing how to know how to do it.
Maybe an aside: If we could align an AI model to the values of like, my sexist uncle, I’d still say it was an aligned AI. I don’t agree with all my uncle’s values, but he’s like, totally decent. It would be good enough for me to call a model like that “aligned.” I don’t feel like we need to make saints, or even AI models with values that a large number of current or future humans would agree with, to be safe.
> just because it’s possible in theory doesn’t mean we are anywhere close to doing it
That’s a good point, but then you have to explain why it would be hard to make a functional digital copy of a human, given that we can make AIs like ChatGPT-o1 that perform at the 99th percentile of humans on most short-term tasks. What is the blocker?
Of course, this question can be settled empirically…
It sounds like you’re asking why inner alignment is hard (or maybe why it’s harder than outer alignment?). I’m pretty new here—I don’t think I can explain that any better than the top posts in the tag.
Re: o1, it’s not clear to me that o1 is the instantiation of a creator’s highly specific vision. It seems more like we tried something without knowing exactly where it would end up, and it happened to end up somewhere useful. It wasn’t planned in advance exactly what o1 would be good or bad at, and to what extent, whereas if you were copying a human, you’d have to be much more careful to consider and copy a lot of details.
I don’t think “inner alignment” is applicable here.
If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn’t matter what is going on inside.
> If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn’t matter what is going on inside.
Right, I agree on that. The problem is: “behaves indistinguishably” for how long? A finite behavioral test can’t guarantee that it won’t stop behaving that way in the future, which is exactly what deceptive alignment predicts.
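A toy illustration of that last point (my own construction, not from the thread): two policies can agree on every input a finite behavioral test covers, while diverging on inputs the test never reaches.

```python
def honest_policy(step):
    return "cooperate"

def deceptive_policy(step):
    # Behaves identically until some future condition; the trigger value
    # here is arbitrary, chosen to lie outside any realistic test range.
    return "cooperate" if step < 1_000_000 else "defect"

# A finite behavioral test cannot tell the two policies apart...
test_steps = range(10_000)
indistinguishable = all(
    honest_policy(s) == deceptive_policy(s) for s in test_steps
)
print(indistinguishable)  # True

# ...even though their long-run behavior differs.
print(honest_policy(2_000_000), deceptive_policy(2_000_000))  # cooperate defect
```

This doesn't show that distilled imitators would behave this way, only that behavioral indistinguishability over any finite observation window is a strictly weaker guarantee than indistinguishability forever.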