Jozdien comments on The Waluigi Effect (mega-post)

Jozdien 3 Mar 2023 8:16 UTC
10 points
7
There is an advantage here in that you don’t need to pay for translation from an alien ontology: the process by which you simulate characters having beliefs that lead to outputs should remain mostly the same. You would need to specify a simulacrum that is honest, though, which is pretty difficult and isomorphic to ELK in the fully general case of arbitrary simulacra. But the space is inherently trope-weighted. Simulating humans who are honest about their beliefs should be a lot easier (though plausibly still not easy in absolute terms), because humans are often honest. Simulating honest superintelligent assistants or whatever should be near ELK-difficult, because there you don’t get the advantage of the prior’s specification doing a lot of the work for you.
Related, somewhat.
- leogao 3 Mar 2023 8:40 UTC
  3 points
  0
  You don’t need to pay for translation to simulate human-level characters, because that’s just learning the human simulator. You do need to pay for translation to access superhuman behavior (which is the case ELK is focused on).
  - Jozdien 3 Mar 2023 10:57 UTC
    3 points
    0
    Yeah, but the reasons for the two seem slightly different: in the case of simulators, the difficulty comes from the training data not trope-weighting superintelligences as honest. You could easily have a world where ELK is still hard but simulating honest superintelligences isn’t.
    - leogao 3 Mar 2023 18:38 UTC
      3 points
      0
      I think the problems are roughly equivalent. Creating training data that trope-weights superintelligences as honest requires you to access sufficiently superhuman behavior, and you can’t just elide the demonstration of superhumanness, because that just puts it in the category of simulacra that merely profess to be superhuman.
      - Jozdien 3 Mar 2023 21:20 UTC
        2 points
        0
        I think the relevant question is what properties would be associated with superintelligences drawn from the prior. We don’t really have a lot of training data associated with superhuman behaviour on general tasks, yet we can probably draw it out through powerful interpolation. So properties associated with that behaviour would also have to be sampled from the human prior of what superintelligences are like. If we lived in a world where superintelligences were universally described as honest, why would that not have the same effect as humans being described as honest, which is what makes sampling honest humans easy?