Yeah, but the reasons for both seem slightly different. In the case of simulators, it's because the training data doesn't trope-weight superintelligences as being honest. You could easily have a world where ELK is still hard but simulating honest superintelligences isn't.
I think the problems are roughly equivalent. Creating training data that trope-weights superintelligences as honest requires you to access sufficiently superhuman behavior, and you can't just elide the demonstration of superhumanness, because that just puts it in the category of simulacra that merely profess to be superhuman.
I think the relevant question is: what properties would be associated with superintelligences drawn from the prior? We don't really have much training data of superhuman behaviour on general tasks, yet we can probably draw it out of powerful interpolation. So the properties associated with that behaviour would also have to be sampled from the human prior over what superintelligences are like. And if we lived in a world where superintelligences were universally described as honest, why wouldn't that have the same effect as the world where humans being described as honest makes sampling honest humans easy?