Or you might pretend to be good, and push on us in nefarious ways we wouldn’t even notice so that we end up saying things like “Go Eliezer”.
How would you tell the difference between him pretending to be good and being good, whatever you mean by good? (This is the basic question of rationality.)
That would be very hard to do in a public forum where he could read our methods of distinguishing between him pretending to be good and actually being good. You could try to figure out what unconscious signals of “goodness” are, but again that’s hard through text, where the person in question knows what you’re testing for and can optimize the writing to score well on the test.
Or you could just go with your priors. Or message privately, though that runs into the problem of sampling people who are likely to agree with you.
In principle, you can’t.
A rational evil would pretend to be good until it didn’t matter anymore.