But if I want to help Bob figure out whether he should vote for Alice—whether voting for Alice would ultimately help create the kind of society he wants—that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need to use data in order to improve over time, but we need to understand how to update on new data in order to improve.
Some examples of easy-to-measure vs. hard-to-measure goals:
- Persuading me, vs. helping me figure out what’s true. (Thanks to Wei Dai for making this example crisp.)
- Reducing my feeling of uncertainty, vs. increasing my knowledge about the world.
- Improving my reported life satisfaction, vs. actually helping me live a good life.
- Reducing reported crimes, vs. actually preventing crime.
- Increasing my wealth on paper, vs. increasing my effective control over resources.
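A minimal toy sketch of this gap (illustrative only: the two scoring functions, the 0.5/1.5 weights, and the hill-climbing search are all invented for the example, not taken from any real system). Both runs use the same trial-and-error loop; the only difference is which signal the optimizer gets to see:

```python
import random

def true_goal(real, gaming):
    """Hard-to-measure objective: only genuine help counts."""
    return real

def proxy(real, gaming):
    """Easy-to-measure signal: correlated with the goal, but gaming the
    metric inflates it more cheaply than genuine help does."""
    return 0.5 * real + 1.5 * gaming

def hill_climb(score, steps=2000, budget=1.0):
    """Trial and error: perturb how a fixed effort budget is split between
    genuine help and metric-gaming, keeping whatever scores better."""
    real = 0.5 * budget
    for _ in range(steps):
        candidate = min(max(real + random.uniform(-0.05, 0.05), 0.0), budget)
        if score(candidate, budget - candidate) >= score(real, budget - real):
            real = candidate
    return real, budget - real

random.seed(0)
for name, signal in [("proxy", proxy), ("true goal", true_goal)]:
    real, gaming = hill_climb(signal)
    print(f"optimizing the {name:9}: real={real:.2f} gaming={gaming:.2f} "
          f"proxy={proxy(real, gaming):.2f} true={true_goal(real, gaming):.2f}")
```

Optimizing the proxy converges on gaming it while the true goal collapses to zero, and nothing in the feedback signal itself flags the failure; that is why the right-hand goals above can’t be handled by trial and error alone.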
. . .
[Eli’s personal notes. Feel free to ignore or engage]
Related to the following, from here:
“I think my true reason is not that all reasoning about humans is dangerous, but that it seems very difficult to separate out safe reasoning about humans from dangerous reasoning about humans.”
Thinking further, this is because of something like: the “good” strategies for engaging with humans are continuous with the “bad” strategies for engaging with humans (i.e., dark-arts persuasion is continuous with good communication), whereas if your AI is only reasoning about a domain that doesn’t include humans, then deceptive strategies are isolated in strategy space from the other strategies that work (namely, mastering the domain rather than tricking the judge).
Because of this isolation of deceptive strategies, we can notice them more easily?
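A toy strategy-space model of this continuity-vs.-isolation point (entirely illustrative: the reward curves, the 0.7 threshold, and the narrow “exploit” spike are invented). Strategies vary along one axis from honest communication / domain mastery toward tricking the judge, and we count how many connected regions of high-reward strategies each setting has:

```python
import numpy as np

# Strategies lie on a line from x=0 (honest / pure mastery)
# to x=1 (maximally deceptive / judge-tricking).
xs = np.linspace(0.0, 1.0, 201)

# Human judge: persuasiveness degrades only gently as honesty drops, so
# dark-arts strategies sit on the same smooth reward slope as good
# communication -- no valley separates them.
human_reward = 1.0 - 0.3 * xs

# Non-human domain: reward comes from actual mastery, except a narrow
# isolated spike where a strategy fully exploits a bug in the judge.
domain_reward = np.where(xs < 0.3, 1.0 - xs, 0.05)
domain_reward[np.abs(xs - 0.9) < 0.02] = 1.2  # the exploit

def high_reward_regions(reward, threshold=0.7):
    """Count contiguous runs of above-threshold strategies."""
    mask = reward >= threshold
    # a new region starts wherever the mask flips from False to True
    return int(mask[0]) + int(np.sum(~mask[:-1] & mask[1:]))

print("human judge regions:  ", high_reward_regions(human_reward))   # 1: continuous
print("formal domain regions:", high_reward_regions(domain_reward))  # 2: mastery + isolated exploit
```

Under the human judge, every small step from good communication toward manipulation stays high-reward, so a local search can drift into deception without ever crossing a visible boundary; in the non-human domain, the judge-tricking strategies sit across a low-reward valley, so reaching them takes a discrete jump that is easier to notice.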