You could try to infer human values from the “sideload” using my “Conjecture 5” about the AIT definition of goal-directed intelligence. However, since it’s not an upload and, as you said, it can go off-distribution, that doesn’t seem very safe. More generally, alignment protocols should never be open-loop.
I’m also skeptical about IDA, for reasons not specific to your question (in particular, this), but making it open-loop is worse.
Gurkenglas’ answer seems to me like something that could work, if we can somehow be sure the sideload doesn’t become superintelligent, for example because of an imitation plateau.