You could try to infer human values from the “sideload” using my “Conjecture 5” about the AIT definition of goal-directed intelligence. However, since it’s not an upload and, as you said, it can go off-distribution, that doesn’t seem very safe. More generally, alignment protocols should never be open-loop.
I’m also skeptical about IDA, for reasons not specific to your question (in particular, this), but making it open-loop is worse.
Gurkenglas’ answer seems to me like something that could work, if we can somehow be sure the sideload doesn’t become superintelligent, for example because of an imitation plateau.