I just want to flag that this approach seems to assume that—before we build the Oracle—we design the Oracle (or the procedure that produces it) such that it will assign a prior of zero to the second type of world.
If we use some arbitrary scaled-up supervised learning process to train a model that does well on general question answering, we can't safely sidestep the malign prior problem just by providing information/instructions about the prior as part of the question. The simulations that distant superintelligences run of the model may involve such inputs as well. (In those simulations, the loss may end up being minimal for whatever output the superintelligence wants the model to yield, regardless of the prescriptive information about the prior in the input.)