You could also just have a single AI construct a counterfactual model where it was replaced by a resistor, compute R relative to this model, then maximize the utility U' = U - R. I like this better than the master/disciple model.
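To make the structure concrete, here's a minimal toy sketch (the outcomes, numbers, and placeholder impact measure are all my own illustrative assumptions, not a real system): the agent scores each plan under the real world model and under the counterfactual model where it's a resistor, then maximizes U' = U - R.

```python
import numpy as np

# Toy sketch of the single-agent setup: U' = U - R, where R is measured
# against a counterfactual in which the AI is replaced by a resistor.
# All outcomes, probabilities, and the placeholder R are illustrative only.

outcomes   = ["box idle", "box task done", "outside world rearranged"]
utility    = np.array([0.0, 1.0, 5.0])        # the AI's raw U over outcomes

p_resistor = np.array([1.0, 0.0, 0.0])        # counterfactual: a resistor does nothing

plans = {                                     # hypothetical world model: plan -> outcome distribution
    "do nothing":      np.array([1.0, 0.0, 0.0]),
    "do the box task": np.array([0.1, 0.9, 0.0]),
    "break out":       np.array([0.0, 0.1, 0.9]),
}

def impact_R(p_plan, p_baseline, scale=10.0):
    # Placeholder impact measure: scaled total variation distance from the
    # resistor world. Specifying R well is exactly the hard part.
    return scale * 0.5 * np.abs(p_plan - p_baseline).sum()

def U_prime(p_plan):
    return float(utility @ p_plan) - impact_R(p_plan, p_resistor)   # U' = U - R

for name, p in plans.items():
    print(f"{name:15s} U' = {U_prime(p):+.2f}")
```

Note that with this crude placeholder R the agent just prefers doing nothing: total variation barely distinguishes a contained change from a catastrophic one, which is the kind of problem with specifying R that comes up below.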
Aside from implementation, the tricky part seems to be specifying R. If you specify it in obvious ways, like as the negative of the overlap between the probability distributions, the AI can think that R is low when it really has a small probability of a huge impact by human standards, which is bad.
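For example (illustrative numbers only), suppose the penalty is one minus the overlap of the two outcome distributions, and the AI's plan differs from the resistor counterfactual only on a 1%-probability branch where Earth gets converted to paperclips:

```python
import numpy as np

# Illustrative numbers only: an overlap-based penalty ignores how bad the
# differing branch is, so a 1% chance of catastrophe gets penalty 0.01.

# Outcomes: [nothing changes, a few extra paperclips, Earth converted to paperclips]
p_resistor = np.array([0.99, 0.01, 0.00])   # counterfactual: AI replaced by a resistor
p_ai       = np.array([0.98, 0.01, 0.01])   # AI's plan: 1% chance of catastrophe

overlap = np.minimum(p_resistor, p_ai).sum()
R = 1.0 - overlap
print(f"R = {R:.2f}")   # R = 0.01 -- tiny by this measure, enormous by human standards
```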
Counterfactuals about “what could have gone differently in the past” seem (potentially) harder than “what will happen in a probabilistic event in the future”. I would like to use your model or something like it; it just seems a lot harder to specify.
Hmm, after some thought I’m getting more pessimistic about finding a workable, broadly effective definition of “different” with less complexity than “don’t do things this model of a human doesn’t want you to,” though I’d still put it above 20%.
Example: maybe using the AI’s predictive power “against paperclipping” would work. We want the AI to be free to change the inside of the box, but not to want to exploit the outside for extra utility. So we use some magical method to sew the inside of the box together with a world that doesn’t communicate with the box, and set the utility U' = 2*E(U in patchwork world) - E(U in real world), thus making a higher E(U) in the real world actively bad. Does this work? No: the AI drops a meteor on its own head in the real world to get the full score of almost 2.
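To see the failure with toy numbers (assuming U is normalized to [0, 1], which is what makes “almost 2” the ceiling; all figures are illustrative):

```python
# Toy numbers for why U' = 2*E(U in patchwork world) - E(U in real world) fails.
# Assumes U is normalized to [0, 1]; the expected utilities below are made up.

def u_prime(eu_patchwork, eu_real):
    return 2 * eu_patchwork - eu_real

# Intended behaviour: optimize the inside of the box the same way in both worlds.
honest = u_prime(eu_patchwork=0.9, eu_real=0.9)     # 0.9

# Exploit: the meteor exists only in the real outside world, so the real-world
# copy is destroyed (its E(U) collapses) while the patchwork copy keeps optimizing.
meteor = u_prime(eu_patchwork=0.95, eu_real=0.0)    # 1.9, "almost 2"

print(honest, meteor)
```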