> The BFO can generally cope with humans observing $z_t = f(x_t)$ and modifying their behaviour because of it (i.e. it does not need a counterfactual approach).
Stuart, what’s your current opinion on backward-facing oracles? I ask because it seems like in a later post you changed your mind and went back to thinking that a counterfactual approach is needed to avoid manipulation after all. Is that right?
> However, it is not a fixed point that the BFO is likely to find, because $z_i + \epsilon$ would not be such an encoding for almost all $\epsilon$, so the basin of attraction for this $z_i$ is tiny (basically only those $\epsilon$ sufficiently small to preserve all the digits of $z_i$). Thus the BFO is very unlikely to stumble upon it.
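To make the quoted basin-of-attraction argument concrete, here is a minimal toy sketch of the dynamics. Everything in it is hypothetical: the map `f`, the "encoding" value `Z_ENC`, and the benign fixed point at 0.5 are illustrative stand-ins, not anything defined in the post.

```python
import random

# Toy model of a self-confirming oracle (hypothetical setup, for illustration only).
# The oracle publishes a prediction z; the world reacts, producing outcome f(z).
# A self-confirming prediction is a fixed point z* = f(z*).

Z_ENC = 0.7317234519  # hypothetical "manipulative encoding"; self-confirming only if every digit survives

def f(z, tol=1e-12):
    """World response to a published prediction z.
    Exactly at Z_ENC the encoding is read and confirms itself; any
    perturbation breaks the encoding and the dynamics revert to a smooth
    benign map that contracts toward a fixed point at z = 0.5."""
    if abs(z - Z_ENC) < tol:
        return z                  # the manipulative fixed point: its basin is essentially one point
    return 0.5 + 0.1 * (z - 0.5)  # contraction toward the benign fixed point 0.5

def iterate_prediction(z0, steps=100):
    """Repeatedly update the prediction toward self-consistency."""
    z = z0
    for _ in range(steps):
        z = f(z)
    return z

# From 10,000 random starting predictions, count how many land on Z_ENC.
hits = sum(abs(iterate_prediction(random.random()) - Z_ENC) < 1e-9
           for _ in range(10_000))
print("runs that found the manipulative fixed point:", hits)  # expect 0
```

Under these assumptions, iteration from any generic starting prediction converges to the benign fixed point, while the self-confirming encoding is a fixed point whose basin is (numerically) a single value, so random exploration essentially never finds it.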
This seems to constitute a form of overfitting or failure to generalize that is likely to be removed by future advances in ML or AI in general (if there aren’t already proposed regularizers that can do so). (A more capable AI wouldn’t need to have actually “stumbled upon” the basin of attraction of this fixed point in its past data in order to extrapolate that it exists.) If you stick to using the “standard ML” BFO in order to be safe from such manipulation, wouldn’t you lose out in competitiveness against other AIs due to this kind of overfitting / failure to generalize?
> Stuart, what’s your current opinion on backward-facing oracles? I ask because it seems like in a later post you changed your mind and went back to thinking that a counterfactual approach is needed to avoid manipulation after all. Is that right?
Basically, yes. I’m no longer sure the terminology and concept of the BFO are that useful, and I think all self-confirming oracles have problems. These problems need not be cleanly of a “manipulation” or “non-manipulation” type.

I also believe there are smoother ways to reach “manipulation”, so my point starting “However, it is not a fixed point that the BFO is likely to find...” is wrong.
I’ll add a comment at the beginning of the post, clarifying that it no longer reflects my current best model.