Dialogic RL assumes that the user has beliefs about the AI’s ontology. This includes the environment(fn1) from the AI’s perspective. In other words, the user needs to have beliefs about the AI’s counterfactuals (the things that would happen if the AI chose different possible actions). But what are the semantics of the AI’s counterfactuals from the user’s perspective? This is more or less the same question that was studied by the MIRI-sphere for a while, starting from Newcomb’s paradox, TDT, et cetera. Luckily, I now have an answer based on the incomplete models formalism, and this answer applies quite naturally in this case as well.
Specifically, we assume that there is a sense, meaningful to the user, in which ey select the AI policy (program the AI). Therefore, from the user’s perspective, the AI policy is a user action. Again from the user’s perspective, the AI’s actions and observations are all part of the outcome. The user’s beliefs about the user’s counterfactuals can therefore be expressed as σ:Π→Δ((A×O)^ω)(fn2), where Π is the space of AI policies(fn3). We assume that for every π∈Π, σ(π) is consistent with π in the natural sense. Such a belief can be transformed into an incomplete model from the AI’s perspective, using the same technique we used to solve Newcomb-like decision problems, with σ playing the role of Omega. For a deterministic AI, this model looks like: (i) at first, “Murphy” makes a guess that the AI’s policy is π=π_guess; (ii) the environment behaves according to the conditional measures of σ(π_guess); (iii) if the AI’s policy ever deviates from π_guess, the AI immediately enters an eternal “Nirvana” state with maximal reward. For a stochastic AI, we need to apply the technique with statistical tests and multiple models alluded to in the link. This can also be generalized to the setting where the user’s beliefs are already an incomplete model, by adding another step where Murphy chooses σ out of some set.
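To make the deterministic-AI construction concrete, here is a minimal Python sketch (all names, e.g. `NirvanaModel`, `sigma_conditional`, `reward_fn`, are hypothetical illustrations rather than anything from the formalism). It represents a single hypothesis of the AI-side incomplete model, corresponding to one guess π_guess: observations follow the conditionals of σ(π_guess) while the AI plays along, and any deviation triggers the absorbing Nirvana state with maximal reward.

```python
from typing import Callable, List, Optional, Tuple

Action = str
Observation = str
History = List[Tuple[Action, Observation]]  # a finite element of (A×O)*

MAX_REWARD = 1.0  # "Nirvana" pays the maximal reward forever (assumes rewards in [0, 1])


class NirvanaModel:
    """One hypothesis in the AI-side incomplete model, for a deterministic AI.

    (i)   Murphy guesses the AI's policy, `guessed_policy`.
    (ii)  While the AI follows the guess, observations are drawn from the
          conditionals of the user's belief σ(π_guess), here `sigma_conditional`.
    (iii) Any deviation from the guess leads to an absorbing Nirvana state.
    """

    def __init__(
        self,
        guessed_policy: Callable[[History], Action],
        sigma_conditional: Callable[[History, Action], Observation],
        reward_fn: Callable[[History], float],
    ) -> None:
        self.guessed_policy = guessed_policy
        self.sigma_conditional = sigma_conditional
        self.reward_fn = reward_fn
        self.history: History = []
        self.in_nirvana = False

    def step(self, action: Action) -> Tuple[Optional[Observation], float]:
        """The AI takes `action`; return the resulting (observation, reward)."""
        if self.in_nirvana:
            return None, MAX_REWARD
        if action != self.guessed_policy(self.history):
            self.in_nirvana = True  # (iii): off-guess, eternal Nirvana
            return None, MAX_REWARD
        obs = self.sigma_conditional(self.history, action)  # (ii): on-guess dynamics
        self.history.append((action, obs))
        return obs, self.reward_fn(self.history)
```

The full incomplete model would then be the set of such hypotheses, one per candidate guess; the stochastic-AI case would instead need the statistical-test construction mentioned above.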
What we constructed is a method of translating counterfactuals from the user’s perspective to the AI’s perspective. In particular, the AI will inherit the user’s level of “updatelessness” (in the sense that, if the user’s counterfactuals are defined w.r.t. a particular effective precommitment point, the AI will use the same point). This translation may be implemented either (i) by the user, by explaining these semantics to em, or (ii) by the AI, in which case the formal language should refer to the user’s counterfactuals rather than the AI’s counterfactuals.
(fn1) Up to an equivalence relation, that’s a mapping ν:(A×O)*×A→ΔO.
(fn2) For infinite AI lifetime. We can trivially generalize this to allow for finite AI lifetime as well.
(fn3) Up to an equivalence relation, they are mappings π:(A×O)*→ΔA. We may add computability/complexity constraints and represent them as programs.
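For concreteness, the type signatures in the footnotes can be written out as hypothetical Python type aliases (none of these names come from the original text; a “distribution over X” is represented crudely by a zero-argument sampler):

```python
from typing import Callable, Iterator, List, Tuple

Action = str
Observation = str
History = List[Tuple[Action, Observation]]  # element of (A×O)*

Environment = Callable[[History, Action], Callable[[], Observation]]  # fn1: ν:(A×O)*×A→ΔO
Policy = Callable[[History], Callable[[], Action]]                    # fn3: π:(A×O)*→ΔA
Outcome = Iterator[Tuple[Action, Observation]]                        # element of (A×O)^ω (fn2)
UserBelief = Callable[[Policy], Callable[[], Outcome]]                # σ:Π→Δ((A×O)^ω)
```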
Nirvana and the chicken rule both smell distasteful, like proofs by contradiction, as though most everything worth doing can be done without them, and more canonically to boot.
(Conjecture: This can be proven, but only by contradiction.)
Maybe? I am not sure that I like Nirvana, but it doesn’t seem that bad. If someone thinks of a solution without it, I would be interested.