I see the problem of counterfactuals as essentially solved by quasi-Bayesianism, which behaves like UDT in all Newcomb-like situations. The source code in your presentation of the problem is more or less equivalent to Omega in Newcomb-like problems. A TRL agent can also reason about arbitrary programs, and learn that a certain program acts as a predictor for its own actions.
This approach has some similarity with material implication and proof-based decision theory, in the sense that out of several hypotheses about counterfactuals that are consistent with observations, the decisive role is played by the most optimistic hypothesis (the one that can be exploited for the most expected utility). However, it has no problem with global accounting, and indeed it solves counterfactual mugging successfully.
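As a toy rendering of that "most optimistic hypothesis" point (everything below is my own illustration, not the actual quasi-Bayesian decision rule): refuted hypotheses drop out, and among the survivors the one promising the most exploitable expected utility is the one that ends up determining which policy looks best.

```python
# Toy sketch only; hypothesis names, policies, and numbers are invented for illustration.
hypotheses = {
    # name -> (consistent with observations?, {policy: expected utility})
    "h_spurious": (False, {"pi_a": 5.0, "pi_b": 0.0}),  # refuted, so it plays no role
    "h_modest":   (True,  {"pi_a": 1.0, "pi_b": 2.0}),
    "h_optimist": (True,  {"pi_a": 9.0, "pi_b": 3.0}),  # most exploitable survivor
}

# Keep only the hypotheses consistent with the observation history.
consistent = {name: utils for name, (ok, utils) in hypotheses.items() if ok}

# The decisive hypothesis is the one offering the highest achievable utility...
decisive = max(consistent, key=lambda name: max(consistent[name].values()))
# ...and the chosen policy is the one that exploits it.
policy = max(consistent[decisive], key=consistent[decisive].get)

print(decisive, policy)  # -> h_optimist pi_a
```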
It seems the approaches we’re using are similar, in that they both start from an observation/action history with posited falsifiable laws, with the agent’s source code not known a priori, and with the agent considering different policies.
Learning “my source code is A” is quite similar to learning “Omega predicts my action is equal to A()”, so these would lead to similar results.
Policy-dependent source code, then, corresponds to Omega making different predictions depending on the agent’s intended policy, such that when comparing policies, the agent has to imagine Omega predicting differently (as it would imagine learning different source code under policy-dependent source code).
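To make the Omega analogy above concrete, here is a minimal toy calculation (the payoff numbers are the standard Newcomb values, and the names are my own, not anything from either framework): once the agent has learned the law tying Omega's prediction to its own action, comparing policies under that law already recovers one-boxing.

```python
# Toy sketch, assuming standard Newcomb payoffs; not a claim about either formalism.
PAYOFF = {
    # (action, prediction) -> utility
    ("one-box", "one-box"): 1_000_000,
    ("one-box", "two-box"): 0,
    ("two-box", "one-box"): 1_001_000,
    ("two-box", "two-box"): 1_000,
}

def value_under_learned_law(action: str) -> int:
    """Utility of `action` once the learned law ties the prediction to the action itself."""
    prediction = action  # the learned law: Omega predicts A()
    return PAYOFF[(action, prediction)]

best = max(("one-box", "two-box"), key=value_under_learned_law)
print(best, value_under_learned_law(best))  # -> one-box 1000000
```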
Well, in quasi-Bayesianism, for each policy you have to consider the worst-case environment in your belief set, and which environment is worst-case depends on the policy. I guess that in this sense it is analogous.
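As a rough formula (notation mine, and only a sketch of the maximin rule being described): with $\mathcal{E}$ the belief set of environments and $U$ the utility function, the chosen policy is

$$\pi^* \in \operatorname*{arg\,max}_{\pi} \; \min_{e \in \mathcal{E}} \; \mathbb{E}_{e,\pi}[U],$$

where the minimizing environment generally differs from policy to policy, which is the policy-dependence mentioned above.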