“The key point is, ‘applying the counterfactual belief that the predictor is always right’ is not really well-defined”—What do you mean here?
I’m curious whether you’re referring to the same issue as, or something similar to, the one I was discussing in Counterfactuals for Perfect Predictors. The TLDR is that I was worried it would be inconsistent for an agent that never pays in Parfit’s Hitchhiker to end up in town if the predictor is perfect, so that it wouldn’t actually be well-defined what the predictor was predicting. The way I ended up resolving this was to imagine the agent as something that takes an input, and to ask what it would output if given that (inconsistent) input. But I’m not sure whether you were referencing this kind of concern or something else.
It is not a mere “concern”, it’s the crux of the problem, really. What people in the AI alignment community have been trying to do is start with some factual, “objective” description of the universe (such as a program or a mathematical formula) and derive counterfactuals from it. The way it’s supposed to work is that the agent locates all copies of itself, or things “logically correlated” with itself (whatever that means), inside the program, and imagines that it controls those parts. But a rigorous definition of this that solves all standard decision-theoretic scenarios has never been found.
Instead of doing that, I suggest a solution of a different nature. In quasi-Bayesian RL, the agent never arrives at a factual, objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).
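For concreteness, here is a minimal sketch (my own toy illustration, not the quasi-Bayesian RL formalism itself) of what “optimal expected utility in Newcomb-like scenarios” amounts to, assuming the standard payoffs of $1,000,000 in the opaque box and $1,000 in the transparent one: under the counterfactual belief that the predictor is always right, one-boxing comes out ahead, which is the payoff UDT promises.

```python
# Toy comparison of expected payoffs in Newcomb's problem.
# Assumes the standard payoffs: $1,000,000 in the opaque box,
# $1,000 in the transparent box.

def expected_utility(action: str, predictor_accuracy: float) -> float:
    """Expected payoff of 'one-box' or 'two-box', given the probability
    that the predictor correctly anticipates the agent's action."""
    big, small = 1_000_000, 1_000
    if action == "one-box":
        # The opaque box is full exactly when the predictor foresaw one-boxing.
        return predictor_accuracy * big
    else:
        # Two-boxing: the opaque box is full only if the predictor was wrong.
        return (1 - predictor_accuracy) * big + small

# With the counterfactual belief that the predictor is always right
# (accuracy = 1.0), one-boxing yields 1,000,000 vs. 1,000 for two-boxing.
for accuracy in (1.0, 0.99):
    for action in ("one-box", "two-box"):
        print(f"accuracy={accuracy}: {action} -> {expected_utility(action, accuracy):,.0f}")
```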
Yeah, I agree that the objective descriptions can leave out vital information, such as how the information you know was acquired, which seems important for determining the counterfactuals.