I think defect is the right answer in your AI problem and therefore that NDT gets it right, but I’m aware lots of LWers think otherwise. I haven’t researched this enough to want to argue it, but is there a discussion you’d recommend I read that spells out the reasoning? Otherwise I’ll just look through LW posts on prisoner’s dilemmas.
Secondly, I’d like to try to somehow incorporate logical effects into NDT. I agree they’re important. Any suggestions for where I could find lots of examples of decision problems where logical effects matter, to help me think about the general case?
> I think defect is the right answer in your AI problem and therefore that NDT gets it right
That’s surprising to me. Imagine that the situation is “prisoner’s dilemma with shared source code”, and that the AIs inspect each other’s source code and verify that (by some logical but non-causal miracle) they have exactly identical source code. Do you still think they do better to defect? I wouldn’t want to build an agent that defects in that situation :-p
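To make the shared-source scenario concrete, here is a minimal sketch. Everything in it is an illustrative assumption rather than anything specified in the thread: the agents are represented as source strings defining a hypothetical `act(my_source, their_source)` function, and the payoff numbers are the conventional 5/3/1/0 prisoner's dilemma values.

```python
# Toy "prisoner's dilemma with shared source code".
# Assumptions (not from the thread): agents are source strings that define
# act(my_source, their_source), and payoffs use the usual 5/3/1/0 values.

MIRROR_SOURCE = '''
def act(my_source, their_source):
    # Cooperate only when the opponent is verifiably the same program.
    return "C" if their_source == my_source else "D"
'''

DEFECT_SOURCE = '''
def act(my_source, their_source):
    # Always defect, whatever the opponent looks like.
    return "D"
'''

# (my move, their move) -> (my payoff, their payoff)
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def run_agent(source, their_source):
    """Load an agent from its source and ask it for a move."""
    namespace = {}
    exec(source, namespace)
    return namespace["act"](source, their_source)


def play(source_a, source_b):
    """Play one round in which each agent can read the other's source."""
    move_a = run_agent(source_a, source_b)
    move_b = run_agent(source_b, source_a)
    return PAYOFFS[(move_a, move_b)]


print(play(MIRROR_SOURCE, MIRROR_SOURCE))  # (3, 3)
print(play(DEFECT_SOURCE, DEFECT_SOURCE))  # (1, 1)
print(play(MIRROR_SOURCE, DEFECT_SOURCE))  # (1, 1)
```

Two identical "mirror" agents end up at (3, 3), while two identical always-defect agents end up at (1, 1), and the mirror agent is not exploited by the defector. The sketch only covers exact textual identity; it says nothing about similar but non-identical programs.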
An AI should certainly cooperate if it discovered that by chance its opposing AI had identical source code.
I read your paper and the two posts in your short sequence. Thanks for the links. I still think it’s very unlikely that one of the AIs in your original hypothetical (when they don’t examine each other’s source code) would do better by cooperating.
I accept that if an opposing AI had a model of you that was just decent but not great, then there is some amount of logical connection there. What I haven’t seen is any argument about the shape of the graph of logical connection strength vs similarity of entities. I hypothesize that for any two humans who exist today, if you put them in a one-shot PD, the logical connection is negligible.
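One crude way to see what "negligible" would have to mean is to collapse the logical connection into a single number. The sketch below assumes (purely for illustration) that the opponent's move matches yours with probability `p`, and uses the conventional payoffs T=5, R=3, P=1, S=0. Treating the connection as a match probability is exactly the kind of simplification that has no agreed justification, so this is a toy model, not a proposed formalism.

```python
# Toy model: the "logical connection" is reduced to p, the probability that
# the opponent's move matches mine. Both p and the payoff values
# (T=5, R=3, P=1, S=0) are illustrative assumptions.

def expected_payoffs(p, T=5, R=3, P=1, S=0):
    """Return (expected payoff of cooperating, expected payoff of defecting)."""
    ev_cooperate = p * R + (1 - p) * S  # matched -> (C, C), else (C, D)
    ev_defect = p * P + (1 - p) * T     # matched -> (D, D), else (D, C)
    return ev_cooperate, ev_defect


def cooperation_threshold(T=5, R=3, P=1, S=0):
    """Smallest p above which cooperating has the higher expected payoff.

    Derived from p*R + (1-p)*S > p*P + (1-p)*T.
    """
    return (T - S) / ((R - P) + (T - S))


def better_action(p, **payoffs):
    ev_cooperate, ev_defect = expected_payoffs(p, **payoffs)
    return "C" if ev_cooperate > ev_defect else "D"


print(cooperation_threshold())  # 5/7, roughly 0.714
print(better_action(0.5))       # D
print(better_action(0.9))       # C
```

Under these assumptions, cooperation only wins once the match probability exceeds 5/7, so a connection that barely moves `p` above one half would leave defection ahead.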
Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
> Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
Nope! That’s the open part of the problem :-) We don’t know how to build a decision network with logical nodes, and we don’t know how to propagate a “logical update” between nodes. (That is, we don’t have a good formalism of how changing one algorithm logically affects a related but non-identical algorithm.)
If we had the latter thing, we wouldn’t even need the “logical decision network”, because we could just ask “if I change the agent, how does that logically affect the universe?” (as both are algorithms). This idea is the basis of proof-based UDT, which tries to answer the question by searching for proofs under the assumption “Agent()=a” for various actions. Proof-based UDT has lots of problems of its own, though, and thinking about logical updates in logical graphs is a fine angle of approach.
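The search loop of proof-based UDT can be sketched roughly as follows. This is a simplified illustration under loud assumptions: a real agent would search for formal proofs of statements like “Agent()=a implies U()=u”, whereas here the hypothetical `proves` callback is a stand-in that just looks the answer up in a tiny hand-written table. The function names and the twin prisoner's dilemma example are made up for the sketch.

```python
# Rough sketch of the proof-based UDT search loop.
# Assumption: `proves(a, u)` stands in for "a proof of Agent()=a -> U()=u
# was found". No real theorem prover is involved.

def proof_based_udt(actions, utilities, proves):
    """Try utilities from best to worst; return the first action a for
    which the implication Agent()=a -> U()=u is 'provable'."""
    for u in sorted(utilities, reverse=True):
        for a in actions:
            if proves(a, u):
                return a
    return actions[0]  # default action when nothing is provable


# Toy universe: a prisoner's dilemma against an exact copy, so both
# players always make the same move. Payoffs are illustrative.
TWIN_PD = {"C": 3, "D": 1}


def toy_proves(action, utility):
    # Stand-in for proof search: check the implication by direct lookup.
    return TWIN_PD[action] == utility


choice = proof_based_udt(["D", "C"], [0, 1, 3, 5], toy_proves)
print(choice)  # C
```

The lookup table hides the hard part: in the toy version the consequences of each action are simply given, while the open problem is how an agent can establish them when the connection between algorithms is not exact identity.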
Thanks. I had one question about your Toward Idealized Decision Theory paper.
I can’t say I fully understand UDT, but the ‘updateless’ part does seem very similar to the “act as if you had precommitted to any action that you’d have wanted to precommit to” core idea of NDT. It’s not clear to me that the super powerful UDT would make the wrong decision in the game where two players each pick a number between 0 and 10 and get payouts based on their pick and the total sum.
Wouldn’t the UDT reason as follows? “If my algorithm were such that I wouldn’t just pick 1 when the human player forced me into it by picking 9 (for instance maybe I always pick 5 in this game), then I may still have a reputation as a powerful predictor but it’s much more likely that I’d also have a reputation as an entity that can’t be bullied like this, so the human would be less likely to pick 9. That state of the world is better for me, so I shouldn’t be the type of agent that makes the greedy choice to pick 1 when I predict the human will pick 9.”
The argument in your paper seems to rely on the human assuming the UDT will reason like a CDT once it knows the human will pick 9.
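The difference between the two kinds of agent in this argument can be sketched with a small simulation. The thread only says that payouts depend on each player's pick and the total sum, so the payoff rule here is an assumption: each player receives their own pick if the two picks sum to at most 10, and nothing otherwise. The tie-breaking rule for the greedy agent and the policy names are also hypothetical.

```python
# Toy version of the number-picking game.
# Assumed payoff rule (not stated in the thread): a player receives their
# own pick if the two picks sum to at most 10, otherwise 0.

PICKS = range(0, 11)


def payoff(mine, theirs):
    return mine if mine + theirs <= 10 else 0


def greedy_ai(human_pick):
    """Best-respond to the predicted human pick.

    Assumption: ties are broken toward the larger number.
    """
    return max(PICKS, key=lambda a: (payoff(a, human_pick), a))


def stubborn_ai(human_pick):
    """Pick 5 regardless of what the human is predicted to do."""
    return 5


def human_best_pick(ai_policy):
    """The human knows the AI's policy and picks to maximize their own payoff."""
    return max(PICKS, key=lambda h: payoff(h, ai_policy(h)))


def play(ai_policy):
    human = human_best_pick(ai_policy)
    ai = ai_policy(human)
    return human, ai, payoff(ai, human)  # (human pick, AI pick, AI payoff)


print(play(greedy_ai))    # (9, 1, 1)
print(play(stubborn_ai))  # (5, 5, 5)
```

Against a human who knows its policy, the greedy best-responder is pushed down to a payoff of 1, while the agent that always picks 5 receives 5. The sketch only compares two fixed policies; it does not show how a formal decision procedure would arrive at the stubborn one.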
> the ‘updateless’ part does seem very similar to the “act as if you had precommitted to any action that you’d have wanted to precommit to” core idea of NDT
Yep, that’s a common intuition pump people use in order to understand the “updateless” part of UDT.
> It’s not clear to me that the super powerful UDT would make the wrong decision in the game where two players pick numbers between 0 and 10
A proof-based UDT agent would—this follows from the definition of proof-based UDT. Intuitively, we surely want a decision theory that reasons as you said, but the question is, can you write down a decision algorithm that actually reasons like that?
Most people agree with you on the philosophy of how an idealized decision theory should act, but the hard part is formalizing a decision theory that actually does the right things. The difficulty isn’t in the philosophy; it’s in turning the philosophy into math :-)
The paper that jessicat linked in the parent post is a decent introduction to the notion of logical counterfactuals. See also the “Idealized Decision Theory” section of this annotated bibliography, and perhaps also this short sequence I wrote a while back.