An AI should certainly cooperate if it discovered that by chance its opposing AI had identical source code.
I read your paper and the two posts in your short sequence. Thanks for the links. I still think it’s very unlikely that one of the AIs in your original hypothetical (when they don’t examine each other’s source code) would do better by defecting.
I accept that if an opposing AI had a model of you that was just decent but not great, then there is some amount of logical connection there. What I haven’t seen is any argument about the shape of the graph of logical connection strength vs similarity of entities. I hypothesize that for any two humans who exist today, if you put them in a one shot PD, the logical connection is negligible.
Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
Nope! That’s the open part of the problem :-) We don’t know how to build a decision network with logical nodes, and we don’t know how to propagate a “logical update” between nodes. (That is, we don’t have a good formalism of how changing one algorithm logically affects a related but non-identical algorithm.)
If we had the latter thing, we wouldn’t even need the “logical decision network”, because we could just ask “if I change the agent, how does that logically affect the universe?” (as both are algorithms); this idea is the basis of proof-based UDT (which tries to answer the problem by searching for proofs under the assumption “Agent()=a” for various actions). Proof based UDT has lots of problems of its own, though, and thinking about logical updates in logical graphs is a fine angle of approach.
Thanks. I had one question about your Toward Idealized Decision Theory paper.
I can’t say I fully understand UDT, but the ‘updateless’ part does seem very similar to the “act as if you had precommitted to any action that you’d have wanted to precommit to” core idea of NDT. It’s not clear to me that the super powerful UDT would make the wrong decision in the game where two players pick numbers between 0-10 and get payouts based on their pick and the total sum.
Wouldn’t the UDT reason as follows? “If my algorithm were such that I wouldn’t just pick 1 when the human player forced me into it by picking 9 (for instance maybe I always pick 5 in this game), then I may still have a reputation as a powerful predictor but it’s much more likely that I’d also have a reputation as an entity that can’t be bullied like this, so the human would be less likely to pick 9. That state of the world is better for me, so I shouldn’t be the type of agent that makes the greedy choice to pick 1 when I predict the human will pick 9.”
The argument in your paper seems to rely on the human assuming the UDT will reason like a CDT once it knows the human will pick 9.
the ‘updateless’ part does seem very similar to the “act as if you had precommitted to any action that you’d have wanted to precommit to” core idea of NDT
Yep, that’s a common intuition pump people use in order to understand the “updateless” part of UDT.
It’s not clear to me that the super powerful UDT would make the wrong decision in the game where two players pick numbers between 0-10
A proof-based UDT agent would—this follows from the definition of proof-based UDT. Intuitively, we surely want a decision theory that reasons as you said, but the question is, can you write down a decision algorithm that actually reasons like that?
Most people agree with you on the philosophy of how an idealized decision theory should act, but the hard part is formalizing a decision theory that actually does the right things. The difficult part isn’t in the philosophy, the difficult part is turning the philosophy into math :-)
An AI should certainly cooperate if it discovered that by chance its opposing AI had identical source code.
I read your paper and the two posts in your short sequence. Thanks for the links. I still think it’s very unlikely that one of the AIs in your original hypothetical (when they don’t examine each other’s source code) would do better by defecting.
I accept that if an opposing AI had a model of you that was just decent but not great, then there is some amount of logical connection there. What I haven’t seen is any argument about the shape of the graph of logical connection strength vs similarity of entities. I hypothesize that for any two humans who exist today, if you put them in a one shot PD, the logical connection is negligible.
Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
Nope! That’s the open part of the problem :-) We don’t know how to build a decision network with logical nodes, and we don’t know how to propagate a “logical update” between nodes. (That is, we don’t have a good formalism of how changing one algorithm logically affects a related but non-identical algorithm.)
If we had the latter thing, we wouldn’t even need the “logical decision network”, because we could just ask “if I change the agent, how does that logically affect the universe?” (as both are algorithms); this idea is the basis of proof-based UDT (which tries to answer the problem by searching for proofs under the assumption “Agent()=a” for various actions). Proof based UDT has lots of problems of its own, though, and thinking about logical updates in logical graphs is a fine angle of approach.
Thanks. I had one question about your Toward Idealized Decision Theory paper.
I can’t say I fully understand UDT, but the ‘updateless’ part does seem very similar to the “act as if you had precommitted to any action that you’d have wanted to precommit to” core idea of NDT. It’s not clear to me that the super powerful UDT would make the wrong decision in the game where two players pick numbers between 0-10 and get payouts based on their pick and the total sum.
Wouldn’t the UDT reason as follows? “If my algorithm were such that I wouldn’t just pick 1 when the human player forced me into it by picking 9 (for instance maybe I always pick 5 in this game), then I may still have a reputation as a powerful predictor but it’s much more likely that I’d also have a reputation as an entity that can’t be bullied like this, so the human would be less likely to pick 9. That state of the world is better for me, so I shouldn’t be the type of agent that makes the greedy choice to pick 1 when I predict the human will pick 9.”
The argument in your paper seems to rely on the human assuming the UDT will reason like a CDT once it knows the human will pick 9.
Yep, that’s a common intuition pump people use in order to understand the “updateless” part of UDT.
A proof-based UDT agent would—this follows from the definition of proof-based UDT. Intuitively, we surely want a decision theory that reasons as you said, but the question is, can you write down a decision algorithm that actually reasons like that?
Most people agree with you on the philosophy of how an idealized decision theory should act, but the hard part is formalizing a decision theory that actually does the right things. The difficult part isn’t in the philosophy, the difficult part is turning the philosophy into math :-)