Jim Buhler
But overall I think working on alignment is considerably more urgent. Being able to understand what’s going on at all inside a neural net, and advocating that companies be required to understand what’s going on before developing new/bigger/better models, seems like a convergent goal relevant to both human extinction and astronomical suffering.
Fwiw, Lukas’s comment links to a post arguing against that, and I 100% agree with it. I think “Alignment will solve s-risks as well anyway” is one of the most untrue and harmful widespread memes in the EA/LW community.
“aesthetically”?
[Question] Is the fact that we don’t observe any obvious glitch evidence that we’re not in a simulation?
Interesting! Did thinking about those variants make you update your credences in SIA/SSA (or something else)?
(Btw, maybe it’s worth adding the motivation for thinking about these problems in the intro of the post.) :)
Thanks a lot for these comments, Oscar! :)
I think something can’t be both neat and so vague as to use a word like ‘significant’.
I forgot to copy-paste a footnote clarifying that “as made explicit in the Appendix, what “significant” exactly means depends on the payoffs of the game”! Fixed. I agree this is still vague, although I guess it has to be, since the payoffs are left unspecified?
In the EDT section of Perfect-copy PD, you replace some p’s with q’s and vice versa, but not all. Is there a principled reason for this? Maybe it is just a mistake and it should be U_Alice(p) = 4p - p^2 - p + 1 = 1 + 3p - p^2 and U_Bob(q) = 4q - q^2 - q + 1 = 1 + 3q - q^2.
Also a copy-pasting mistake. Thanks for catching it! :)
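For concreteness, here is a small numerical sketch of the corrected expected utility. The payoffs are not stated in this thread; I am inferring the standard PD values T=5, R=3, P=1, S=0, since those are the ones that reproduce the 1 + 3p - p^2 closed form (the post’s Appendix has the actual setup):

```python
# Hypothetical PD payoffs, inferred from the closed form 1 + 3p - p^2:
# T (temptation) = 5, R (reward) = 3, P (punishment) = 1, S (sucker) = 0.
T, R, P, S = 5, 3, 1, 0

def edt_utility(p):
    """Alice's EDT expected utility against a perfect copy.

    Under EDT with a perfect copy, Bob's cooperation probability q
    tracks Alice's own p, so we set q = p throughout.
    """
    q = p  # perfect-copy assumption: the copy's mixture equals Alice's
    return (R * p * q + S * p * (1 - q)
            + T * (1 - p) * q + P * (1 - p) * (1 - q))

# Check that this matches the closed form 1 + 3p - p^2 on a grid:
for i in range(11):
    p = i / 10
    assert abs(edt_utility(p) - (1 + 3 * p - p ** 2)) < 1e-12

# dU/dp = 3 - 2p > 0 on [0, 1], so the EDT optimum is p = 1 (cooperate):
best_p = max((i / 100 for i in range(101)), key=edt_utility)
print(best_p)  # -> 1.0
```

Since the derivative 3 - 2p stays positive on [0, 1], full cooperation (p = 1, payoff R = 3) dominates every mixed strategy for an EDT agent facing a perfect copy under these assumed payoffs.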
I am unconvinced of the utility of the concept of compatible decision theories. In my mind I am just thinking of it as ‘entanglement can only happen if both players use decision theories that allow for superrationality’. I am worried your framing would imply that two CDT players are entangled, when I think they are not; they just happen to both always defect.
This may be an unimportant detail, but—interestingly—I opted for this concept of “compatible DT” precisely because I wanted to imply that two CDT players may be decision-entangled! Say CDT-agent David plays a PD against a perfect copy of himself. Their decisions to defect are entangled, right? Whatever David does, his copy does the same (although David sort of “ignores” that when he makes his decision). David is very unlikely to be decision-entangled with any random CDT agent, however (in that case, the mutual defection is just a “coincidence” and is not due to some dependence between their respective reasoning/choices). I didn’t mean the concept of “decision-entanglement” to pre-assume superrationality. I want CDT-David to agree/admit that he is decision-entangled with his perfect copy. Nonetheless, since he doesn’t buy superrationality, I know that he won’t factor the decision-entanglement into his expected value optimization (he won’t “factor in the possibility that p=q”.) That’s why you need significant credence in both decision-entanglement and superrationality to get cooperation, here. :)
Also, if decision-entanglement is an objective feature of the world, then I would think it shouldn’t depend on what decision theory I personally hold. I could be a CDTer who happens to have a perfect copy and so be decision-entangled, while still refusing to believe in superrationality.
Agreed, but if you’re a CDTer, you can’t be decision-entangled with an EDTer, right? Say you’re both told you’re decision-entangled. What happens? Well, you don’t care, so you still defect while the EDTer cooperates. Different decisions. So… you two weren’t entangled after all. The person who told you you were was mistaken.
So yes, decision-entanglement can’t depend on your DT per se, but doesn’t it have to depend on its “compatibility” with the other’s for there to be any dependence between your algos/choices? How could a CDTer and an EDTer be decision-entangled in a PD?
Not very confident about my answers. Feel free to object. :) And thanks for making me rethink my assumptions/definitions!
Oh nice, thanks for this! I think I now see much more clearly why we’re both confused about what the other thinks.
(I’ll respond using my definitions/framing which you don’t share, so you might find this confusing, but hopefully, you’ll understand what I mean and agree although you would frame/explain things very differently.)
Say Bob is CooperateBot. Alice may believe she’s decision-entangled with them, in which case she (subjectively) should cooperate, but that doesn’t mean that their decisions are logically dependent (i.e., that her belief is warranted). If Alice changes her decision and defects, Bob’s decision remains the same. So unless Alice is also a CooperateBot, her belief b (“my decision and Bob’s are logically dependent / entangled such that I must cooperate”) is wrong. There is no decision-entanglement. Just “coincidental” mutual cooperation. You can still argue that Alice should cooperate given that she believes b of course, but b is false. If only she could realize that, she would stop naively cooperating and get a higher payoff.
It matters what their beliefs are to know what they will do, but two agents believing their decisions are logically dependent doesn’t magically create logical dependency.
If I play a one-shot PD against you and we both believe we should cooperate, that doesn’t mean that we necessarily both defect in a counterfactual scenario where one of us believes they should defect (i.e., that doesn’t mean there is decision-entanglement / logical dependency, i.e., that doesn’t mean that our belief that we should cooperate is warranted, i.e., that doesn’t mean that we’re not two suckers cooperating for wrong reasons while we could be exploiting the other and avoid being exploited). And whether we necessarily both defect in a counterfactual scenario where one of us believes they should defect (i.e., whether we are decision-entangled) depends on how we came to our beliefs that our decisions are logically dependent and that we must cooperate (as illustrated—in a certain way—in my above figures).
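One toy way to cash out this counterfactual test (my own illustrative formalization, not anything from the post): model Bob as a policy mapping Alice’s decision to his own, and check whether his output actually changes when Alice’s does.

```python
# Toy formalization (illustrative only): decision-entanglement as
# counterfactual dependence between two agents' decisions.

def cooperate_bot(alice_decision):
    """Bob as CooperateBot: cooperates no matter what Alice does."""
    return "C"

def perfect_copy(alice_decision):
    """Bob as a perfect copy of Alice: his decision mirrors hers."""
    return alice_decision

def decision_entangled(bob_policy):
    """Bob's decision counterfactually depends on Alice's: if Alice
    switched from cooperating to defecting, Bob's output would change."""
    return bob_policy("C") != bob_policy("D")

print(decision_entangled(cooperate_bot))  # -> False
print(decision_entangled(perfect_copy))   # -> True
```

With CooperateBot, mutual cooperation with a cooperating Alice is “coincidental”: flipping Alice’s decision leaves Bob’s unchanged, so the belief b in the comment above comes out false. With a perfect copy, the dependence is genuine.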
After reading that, I’m really starting to think that we (at least mostly) agree but that we just use incompatible framings/definitions to explain things.
Fwiw, while I see how my framing can seem unnecessarily confusing, I think yours is usually used/interpreted oversimplistically (by you but also and especially by others) and is therefore extremely conducive to Motte-and-bailey fallacies[1] leading us to widely underestimate the fragility of decision-entanglement. I might be confused though, of course.
Thanks a lot for your comment! I think I understand you much better now and it helped me reclarify things in my mind. :)
E.g., it’s easy to argue that widely different agents may converge on the exact same DT, but not if you include intricacies like the one in your last paragraph.