Excellent explanation, congratulations! Sad I’ll have to miss the discussion.
Interlocutor: Neither option is plausible. If you update, you’re not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you’re simply advising people to be delusional.
You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, which makes you susceptible to infohazards, i.e., traps (if they exist, and they might exist), or you don’t, which leaves you entrenched forever. I think we need to stop dancing around this fact, recognize that a fully general solution in the formalism is not possible, and instead look into the details of our particular case. Sure, our environment might be adversarially bad, and traps might be everywhere. But under this uncertainty, which ways do we think are best to recognize and prevent traps (while updating on other things)? This is a kind of studying and predicting generalization: given my past observations, where do I think I will suddenly fall out of distribution (into a trap)?
Me: I’m not sure if that’s exactly the condition, but at least it motivates the idea that there’s some condition differentiating when we should be updateful vs updateless. I think uncertainty about “our own beliefs” is subtly wrong; it seems more like uncertainty about which beliefs we endorse.
This was very thought-provoking, but unfortunately I still think it crashes head-on into the realization that, a priori and in full generality, we can’t differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us for updating on “our own beliefs” or “which beliefs I endorse”? After all, that’s just one more part of reality (with no clear boundary separating it from the rest).
It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can’t know the correct one in advance; we always have to rely on extrapolating contingent past observations. But then, it seems like your reaction is still hoping that we can have our cake and eat it: “I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I’m in the Infinite Counterlogical Mugging… then I will just eventually change my prior because I noticed I’m in the bad world!” But then again, why would we think this update is safe? That’s just not being updateless, and it loses out on the strategic gains from not updating.
Since a solution doesn’t exist in full generality, I think we should pivot to more concrete work related to the “content” (our particular human priors and our particular environment) instead of the “formalism”. For example:
Conceptual or empirical work on the robust and safe ways to extract information from humans (suddenly, LLM pre-training becomes safety work)
Conceptual or empirical work on which actions or lines of reasoning are more likely to unearth traps under different assumptions (although this work could itself unearth traps)
Compilation or observation of properties of our environment (our physical reality) that could provide some weak signal about which kinds of moves are safe
Unavoidably, this will involve some philosophical / almost-ethical reflection about which worlds we care about and which ones we are willing to give up.
This was very thought-provoking, but unfortunately I still think it crashes head-on into the realization that, a priori and in full generality, we can’t differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us for updating on “our own beliefs” or “which beliefs I endorse”? After all, that’s just one more part of reality (with no clear boundary separating it from the rest).
I’m comfortable explicitly assuming this isn’t the case for the sake of nice, clean decision-theoretic results, so long as it looks like the resulting decision theory also handles this possibility ‘somewhat sanely’.
It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can’t know the correct one in advance; we always have to rely on extrapolating contingent past observations. But then, it seems like your reaction is still hoping that we can have our cake and eat it: “I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I’m in the Infinite Counterlogical Mugging… then I will just eventually change my prior because I noticed I’m in the bad world!” But then again, why would we think this update is safe? That’s just not being updateless, and it loses out on the strategic gains from not updating.
My thinking is more that we should accept the offer finitely many times or some fraction of the times, so that we reap some of the gains from updatelessness while also ‘not sacrificing too much’ in particular branches.
That is: in this case at least it seems like there’s concrete reason to believe we can have some cake and eat some too.
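A minimal sketch of what I have in mind, with illustrative payoffs in the spirit of the usual counterfactual mugging (the 100/10,000 numbers and the fair coin are just assumptions for the example): the policy “pay in at most the first k rounds” caps the loss in the unlucky branch at 100·k while still capturing the expected gains from those k rounds.

```python
# Toy model of "accept the offer only finitely many times" in an iterated
# counterfactual-mugging-style problem. Payoffs are illustrative assumptions:
# paying costs 100 in the branch where you are asked; being the kind of agent
# who pays earns 10_000 in the other branch. Each branch has probability 1/2
# per round, and refusing yields 0 in both branches.

COST, REWARD, P_ASKED = 100, 10_000, 0.5

def evaluate(k: int, n: int):
    """Policy: pay in the first k of n rounds, refuse afterwards.

    Returns (expected total value, worst-case total loss if every round
    happens to land in the 'asked' branch).
    """
    assert 0 <= k <= n
    ev_per_paying_round = P_ASKED * (-COST) + (1 - P_ASKED) * REWARD
    expected_total = k * ev_per_paying_round   # refused rounds contribute 0
    worst_case_loss = k * COST                 # all k paying rounds come up 'asked'
    return expected_total, worst_case_loss

for k in (0, 3, 10):
    ev, worst = evaluate(k, n=100)
    print(f"pay first {k:>2} rounds: expected value {ev:>8.0f}, "
          f"worst single-branch loss {worst}")
```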
Since a solution doesn’t exist in full generality, I think we should pivot to more concrete work related to the “content” (our particular human priors and our particular environment) instead of the “formalism”.
This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I’m using in my arguments. I’m more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I’m interested in doing work to help navigate is the tiling problem.
That is: in this case at least it seems like there’s concrete reason to believe we can have some cake and eat some too.
I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can’t do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, which increases your utility in some worlds) and updateless in others (thus obtaining useful strategic coherence, which increases your utility in other worlds). The fundamental dichotomy remains just as sharp, and it’s misleading to imply we can surmount it. It’s great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I’ve felt this was obscured in many relevant conversations.
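To put some made-up numbers on it, here is a minimal sketch, assuming each instance is worth one “unit” to whichever of the two goods you assign it to:

```python
# Toy illustration of the dichotomy across N identical decision problems.
# Utilities are made-up assumptions: an instance handled updatefully is worth
# U_VOI in worlds where information pays off; an instance handled updatelessly
# is worth U_COH in worlds where strategic coherence pays off. No single
# instance contributes to both totals.

U_VOI, U_COH, N = 1.0, 1.0, 5

for j in range(N + 1):                       # j = instances treated updatefully
    info_worlds = j * U_VOI                  # the other N - j instances add nothing here
    coherence_worlds = (N - j) * U_COH
    print(f"updateful in {j}/{N}: info-worlds {info_worlds:.1f}, "
          f"coherence-worlds {coherence_worlds:.1f}")

# Every choice of j (and any per-instance mixture) lies on the line x + y = N:
# the trade-off can be redistributed across instances, but not surmounted.
```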
This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I’m using in my arguments. I’m more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I’m interested in doing work to help navigate is the tiling problem.
My point is that the theoretical work you are shooting for is so general that it’s closer to “what sorts of AI designs (priors and decision theories) should always be implemented” than to “what sorts of AI designs should humans in particular, in this particular environment, implement”. And I think we won’t gain insights on the former, because there are no general solutions, due to fundamental trade-offs (“no-free-lunches”). I think we could gain many insights on the latter, but the methods better suited for that are less formal/theoretical and way messier: “eye-balling”, iterating.
I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can’t do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, which increases your utility in some worlds) and updateless in others (thus obtaining useful strategic coherence, which increases your utility in other worlds). The fundamental dichotomy remains just as sharp, and it’s misleading to imply we can surmount it. It’s great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I’ve felt this was obscured in many relevant conversations.
I don’t get your disagreement. If your view is that you can’t eat one cake and keep it too, and my view is that you can eat some cakes and keep other cakes, isn’t the obvious conclusion that these two views are compatible?
I would also argue that you can slice up a cake and keep some slices but eat others (this corresponds to mixed strategies), but this feels like splitting hairs rather than getting at some big important thing. My view is mainly about iterated situations (more than one cake).
Maybe your disagreement would be better stated in a way that didn’t lean on the cake analogy?
My point is that the theoretical work you are shooting for is so general that it’s closer to “what sorts of AI designs (priors and decision theories) should always be implemented” than to “what sorts of AI designs should humans in particular, in this particular environment, implement”. And I think we won’t gain insights on the former, because there are no general solutions, due to fundamental trade-offs (“no-free-lunches”). I think we could gain many insights on the latter, but the methods better suited for that are less formal/theoretical and way messier: “eye-balling”, iterating.
Well, one way to continue this debate would be to discuss the concrete promisingness of the pseudo-formalisms discussed in the post. I think there are some promising-seeming directions.
Another way to continue the debate would be to discuss theoretically whether theoretical work can be useful.
It sort of seems like your point is that theoretical work always needs to be predicated on simplifying assumptions. I agree with this, but I don’t think it makes theoretical work useless. My belief is that we should continue working to make the assumptions more and more realistic, but that the ‘essential picture’ is often preserved under this operation. (EG, Newtonian gravity and general relativity make most of the same predictions in practice. The Kolmogorov axioms vindicated a lot of earlier work on probability theory.)