It’s certainly not looking very likely (> 80%) that … in causal interactions [most superintelligences] can easily and “fresh-out-of-the-box” coordinate on Pareto optimality (like performing logical or value handshakes) without falling into commitment races.
What are some obstacles to superintelligences performing effective logical handshakes? Or equivalently, what are some necessary conditions that seem difficult to bring about, even for very smart software systems?
(My understanding of the term “logical handshake” is as a generalization of the technique from the Robust Cooperation paper. Something like “I have a model of the other relevant decision-makers, and I will enact my part of the joint policy ϕ if I’m sufficiently confident that they’ll all enact their part of ϕ.” Is that the sort of decision-procedure that seems likely to fall into commitment races?)
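As a minimal sketch of that decision procedure: the function names, the example credences, and the independence assumption below are all mine, for illustration only. The Robust Cooperation paper itself works with proofs about the other program rather than credences, and this sketch does not model the recursion of agents reasoning about each other.

```python
import math


def joint_confidence(credences):
    """Probability that every counterpart enacts their part of phi.

    Assumes (for simplicity) that the credences are independent,
    so the joint probability is just their product.
    """
    return math.prod(credences.values())


def enacts_phi(credences, threshold):
    """Enact my part of phi only if joint confidence clears the threshold."""
    return joint_confidence(credences) >= threshold


# Hypothetical credences about two counterparts.
print(enacts_phi({"bob": 0.99, "carol": 0.95}, threshold=0.9))
print(enacts_phi({"bob": 0.95, "carol": 0.90}, threshold=0.9))
```

The threshold is where the commitment-race worry would enter: how much an agent is willing to think in order to raise these credences is itself a decision.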
This is exactly the kind of procedure which might get hindered by commitment races, because it involves “thinking more about what the other agents will do”, and the point of commitment races is that sometimes (and depending on your beliefs) this can seem net-negative ex ante (that is, before actually doing the thinking).
Of course, this doesn’t prevent logical handshakes from being enacted sometimes. For example, if all agents start with a high enough prior on others enacting their part of ϕ, then they will do it. More realistically, it probably won’t be as easy as this, but if all agents feel safe enough thinking about ϕ (they deem it unlikely that this backfires into losing bargaining power), and/or the upside is sufficiently high (when multiplied by the probability and so on), then all agents will deem it net-positive to think more about ϕ and the others, and eventually they’ll implement it.
So it comes down to how likely we think it is that priors (or the equivalent thing for AIs) successfully fall into this coordination basin, as opposed to getting stuck at some earlier point without wanting to think more. And again, we have a few pro tanto reasons to expect coordination to be viable (and a few in the other direction). I do think that, out of my list of statements, logical handshakes in causal interactions might be one of the most likely ones.
To feed back, it sounds like “thinking more about what other agents will do” can be infohazardous to some decision theories. In the sense that they sometimes handle that sort of logical information in a way that produces worse results than if they didn’t have that logical information in the first place. They can sometimes regret thinking more.
It seems like it should always be possible to structure our software systems so that this doesn’t happen. I think this comes at the cost of not always best-responding to other agents’ policies.
In the example of Chicken, I think that looks like first trying to coordinate on a correlated strategy, like a 50/50 mix of (Straight, Swerve) and (Swerve, Straight). (First try to coordinate on a socially optimal joint policy.)
Supposing that failed, our software system could attempt to troubleshoot why, and discover that their counterpart has simply pre-committed to always going Straight. Upon learning that logical fact, I don’t think the best response is to best-respond, i.e. Swerve. If we’re playing True Chicken, it seems like in this case we should go Straight with enough probability that our counterpart regrets not thinking more and coordinating with us.
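A rough way to pin down “enough probability”, in notation I’m introducing just for this sketch: let W be the payoff for going Straight while the other Swerves, L the payoff for Swerving while the other goes Straight, and C the payoff for a crash, with W > L > C. Under the 50/50 correlated strategy each player expects (W + L)/2. If we go Straight with probability p against a counterpart committed to Straight, they regret the commitment when:

```latex
\begin{align*}
\mathbb{E}[u_{\text{counterpart}}] &= pC + (1-p)W \\
pC + (1-p)W < \frac{W+L}{2}
  &\iff p\,(W - C) > \frac{W - L}{2} \\
  &\iff p > \frac{W-L}{2\,(W-C)}
\end{align*}
```

With illustrative payoffs W = 1, L = −1, C = −10, the threshold is p > 1/11, so going Straight a bit more than 9% of the time would already make the commitment a bad deal for them.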
You’re right that (a priori and in the abstract) “bargaining power” fundamentally trades off against “best-responding”. That’s exactly the point of my post. This doesn’t rule out, though, a lot of pragmatic and realistic improvements (because we know agents in our reality tend to think like this or like that), even if the theoretical trade-off can never be erased completely, in all situations and for all priors.
Your latter discussion is a normative one. And while I share your normative intuition that best-responding completely (being completely updateful) is not always the best thing to do in realistic situations, I do have quibbles with this kind of discourse (similar to this). For example, why would I want to go Straight even after I have learned the other does? Out of some terminal valuation of fairness, or counterfactuals, more than anything, I think (more here). Or similarly, why should I think that sticking to my notion of fairness will ex ante convince the other player to coordinate on it, as opposed to the other player trying to pull out some “even more meta” move, like punishing notions of fairness that are not close enough to theirs? Again, all of this will depend on our priors.
It sounds like we already mostly agree!
I agree with Caspar’s point in the article you linked: the choice of metric determines which decision theories score highly on it. The metric that I think points towards “going Straight sometimes, even after observing that your counterpart has pre-committed to always going Straight” is a strategic one. If Alice and Bob are writing programs to play open-source Chicken on their behalf, then there’s a program equilibrium where:
- Both programs first try to perform a logical handshake, coordinating on a socially optimal joint policy.
  - This only succeeds if they have compatible notions of social optimality.
- As a fallback, Alice’s program adopts a policy which
  - Caps Bob’s expected payoff at what Bob would have received under Alice’s notion of social optimality
    - Minus an extra penalty, to give Bob an incentive gradient to climb towards what Alice sees as the socially optimal joint policy
  - Otherwise maximizes Alice’s payoff, given that incentive-shaping constraint
- Bob’s fallback operates symmetrically, with respect to his notion of social optimality.
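Here is a minimal sketch of Alice’s fallback for the case where Bob’s program has legibly pre-committed to always going Straight. The payoff numbers, the penalty size, and all function names are assumptions I’m making for illustration; Alice’s notion of social optimality is taken to be the 50/50 mix of (Straight, Swerve) and (Swerve, Straight).

```python
from fractions import Fraction

# Hypothetical Chicken payoffs for illustration (not from the discussion).
WIN = Fraction(1)      # you go Straight, the other Swerves
LOSE = Fraction(-1)    # you Swerve, the other goes Straight
CRASH = Fraction(-10)  # both go Straight

# Bob's payoff under Alice's notion of social optimality:
# a 50/50 mix of (Straight, Swerve) and (Swerve, Straight).
BOB_FAIR_SHARE = (WIN + LOSE) / 2


def bob_payoff_if_committed(p_alice_straight):
    """Bob's expected payoff when he always goes Straight."""
    return p_alice_straight * CRASH + (1 - p_alice_straight) * WIN


def alice_payoff_vs_committed(p_alice_straight):
    """Alice's expected payoff against a Bob who always goes Straight."""
    return p_alice_straight * CRASH + (1 - p_alice_straight) * LOSE


def alice_fallback(penalty):
    """Probability that Alice goes Straight in her fallback.

    Caps Bob's expected payoff at BOB_FAIR_SHARE - penalty. Alice's own
    payoff falls as p rises, so the smallest p that meets the cap is also
    the one that maximizes her payoff under that constraint.
    """
    cap = BOB_FAIR_SHARE - penalty
    p = (WIN - cap) / (WIN - CRASH)
    return min(max(p, Fraction(0)), Fraction(1))


p = alice_fallback(penalty=Fraction(1, 10))
print(p, bob_payoff_if_committed(p), alice_payoff_vs_committed(p))
```

With these numbers Alice goes Straight with probability 1/10, which leaves a committed Bob with −1/10 instead of the 0 he would have gotten by coordinating. Alice pays for this too (−19/10 rather than −1 from always Swerving), which is exactly the cost of not best-responding.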
The motivating principle is to treat one’s choice of decision theory as itself strategic. If Alice chooses a decision theory which never goes Straight, after making the logical observation that Bob’s decision theory always goes Straight, then Bob’s best response is to pick a decision theory that always goes Straight and make that as obvious as possible to Alice’s decision theory.
Whereas if Alice designs her decision theory to grant Bob the highest payoff when his decision theory legibly outputs Bob’s part of ϕA (what Alice sees as a socially optimal joint policy), then Bob’s best response is to pick a decision theory that outputs Bob’s part of ϕA and make that as obvious as possible to Alice’s decision theory.
It seems like one general recipe for avoiding commitment races would be something like:
- Design your decision theory so that no information is hazardous to it
  - We should never be willing to pay in order to not know certain implications of our beliefs, or true information about the world
- Design your decision theory so that it is not infohazardous to sensible decision theories
  - Our counterparts should generally expect to benefit from reasoning more about us, because we legibly are trying to coordinate on good outcomes and we grant the highest payoffs to those that coordinate with us
  - If infohazard resistance is straightforward, then our counterpart should hopefully have that reflected in their prior.
- Do all the reasoning you want about your counterpart’s decision theory
  - It’s fine to learn that your counterpart has pre-committed to going Straight. What’s true is already so. Learning this doesn’t force you to Swerve.
  - Plus, things might not be so bad! You might be a hypothetical inside your counterpart’s mind, considering how you would react to learning that they’ve pre-committed to going Straight.
    - Your actions in this scenario can determine whether it becomes factual or counterfactual. Being willing to crash into bullies can discourage them from trying to bully you into Swerving in the first place.
  - You might also discover good news about your counterpart, like that they’re also implementing your decision theory.
    - If this were bad news, like for commitment-racers, we’d want to rethink our decision theory.
I share the intuition that this lens (treating one’s choice of decision theory as itself strategic) is important. Indeed, there might be some important quantitative differences between
a) I have a well-defined decision theory, and am choosing how to build my successor
and
b) I’m doing some vague normative reasoning to choose a decision theory (like we’re doing right now),
but I think these differences are mostly contingent, and the same fundamental dynamics about strategicness are at play in both scenarios.
On “Design your decision theory so that no information is hazardous to it”: I think this is equivalent to your decision theory being dynamically stable (that is, its performance never improves by having access to commitments), and I’m pretty sure the only way to attain this is complete updatelessness (which is bad).
That said, again, it might well be that, given our prior, many parts of cooperation-relevant concept-space seem very safe to explore, and so “for all practical purposes” some decision procedures are basically completely safe, and we’re able to use them to coordinate with all agents (even if we haven’t “solved in all prior-independent generality” the fundamental trade-off between updatelessness and updatefulness).
Got it, I think I understand better the problem you’re trying to solve! It’s not just being able to design a particular software system and give it good priors, it’s also finding a framework that’s robust to our initial choice of priors.
Is it possible for all possible priors to converge on optimal behavior, even given unlimited observations? I’m thinking of Yudkowsky’s example of the anti-Occamian and anti-Laplacian priors: the more observations an anti-Laplacian agent makes, the further its beliefs go from the truth.
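To illustrate how more data can push such an agent further from the truth, here is a small sketch. The Laplacian estimate is the standard rule of succession; the anti-Laplacian rule below is just its mirror image, which is my own simplifying assumption rather than a definition taken from Yudkowsky’s writing.

```python
from fractions import Fraction


def laplace_estimate(successes, trials):
    """Laplace's rule of succession: P(next success) = (s + 1) / (n + 2)."""
    return Fraction(successes + 1, trials + 2)


def anti_laplace_estimate(successes, trials):
    """Illustrative anti-Laplacian rule (an assumption): the mirror image
    of the rule of succession, so each success counts as evidence that
    the next trial will fail."""
    return Fraction(trials - successes + 1, trials + 2)


# An event that has happened on every one of n trials so far.
for n in (0, 10, 1000):
    print(n, laplace_estimate(n, n), anti_laplace_estimate(n, n))
```

Both agents start at 1/2. As the unbroken run of successes grows, the Laplacian estimate approaches 1 while the anti-Laplacian estimate approaches 0.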
I’m also surprised that dynamic stability leads to suboptimal outcomes that are predictable in advance. Intuitively, it seems like this should never happen.
On whether all possible priors can converge on optimal behavior given unlimited observations: certainly not, in the most general case, as you correctly point out.
Here I was studying a particular case: updateless agents in a world that looks even remotely like the real world. And even more particularly: thinking about the kinds of priors that superintelligences created in the real world might actually have.
Eliezer believes that, in these particular cases, it’s very likely we will get optimal behavior (we won’t get trapped priors, nor commitment races). I disagree, and that’s what I argue in the post.
If by “predictable in advance” you mean “from the updateless agent’s prior”, then nope! Updatelessness maximizes EV from the prior, so it will do whatever looks best from this perspective. If that’s what you want, then updatelessness is for you! The problem is, we have many pro tanto reasons to think this is not a good representation of rational decision-making in reality, nor the kind of cognition that survives for long in reality. Because of considerations about “the world being so complex that your prior will be missing a lot of stuff”. And in particular, multi-agentic scenarios are something that makes this complexity sky-rocket.
Of course, you can say “but that consideration will also be included in your prior”. And that does make the situation better. But eventually your prior needs to end. And I argue that’s long before you have all the necessary information to confidently commit to something forever (but other people might disagree with this).
Got it, thank you!
It seems like trapped priors and commitment races are exactly the sort of cognitive dysfunction that updatelessness would solve in generality.
My understanding is that trapped priors are a symptom of a dysfunctional epistemology, which over-weights prior beliefs when updating on new observations. This results in an agent getting stuck, or even getting more and more confident in their initial position, regardless of what observations they actually make.
Similarly, commitment races are the result of dysfunctional reasoning that regards accurate information about other agents as hazardous. It seems like the consensus is that updatelessness is the general solution to infohazards.
My current model of an “updateless decision procedure”, approximated on a real computer, is something like “a policy which is continuously optimized, as an agent has more time to think, and the agent always acts according to the best policy it’s found so far.” And I like the model you use in your report, where an ecosystem of participants collectively optimize a data structure used to make decisions.
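As a minimal sketch of that “always act on the best policy found so far” model: the function name and the toy policies are hypothetical, and `expected_value` stands in for the fixed criterion (EV from the prior) used to score candidates.

```python
def anytime_policy_search(candidate_policies, expected_value):
    """Yield the best policy found so far after each step of thinking.

    candidate_policies: any iterable of policies, considered one per step.
    expected_value: fixed scoring function, e.g. EV from the agent's prior.
    """
    best_policy = None
    best_value = float("-inf")
    for policy in candidate_policies:
        value = expected_value(policy)
        if value > best_value:
            best_policy, best_value = policy, value
        # The agent may be asked to act at any point; this is what it uses.
        yield best_policy


# Toy example: policies are just numbers, scored by their own value.
for current_best in anytime_policy_search([3, 1, 4, 1, 5], lambda p: p):
    print(current_best)
```

Because the scoring function never changes, more thinking time can only replace the current policy with one that scores at least as well by that same criterion.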
Since updateless agents use a fixed optimization criterion for evaluating policies, we can use something like an optimization market to optimize an agent’s policy. It seems easy to code up traders that identify “policies produced by (approximations of) Bayesian reasoning”, which I suspect won’t be subject to trapped priors.
So updateless agents seem like they should be able to do at least as well as updateful agents. Because they can identify updateful policies, and use those if they seem optimal. But they can also use different reasoning to identify policies like “pay Paul Ekman to drive you out of the desert”, and automatically adopt those when they lead to higher EV than updateful policies.
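A toy version of the desert example, scored from the prior. The dollar amounts, the predictor accuracy, and the rule that the driver rescues you exactly when he predicts you will pay are all assumptions I’m making for illustration.

```python
from fractions import Fraction

# Hypothetical numbers for illustration.
VALUE_OF_RESCUE = 1_000_000     # how much being driven out of the desert is worth
PAYMENT = 100                   # what the driver asks for once you reach town
ACCURACY = Fraction(99, 100)    # how often the driver reads your policy correctly


def expected_value(policy):
    """EV from the prior, assuming the driver rescues you exactly when
    he predicts that you will pay."""
    if policy == "pay":
        # Rescued only when he correctly predicts that you pay.
        return ACCURACY * (VALUE_OF_RESCUE - PAYMENT)
    if policy == "refuse":
        # Rescued only when he wrongly predicts that you pay.
        return (1 - ACCURACY) * VALUE_OF_RESCUE
    raise ValueError(f"unknown policy: {policy}")


best = max(["pay", "refuse"], key=expected_value)
print(best, expected_value("pay"), expected_value("refuse"))
```

An agent that re-optimizes after already reaching town would refuse, since paying then looks like a pure loss. Scored from the prior, though, “pay” comes out at 989901 against 10000 for “refuse”, so an updateless optimizer that has both policies in its search space picks “pay”.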
I suspect that the generalization of updatelessness to multi-agent scenarios will involve optimizing over the joint policy space, using a social choice theory to score joint policies. If agents agree at the meta level about “how conflicts of interest should be resolved”, then that seems like a plausible route for them to coordinate on socially optimal joint policies.
I think this approach also avoids the sky-rocketing complexity problem, if I understand the problem you’re pointing to. (I think the problem you’re pointing to involves trying to best-respond to another agent’s cognition, which gets more difficult as that agent becomes more complicated.)