Off the top of my head, a (very unrealistic) scenario:
There’s a third world simulated by two disconnected agents who know about each other. There’s currently one variable to set in the description of this third world: the type of particle that will hit a planet in it in 1000 years and, depending on the type, color the sky either green or purple. Nothing else about the world simulation can be changed. This planet is full of people who don’t care about the sky color, but who really care about being connected to their loved ones and really wouldn’t want a version of their loved ones to exist in a disconnected world. Both agents care about the preferences of the people in this world, but the first agent likes it when people see more green and the second likes it when people see more purple. They would really want to coordinate on randomly setting a single color for both copies of the world, rather than splitting it into two disconnected versions.
It’s possible that the other agent was created by some different agent who has seen your source code and designed the other agent and its environment so that your coordination mechanism ends up picking their preferred color. Can you design a coordination mechanism that isn’t exploitable this way?
I dunno; my intuitions don’t really apply to distant and convoluted scenarios, so I’m deeply suspicious of any argument that I (or “an agent” that I care anything about) should do something non-obvious. I will say that this isn’t acausal: the decisions clearly have causal effects. It’s non-communicative, which IS a problem, but a very different one. Thomas Schelling has things to say on that topic, and it’s quite likely that randomness is the wrong mechanism here: finding ANY common knowledge can lead to a Schelling point that gives both agents a better chance of picking the same color.
I’m still very confused about the scenario. Agents A and B and their respective environments may have been designed as proxies by adversarial agents C and D respectively? Both C and D care about coordinating with each other more than they care about having the sky color match their preference? A can simulate B plus B’s environment, but can’t simulate D (and vice versa)? Presumably this means that D can no longer affect B or B’s environment, otherwise A wouldn’t be able to simulate B.
Critical information: Did either C or D know the design of the other’s proxy before designing their own? Did they both know the other’s design and settle on a mutually-agreeable pair of designs?
Assume you’re playing as agent A and that you don’t have a parent agent. You’re trying to coordinate with agent B. You want to not be exploitable, even if agent B has a parent that picked B’s source code adversarially. Consider this a very local/isolated puzzle (it’s not about trying to coordinate with all possible parents instead of with B).
Oh then no, that’s obviously not possible. The parent can choose agent B to be a rock with “green” painted on it. The only way to coordinate with a rock is to read what’s painted on it.
Agent B wants to coordinate with you instead of being a rock; the question isn’t “can you always coordinate”, it’s “is there any coordination mechanism robust to adversarially designed counterparties”.
Trivially, you can coordinate with agents that have identical architecture and differ only in their utility functions, by picking the first bit of a hash of the question you want to coordinate on.
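A minimal sketch of that trick (hypothetical Python; it assumes both agents agree on the exact question string and the hash function, which is part of the “identical architecture” assumption):

```python
import hashlib

def shared_hash_color(question: str) -> str:
    """Deterministically map a shared question to a color.

    Any two agents running this same procedure on the same string get the
    same answer, regardless of their utility functions, and neither side
    can bias the outcome without changing the question itself.
    """
    first_bit = hashlib.sha256(question.encode("utf-8")).digest()[0] >> 7
    return "green" if first_bit == 0 else "purple"

# Both agents hash the shared description of the decision:
print(shared_hash_color("Which single color should the sky be in the simulated world?"))
```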
Oh, then I’m still confused. Agent B can want to coordinate with A but still be effectively a rock, because they’re guaranteed to pick their designer’s preferred option no matter what they see. Since agent A can analyze B’s source code arbitrarily powerfully, they can determine this and realize that the only option (if they want to coordinate at all) is to go along with it.
A’s algorithm can include “if my opponent is a rock, defect”, but then we have different scenarios depending on whether B’s designer gets to see A’s source code before designing B.
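To make that branch concrete, a hypothetical sketch (the `is_rock` check is a toy stand-in for whatever arbitrarily powerful source analysis A actually does; here a “rock” is just an agent whose answer ignores everything it observes):

```python
import hashlib

def shared_hash_color(question: str) -> str:
    # Same deterministic fallback as in the earlier sketch.
    first_bit = hashlib.sha256(question.encode("utf-8")).digest()[0] >> 7
    return "green" if first_bit == 0 else "purple"

def is_rock(opponent) -> bool:
    # Toy stand-in for analyzing B's source code: probe the agent and
    # call it a rock if its answer never depends on what it sees.
    probes = ["", "green", "purple", "let's coordinate"]
    return len({opponent(p) for p in probes}) == 1

def choose_color(opponent, question: str, my_preferred_color: str) -> str:
    if is_rock(opponent):
        # "If my opponent is a rock, defect": ignore the painted-on answer
        # and take my own preference rather than reward B's designer.
        return my_preferred_color
    # Otherwise coordinate via the shared hash.
    return shared_hash_color(question)

# A rock that always answers "green" vs. an agent whose answer depends on input:
rock = lambda observation: "green"
echo = lambda observation: observation
print(choose_color(rock, "sky color?", "purple"))  # defects: "purple"
print(choose_color(echo, "sky color?", "purple"))  # coordinates via the hash
```

Of course, if B’s designer gets to see this exact procedure before designing B, they can presumably build something that passes the rock check while still reliably steering toward their color, which is why it matters who sees whose source code first.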