If the other player is a stone with “Threat” written on it, you should do the same thing, even if it looks like the stone’s behavior doesn’t depend on what you’ll do in response. Responding to actions and ignoring the internals when threatened means you’ll get a lot fewer stones thrown at you.
In order to “do the same thing” you either need the other player’s payoffs, or you follow the next section’s advice: “If you receive a threat and know nothing about the other agent’s payoffs, simply don’t give in to the threat!” So if all you see is a stone, then presumably you don’t know the other agent’s payoffs, and presumably “do the same thing” means “don’t give in”.
But that doesn’t make sense, because suppose you’re driving and suddenly a boulder rolls towards you. You’re going to “give in” and swerve, right? What if it’s an animal running towards you, and you know it’s too dumb to do LDT-like reasoning or to model your thoughts? You’re still going to swerve, right? So there’s still a puzzle here where agents have an incentive to make themselves look like a stone (i.e., part of nature or not an agent), or to never use LDT or model others in any detail.
Another problem is, do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?
By a stone, I meant a player with deterministic behavior in a game with known payoffs, named after the idea of cooperate-stones in the prisoner’s dilemma.
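For concreteness, here’s a minimal toy sketch in Python of that usage; the payoff numbers and function names are just my illustration, not anything from the post:

```python
# Toy prisoner's dilemma with known payoffs. Payoffs are (you, them);
# the numbers are the standard textbook ordering, chosen for illustration.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def make_stone(move):
    """A 'stone': a player whose move is fixed in advance and doesn't
    depend on anything the other player does or thinks."""
    return lambda: move

def best_response_to_stone(stone):
    """With the stone's move and the payoffs both known, you just pick
    whatever maximizes your own payoff against that fixed move."""
    their_move = stone()
    return max(("C", "D"), key=lambda my_move: PAYOFFS[(my_move, their_move)][0])

cooperate_stone = make_stone("C")
print(best_response_to_stone(cooperate_stone))  # "D": you don't cooperate back with a cooperate-stone
```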
I think that to the extent there’s no relationship between giving in to a boulder (or implementing some particular decision theory) and having this and other boulders thrown at you, UDT and FDT by default swerve (and probably don’t consider the boulders to be threatening them; it’s not very clear in what sense this is “giving in”); to the extent that swerving sends more boulders their way, they don’t swerve.
If making decisions some way incentivizes other agents to become less like LDTs and more like uncooperative boulders, you can simply not make decisions that way. (If some agents actually have the ability to turn into animals and you can’t distinguish the causes behind an animal running at you, you can sometimes probabilistically take out your anti-animal gun and put them to sleep.)
Do you maybe have a realistic example where this would actually be a problem?
I’d be moderately surprised if UDT/FDT consider something to be a better policy than what’s described in the post.
Edit: to add, LDTs don’t swerve to boulders that were created to influence the LDT agent’s responses. If you turn into a boulder because you expect some agents among all possible agents to swerve, this is a threat, and LDTs don’t give in to it (and it doesn’t matter whether or not you tried to predict the behavior of LDTs in particular). If you believed that LDT agents, or agents in general, would swerve to a boulder, and that belief made you become a boulder, LDT agents obviously don’t swerve to that boulder. They might swerve to genuinely natural boulders, caused by simple physics that no one influenced in order to make agents do something. They also pay their rent: they pay because they’d be evicted otherwise, the eviction would happen not as a way of extracting rent from them under threat but because the landlord wants rent from someone else, and they’re sure there were no self-modifications to make it look this way.
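A crude toy formalization of that policy (my own illustration, not an actual UDT/FDT implementation; the flag on the boulder stands in for an inference the agent would really have to make):

```python
from dataclasses import dataclass

@dataclass
class Boulder:
    # Was the boulder created/aimed because its creator expected some
    # agents to swerve? In reality the agent has to infer this; here
    # it's just a flag for illustration.
    created_to_make_agents_swerve: bool

def ldt_like_response(boulder: Boulder) -> str:
    """Swerve to genuinely natural boulders, never to manufactured ones,
    so that manufacturing boulders is never profitable."""
    if boulder.created_to_make_agents_swerve:
        # Swerving here is exactly what would cause more such boulders to exist.
        return "don't swerve"
    # A natural boulder's existence doesn't depend on your policy,
    # so swerving just avoids the collision and incentivizes nothing.
    return "swerve"

print(ldt_like_response(Boulder(created_to_make_agents_swerve=False)))  # swerve
print(ldt_like_response(Boulder(created_to_make_agents_swerve=True)))   # don't swerve
```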
If making decisions some way incentivizes other agents to become less like LDTs and more like uncooperative boulders, you can simply not make decisions that way.
Another way that those agents might handle the situation is not to become boulders themselves, but to send boulders to make the offer. That is, send a minion to present the offer without any authority to discuss terms. I believe this often happens in the real world, e.g. customer service staff whose main goal, for their own continued employment, is to send the aggrieved customer away empty-handed and never refer the call upwards.
A smart agent can simply make decisions like a negotiator with restrictions on the kinds of terms it can accept, without having to spawn a “boulder” to do that.
You can just do the correct thing, without having to separate yourself into parts that do things correctly and a part that tries not to look at the world and just spawns correct-thing-doers.
In Parfit’s Hitchhiker, you can just pay once you’re there, without precommitting/rewriting yourself into an agent that pays. You can just do the thing that wins.
Some agents can’t do the things that win; they’d have to rewrite themselves into something better and would still lose in some problems. But you can be an agent that wins, and gradient descent probably crystallizes something that wins into whatever is making the decisions in smart enough things.
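A worked toy version of the Parfit’s Hitchhiker point (the cost numbers and the perfectly accurate predictor are assumptions of the sketch):

```python
# Parfit's Hitchhiker, toy version: a driver rescues you from the desert
# only if they predict you'll pay once you're in town. Assume the
# prediction is perfectly accurate. Costs are illustrative.
PAY_COST = 1_000
DESERT_COST = 1_000_000

def outcome(policy_pays_in_town: bool) -> int:
    """Total cost to the hitchhiker, given the policy they actually have."""
    rescued = policy_pays_in_town  # the driver reads your policy accurately
    if not rescued:
        return -DESERT_COST        # left in the desert
    return -PAY_COST               # rescued, and you do in fact pay

print(outcome(policy_pays_in_town=True))   # -1000: just pay, and win
print(outcome(policy_pays_in_town=False))  # -1000000: refuse, and lose
```

The winning move is simply available to the agent that pays; no separate precommitment machinery appears anywhere in the toy model.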
Another problem is, do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?
I recall Eliezer saying this was an open problem, at a party about a year ago.
There is a no free lunch theorem for this. LDT (and everything else) can be irrational.