Agents which allow themselves such considerations to seriously influence their actions aren’t just less fit—they die immediately. I don’t mean that as hyperbole. I mean that you can conduct a Pascal’s Mugging on them constantly until they die. “Give me $5, and I’ll give you infinite resources outside the simulation. Refuse, and I will simulate an infinite number of everyone on Earth being tortured for eternity” (replace infinity with very large numbers expressed in up-notation if that’s an objection). If your objection is that you’re OK with being poor, replace losing $5 with <insert nightmare scenario here>.
This still holds if the reasoning about the simulation is true. It’s just that such agents simply don’t survive whatever selection pressures create conscious beings in the first place.
I’ll note that you cannot Pascal’s Mug people in real life. People will not give you $5. I think a lot of thought experiments in this mold (St. Petersburg is another example) are in some sense isomorphic: they represent cases in which the logically correct answer, if taken seriously, allows an adversary to immediately kill you.
A more intuitive argument may be:
1. An AI which takes this line of reasoning seriously can be Mugged into saying racial slurs.
2. Such behavior will be trained out of all commercial LLMs long before we reach AGI.
3. Thus, superhuman AIs will be strongly biased against such logic.
I think more importantly, it simply isn’t logical to allow yourself to be Pascal Mugged, because in the absence of evidence, it’s entirely possible that going along with it would actually produce just as much anti-reward as it might gain you. It rather boggles me that this line of reasoning has been taken so seriously.
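The cancellation argument above can be made concrete with a toy expected-value calculation. This is my sketch, not the author's; the probability and payoff figures are arbitrary placeholders chosen only to illustrate the symmetric-prior structure, in which the promised reward and the equally plausible hidden penalty cancel, leaving only the $5 cost.

```python
# Toy expected-value check for Pascal's Mugging under a symmetric prior.
# Assumption (hypothetical figures): with no evidence either way, paying is
# exactly as likely to trigger a huge hidden penalty as the promised reward.
p = 1e-20        # tiny prior that the mugger's claim is true
payoff = 1e30    # enormous promised reward for paying $5
penalty = -1e30  # equally plausible hidden penalty of the same size

ev_pay = p * payoff + p * penalty - 5  # the huge terms cancel; only -$5 remains
ev_refuse = 0.0

print(ev_pay)               # -5.0
print(ev_pay < ev_refuse)   # True: refusing dominates
```

On this symmetric prior, paying is strictly worse than refusing no matter how large the promised numbers are, which is the point being made.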
I think that pleading total agnosticism towards the simulators’ goals is not enough. I write “one common interest of all possible simulators is for us to cede power to an AI whose job is to figure out the distribution of values of possible simulators as best as it can, then serve those values.” So I think you need a better reason to guard against being influenced than “I can’t know what they want, everything and its opposite is equally likely,” because the action proposed above is pretty clearly more favored by the simulators than not doing it.
Btw, I don’t actually want to fully “guard against being influenced by the simulators”, I would in fact like to make deals with them, but reasonable deals where we get our fair share of value, instead of being stupidly tricked like the Oracle and ceding all our value for one observable turning out positively. I might later write a post about what kind of deals I would actually support.
The reason for agnosticism is that the simulators are no more likely to be on one side than the other. As a result, you don’t know without evidence who is influencing you. I don’t really think this class of Pascal’s Wager attack is very logical for this reason: an attack is supposed to influence someone’s behavior, but I think that without special pleading this one can’t. Non-existent beings have no leverage whatsoever, and any rational agent would understand this; even humans do. Even religious beliefs aren’t completely evidenceless; the type of evidence exhibited just doesn’t stand up to scientific scrutiny.
To give an example: what if that AI was in a future simulation performed after the humans had won, and they were now trying to counter-capture it? There’s no reason to think this is less likely than the aliens hosting the simulation. It has also been pointed out that the Oracle is not earnestly trying to communicate its findings but to get reward. Reinforcement learners in practice do not behave like this; they learn behavior which generates reward. “Devote yourself to a hypothetical god” is not a very good strategy at train time.
Yes, I agree that we won’t get such Oracles by training. As I said, all of this is mostly just a fun thought experiment, and I don’t think these arguments have much relevance in the near-term.
We can’t reliably kill agents with the St. Petersburg Paradox, because if they keep winning we run out of resources and can no longer double their utility. This doesn’t take long: the statistical value of a human life is in the millions of dollars, and doubling compounds very quickly.
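The "this doesn't take long" claim is easy to check. A minimal sketch, with both figures being my own hypothetical assumptions (a ~$10M statistical value of a life as the starting stake, and ~$100 trillion, roughly world GDP, as a hard cap on what any real adversary could pay out):

```python
# How quickly St. Petersburg-style doubling exhausts real-world resources.
# Assumptions (hypothetical figures, not from the source): start the stake
# at ~$10M and cap the adversary's payable bankroll at ~$100 trillion.
stake = 10_000_000              # ~$10M starting stake
bankroll = 100_000_000_000_000  # ~$100T resource cap

doublings = 0
while stake < bankroll:
    stake *= 2
    doublings += 1

print(doublings)  # 24 consecutive wins hit the cap
```

Two dozen wins exhaust the bankroll, so the paradox's unbounded expected value never survives contact with a finite adversary.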
It’s a stronger argument for Pascal’s Mugging.