While I agree that the algorithm might output 5, I don’t share the intuition that it’s something that wasn’t ‘supposed’ to happen, so I’m not sure what problem it was meant to demonstrate. I thought of a few ways to interpret it, but I’m not sure which one, if any, was the intended interpretation:
a) The algorithm is defined to compute argmax, but it doesn’t output argmax because of false antecedents.
- but I would say that it’s not actually defined to compute argmax, therefore the fact that it doesn’t output argmax is not a problem.
b) Regardless of the output, the algorithm uses reasoning from false antecedents, which seems nonsensical from the perspective of someone who uses intuitive conditionals, and this impedes its reasoning.
- it may indeed seem nonsensical, but if ‘seeming nonsensical’ doesn’t actually impede its ability to select the action with the highest utility (when it’s actually defined to compute argmax), then I would say that it’s also not a problem. Furthermore, wouldn’t MUDT be perfectly satisfied with the tuple p1: (x=0, y=10, A()=10, U()=10)? It also uses ‘nonsensical’ reasoning ‘A()=5 ⇒ U()=0’ but still outputs the action with the highest utility.
c) Even when the use of false antecedents doesn’t impede its reasoning, the way it arrives at its conclusions is counterintuitive to humans, which means that we’re more likely to make a catastrophic mistake when reasoning about how the agent reasons.
- Maybe? I don’t have access to other people’s intuitions, but when I read the example, I didn’t have any intuitive feeling of what the algorithm would do, so instead I just calculated all assignments (x,y) ∈ {0,5,10}², eliminated all the inconsistent ones, and proceeded from there (a sketch of this check appears right after this list). And this issue wouldn’t be unique to false antecedents; there are other perfectly valid pieces of logic that might nonetheless seem counterintuitive to humans, for example the blue-eyed islanders puzzle.
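For concreteness, here is a sketch of that brute-force check. It assumes a particular reading of the snippet (the agent searches for a provable pair “A()=5 ⇒ U()=x”, “A()=10 ⇒ U()=y” and outputs 5 iff x > y, while the environment pays 5 for taking 5 and 10 for taking 10); if the actual snippet differs, the surviving assignments may differ too.

```python
from itertools import product

VALUES = (0, 5, 10)

def action(x, y):
    # Assumed decision rule: output 5 iff the believed pair has x > y.
    return 5 if x > y else 10

def utility(a):
    # Assumed environment: taking 5 pays 5, taking 10 pays 10.
    return 5 if a == 5 else 10

def consistent(x, y):
    # Treat "A()=5 ⇒ U()=x" and "A()=10 ⇒ U()=y" as material implications,
    # evaluated against the action the rule actually takes and its payoff.
    a = action(x, y)
    u = utility(a)
    holds_5 = (a != 5) or (u == x)    # "A()=5 ⇒ U()=x"
    holds_10 = (a != 10) or (u == y)  # "A()=10 ⇒ U()=y"
    return holds_5 and holds_10

for x, y in product(VALUES, repeat=2):
    if consistent(x, y):
        a = action(x, y)
        print(f"x={x}, y={y}, A()={a}, U()={utility(a)}")
```

Under these assumptions the consistent assignments are (x=5, y=0), where the rule outputs 5, and the pairs with y=10 (including p1 above), where it outputs 10; which of them the actual proof search finds first is a separate question that this enumeration doesn’t answer.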
Yet, we reason informally from false antecedents all the time, EG thinking about what would happen if…
When I try to examine my own reasoning, I find that I’m just selectively blind to certain details and so don’t notice any problems. For example: suppose the environment calculates “U=10 if action = A; U=0 if action = B” and I, being a utility maximizer, am deciding between actions A and B. Then I might imagine something like “I chose A and got 10 utils”, and “I chose B and got 0 utils”—ergo, I should choose A.
But if I thought more deeply about the second case, I would also think “hm, because I’m determined to choose the action with the highest reward, I would not choose B. And yet I chose B. This is logically impossible! OH NO THIS TIMELINE IS INCONSISTENT!”—so I couldn’t actually coherently reason about what could happen if I chose B. And yet I would still be left with the only consistent timeline, the one where I choose A, which I would promptly follow, and get my maximum of 10 utils.
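As a cartoon of this (the names and the ‘consistency filter’ below are just my illustration of the reasoning above, not a formal proposal):

```python
# Toy environment from the example above: action A pays 10, action B pays 0.
PAYOFF = {"A": 10, "B": 0}

def naive_choice():
    # "Selectively blind" reasoning: imagine each action in isolation,
    # ignoring everything I know about my own decision procedure.
    imagined = {act: PAYOFF[act] for act in PAYOFF}
    return max(imagined, key=imagined.get)

def self_reflective_choice():
    # Reasoning with self-knowledge: a "timeline" in which a known
    # utility maximizer takes a sub-maximal action is inconsistent and
    # gets eliminated instead of evaluated; only the A-timeline survives.
    best = max(PAYOFF.values())
    consistent_timelines = [act for act, u in PAYOFF.items() if u == best]
    return consistent_timelines[0]

print(naive_choice(), self_reflective_choice())  # -> A A
```

Both routes pick A; the difference is only whether the B-timeline is imagined and scored, or noticed to be impossible and discarded.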
The problem is also “solved” if the agent thinks only about the environment, ignoring its knowledge about its own source code.
The idea of reversing the outputs and taking the assignment that is valid for both versions of the algorithm seemed to me closer to the notion of “but what would actually happen if you actually acted differently”, i.e. avoiding seemingly nonsensical reasoning while preserving self-reflection. But I’m not sure when, if ever, this principle can be generalized.
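To spell out what I mean (using the same assumed decision rule and payoffs as in the sketch above; the ‘swapped’ rule is my guess at how to formalize ‘reversing the outputs’):

```python
from itertools import product

VALUES = (0, 5, 10)

def utility(a):
    # Assumed environment, as before: taking 5 pays 5, taking 10 pays 10.
    return 5 if a == 5 else 10

def consistent(x, y, rule):
    # "A()=5 ⇒ U()=x" and "A()=10 ⇒ U()=y" as material implications,
    # evaluated in the world where `rule` is the agent's decision rule.
    a = rule(x, y)
    u = utility(a)
    return ((a != 5) or (u == x)) and ((a != 10) or (u == y))

original = lambda x, y: 5 if x > y else 10   # assumed original rule
swapped = lambda x, y: 10 if x > y else 5    # same rule with outputs reversed

for x, y in product(VALUES, repeat=2):
    if consistent(x, y, original) and consistent(x, y, swapped):
        print(f"x={x}, y={y}")  # only x=5, y=10 survives
```

Under these assumptions, the only assignment consistent with both versions is x=5, y=10, i.e. ‘taking 5 gets 5, taking 10 gets 10’, which is exactly the intuitive counterfactual.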
While I agree that the algorithm might output 5, I don’t share the intuition that it’s something that wasn’t ‘supposed’ to happen, so I’m not sure what problem it was meant to demonstrate.
OK, this makes sense to me. Instead of your (a) and (b), I would offer the following two useful interpretations:
1: From a design perspective, the algorithm chooses 5 when 10 is better. I’m not saying it has “computed argmax incorrectly” (as in your (a)); an agent design isn’t supposed to compute argmax (argmax would be insufficient to solve this problem, because we’re not given the problem in the format of a function from our actions to scores), but it is supposed to “do well”. The usefulness of the argument rests on the weight of “someone might code an agent like this by accident, if they’re not familiar with spurious proofs”. Indeed, that’s the origin of this code snippet—something like this was seriously proposed at some point.
2: From a descriptive perspective, the code snippet is not a very good description of how humans would reason about a situation like this (for all the same reasons).
When I try to examine my own reasoning, I find that I’m just selectively blind to certain details and so don’t notice any problems. For example: suppose the environment calculates “U=10 if action = A; U=0 if action = B” and I, being a utility maximizer, am deciding between actions A and B. Then I might imagine something like “I chose A and got 10 utils”, and “I chose B and got 0 utils”—ergo, I should choose A.
Right, this makes sense to me, and is an intuition which I think many people share. The problem, then, is to formalize how to be “selectively blind” in an appropriate way such that you reliably get good results.