Toy example. There are 3 possible AIs. For simplicity, assume a prior where each has a 1⁄3 chance of existing: (1) the one that is honest, (2) the one that will try to deceive you but be detected, and (3) the one that will deceive you well enough not to be detected.
We either detect deception or we do not (I agree this is binary). If we detect deception, we can rule out options (1) and (3), and so update to believing we have AI (2) (with probability 1). If we do not detect deception, we rule out option (2), and (if the test were perfect) we would update to a 50⁄50 distribution over options (1) and (3), so the probability we assign to option (3) has increased from 1⁄3 to 1⁄2.
State: probabilities over (1), (2), (3)
Before test: 1⁄3, 1⁄3, 1⁄3
If test reveals deception: 0, 1, 0
If no deception detected: 1⁄2, 0, 1⁄2
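For concreteness, here is a minimal Python sketch of that update (the hypothesis labels are mine, and the likelihoods encode the assumption that a perfect test flags AI (2) with certainty and never flags (1) or (3)):

```python
# Toy Bayesian update: three possible AIs, uniform prior.
prior = {"honest": 1/3, "deceptive_caught": 1/3, "deceptive_undetected": 1/3}
# Assumed (perfect-test) likelihood that the test flags deception under each hypothesis.
p_flag = {"honest": 0.0, "deceptive_caught": 1.0, "deceptive_undetected": 0.0}

def posterior(prior, likelihood):
    """Multiply prior by likelihood and normalise."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

# Test reveals deception: all mass goes to AI (2).
print(posterior(prior, p_flag))                                  # 0, 1, 0
# No deception detected: mass splits evenly over AIs (1) and (3).
print(posterior(prior, {h: 1 - p for h, p in p_flag.items()}))   # 1/2, 0, 1/2
```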
I agree that the opening sentence that localdiety quotes above, taken alone, is highly suspect. It does fall foul of the rule, as it does not draw a distinction between options (2) and (3) - both are labelled “deceptive”*. However, in its wider context the article is, I think, making a point like the one outlined in my toy example. The fact that seeing no deception narrows us down to options (1) and (3) is the context for the discussion about priors, scaling laws, and so on in the rest of the original post. (See the two options under “There are at least two ways to interpret this:” in the main post - those two ways are options (1) and (3).)
* Clumping the two “deceptive” options together, the conservation holds fine. The probability of having either (2) or (3) was initially 2⁄3. After the test it is either 1 or 1⁄2, depending on the outcome. The test has a 1⁄3 chance of gaining us 1⁄3 of certainty of deception, and a 2⁄3 chance of losing us 1⁄6 of certainty. So the conservation works out if you look at it in the binary way. But I think the context for the post is that what we really care about is whether we have option (3) or not, and a lack of detected deception (on this simplistic view) increases the odds of (3).
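A quick numerical check of that footnote’s arithmetic (under the same assumed perfect test), showing the expected posterior probability of “deceptive” equals the 2⁄3 prior:

```python
# Conservation of expected evidence for the binary "deceptive or not" question.
p_flag = 1/3              # chance the test reveals deception
post_if_flag = 1.0        # P(deceptive | deception detected)
post_if_no_flag = 1/2     # P(deceptive | no deception detected)
expected_posterior = p_flag * post_if_flag + (1 - p_flag) * post_if_no_flag
assert abs(expected_posterior - 2/3) < 1e-12   # equals the 2/3 prior
```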
Makes sense. Thanks for the clarification!