I think that you are slightly mishandling the law of conservation of expected evidence by making something non-binary into a binary question.
Say, for example, that the AI could be in one of three states: (1) completely honest, (2) partly effective at deception, or (3) very adept at deception. If we see no signs of deception, that rules out option (2). It is easy to see how ruling out option (2) might increase our probability weight on either or both of options (1) and (3).
[Interestingly enough in the linked “law of conservation of expected evidence” there is something I think is an analogous example to do with Japanese internment camps.]
Thinking we’ve seen deception is an event in the probability space, and so is its negation. These two events partition the space. Binary event partition. You’re talking about three hypotheses. So, I don’t see how this answers localdeity’s point. This indeed runs afoul of conservation of expected evidence. (And I just made up some numbers and checked the posterior, and it indeed followed localdeity’s point.)
If not, can you give me a prior probability distribution on the three hypotheses, the probabilities they assign to the event “I don’t think I saw deception”, and then show that the posterior increases whether or not that event happens, or its complement?
Toy example. There are three possible AIs. For simplicity, assume a prior where each has a 1⁄3 chance of existing: (1) the one that is honest, (2) the one that will try to deceive you but be detected, and (3) the one that will deceive you, but do so well enough to avoid detection.
We either detect deception or we do not (I agree this is binary). In the event we detect deception we can rule out options (1) and (3), and thus update to believing we have AI number (2) (with probability 1). In the event we do not detect deception we rule out option (2), and thus (if the evidence was perfect) we would update to a 50⁄50 distribution over options (1) and (3) - so that the probability we assign to option (3) has increased from 1⁄3 to 1⁄2.
State: probabilities over (1), (2), (3)
Before test: 1⁄3, 1⁄3, 1⁄3
If test reveals deception: 0, 1, 0
If no deception detected: 1⁄2, 0, 1⁄2
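The update in the table above can be checked with a few lines of code. This is a minimal sketch, not anything from the original post: it assumes a perfect detector that fires if and only if the AI is of type (2).

```python
# Toy example: three AI types with a uniform prior, and a perfect
# detector that fires only for type (2), the detectable deceiver.

def posterior(prior, likelihood):
    """Bayes update: P(h | e) is proportional to P(e | h) * P(h)."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

prior = [1/3, 1/3, 1/3]              # (1) honest, (2) detectable deceiver, (3) undetectable deceiver
p_detect = [0.0, 1.0, 0.0]           # P(detect deception | hypothesis)
p_no_detect = [1 - p for p in p_detect]

print(posterior(prior, p_detect))    # [0.0, 1.0, 0.0]
print(posterior(prior, p_no_detect)) # [0.5, 0.0, 0.5]
```

As claimed, "no deception detected" moves the probability of option (3) from 1⁄3 up to 1⁄2.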
I agree that the opening sentence that localdeity quotes above, taken alone, is highly suspect. It does fall foul of the rule, as it does not draw a distinction between options (2) and (3) - both are labelled “deceptive”*. However, in its wider context the article is, I think, making a point like the one outlined in my toy example. The fact that seeing no deception narrows us down to options (1) and (3) is the context for the discussion about priors and scaling-laws and so on in the rest of the original post. (See the two options under “There are at least two ways to interpret this:” in the main post—those two ways are options (1) and (3).)
* Clumping the two “deceptive” ones together, the conservation holds fine. The probability of having either (2) or (3) was initially 2⁄3. After the test it is either 1 or 1⁄2, depending on the outcome. The test has a 1⁄3 chance of gaining us a 1⁄3 certainty of deception, and a 2⁄3 chance of losing us a 1⁄6 certainty of deception. So the conservation works out, if you look at it in the binary way. But I think the context for the post is that what we really care about is whether we have option (3) or not, and the lack of deception detected (in the simplistic view) increases the odds of (3).
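The footnote's arithmetic can be verified directly. This is just a sketch of the numbers stated above: the expected posterior probability of "deceptive" (options (2) or (3) together) should equal the 2⁄3 prior, which is exactly what conservation of expected evidence requires.

```python
# Conservation check for the clumped "deceptive" hypothesis (2)-or-(3).

p_prior_deceptive = 2/3

p_detect = 1/3            # only AI (2) triggers the detector
post_if_detect = 1.0      # caught: certainly deceptive
post_if_clean = 1/2       # (2) ruled out; (1) and (3) remain equally likely

expected_posterior = p_detect * post_if_detect + (1 - p_detect) * post_if_clean
print(expected_posterior) # 0.666..., equal to the 2/3 prior
```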
Related: you can have different beliefs about AIs of different scales. Seeing that deception capabilities increase with scale should make you suspicious of larger models, even if the larger models themselves appear less capable of deception. Whereas it would be more reasonable to take “lack of deceptive capabilities” at face value in large models if you saw the same in models of all smaller sizes. (The way that falls through is if deceptive capabilities develop discontinuously, so that there’s a jump from nothing to fully capable of fooling you.)
Makes sense. Thanks for the clarification!