TurnTrout comments on Counting arguments provide no evidence for AI doom

TurnTrout 4 Mar 2024 16:04 UTC
LW: 9 AF: 4
> 1. I am very skeptical of hand-wavy arguments about simplicity that don’t have formal mathematical backing. This is a very difficult area to reason about correctly and it’s easy to go off the rails if you’re trying to do so without relying on any formalism.
I’m surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don’t (AFAICT) have much to do with the reality of neural networks. EG, your comments above:
> I would usually then make an argument here for why in most cases the simplest objective that leads to deception is simpler than the simplest objective that leads to alignment, but that’s just a simplicity argument, not a counting argument. Since we want to do the counting argument here, let’s assume that the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception.
Or the times you’ve talked about how there are “more” sycophants but only “one” saint.
> 1. There are many, many ways to adjust the formalism to take into account various ways in which realistic neural network inductive biases are different than basic simplicity biases. My sense is that most of these changes generally don’t change the bottom-line conclusion, but if you have a concrete mathematical model that you’d like to present here that you think gives a different result, I’m all ears.
This is a very strange burden of proof. It seems to me that you presented a specific model of how NNs work which is clearly incorrect, and instead of processing counterarguments that it doesn’t make sense, you want someone else to propose to you a similarly detailed model which you think is better. Presenting an alternative is a logically separate task from pointing out the problems in the model you gave.
- evhub 4 Mar 2024 19:56 UTC
  LW: 7 AF: 6
  
  > I’m surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don’t (AFAICT) have much to do with the reality of neural networks.
  
  The examples that you cite are from a LessWrong comment and a transcript of a talk that I gave. Of course when I’m presenting something in a context like that I’m not going to give the most formal version of it; that doesn’t mean that the informal hand-wavy arguments are the reasons why I believe what I believe.
  
  Maybe a better objection there would be: then why haven’t you written up anything more careful and more formal? Which is a pretty fair objection, as I note here. But alas I only have so much time and it’s not my current focus.
  - TurnTrout 5 Mar 2024 1:08 UTC
    LW: 10 AF: 5
    Yes, but your original comment was presented as explaining “how to properly reason about counting arguments.” Do you no longer claim that to be the case? If you do still claim that, then I maintain my objection that you yourself used hand-wavy reasoning in that comment, and it seems incorrect to present that reasoning as unusually formally supported.
    Another concern I have is, I don’t think you’re gaining anything by formality in this thread. As I understand your argument, I think your symbols are formalizations of hand-wavy intuitions (like the ability to “decompose” a network into the given pieces; the assumption that description length is meaningfully relevant to the NN prior; assumptions about informal notions of “simplicity” being realized in a given UTM prior). If anything, I think that the formality makes things worse because it makes it harder to evaluate or critique your claims.
    I also don’t think I’ve seen an example of reasoning about deceptive alignment where I concluded that formality had helped the case, as opposed to obfuscated the case or lent the concern unearned credibility.
    - evhub 5 Mar 2024 1:13 UTC
      LW: 3 AF: 2
      The main thing I was trying to show there is just that having the formalism prevents you from making logical mistakes in how to apply counting arguments in general, as I think was done in this post. So my comment is explaining how to use the formalism to avoid mistakes like that, not trying to work through the full argument for deceptive alignment.
      
      It’s not that the formalism provides really strong evidence for deceptive alignment, it’s that it prevents you from making mistakes in your reasoning. It’s like plugging your argument into a proof-checker: it doesn’t check that your argument is correct, since the assumptions could be wrong, but it does check that your argument is valid.
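
      A minimal sketch of the proof-checker analogy, assuming Lean 4 as the checker (the theorem names are hypothetical and chosen only for illustration; nothing here models deceptive alignment). The checker confirms that each conclusion follows from the stated assumptions. It does not confirm that the assumptions are true.

      ```lean
      -- The checker accepts this because Q really does follow from the two
      -- hypotheses. It says nothing about whether P or P → Q holds in reality.
      theorem modus_ponens (P Q : Prop) (hpq : P → Q) (hp : P) : Q :=
        hpq hp

      -- A false assumption (1 = 2) still yields a fully checked proof of a
      -- false conclusion (2 = 3): the derivation is fine, the premise is not.
      theorem from_false_premise (h : (1 : Nat) = 2) : (2 : Nat) = 3 :=
        absurd h (by decide)
      ```

      Both theorems type-check, which is the point of the analogy: passing the checker rules out a broken inference step but leaves the choice of assumptions entirely open to dispute.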
  - TurnTrout 4 Mar 2024 20:41 UTC
    LW: 4 AF: 3
    Do you believe that the cited hand-wavy arguments are, at a high informal level, sound reason for belief in deceptive alignment? (It sounds like you don’t, going off of your original comment which seems to distance yourself from the counting arguments critiqued by the post.)
    EDITed to remove last bit after reading elsewhere in thread.
    - evhub 4 Mar 2024 20:43 UTC
      LW: 9 AF: 7
      I think they are valid if interpreted properly, but easy to misinterpret.
      - TurnTrout 5 Mar 2024 1:13 UTC
        LW: 13 AF: 7
        I think you should allocate time to devising clearer arguments, then. I am worried that lots of people are misinterpreting your arguments and then making significant life choices on the basis of their new beliefs about deceptive alignment, and I think we’d both prefer for that to not happen.
        - evhub 5 Mar 2024 1:17 UTC
          LW: 3 AF: 2
          Were I not busy with all sorts of empirical stuff right now, I would consider prioritizing a project like that, but alas I expect to be too busy. I think it would be great if somebody else wanted to devote more time to working through the arguments in detail publicly, and I might encourage some of my mentees to do so.