I’m surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don’t (AFAICT) have much to do with the reality of neural networks.
The examples that you cite are from a LessWrong comment and a transcript of a talk that I gave. Of course when I’m presenting something in a context like that I’m not going to give the most formal version of it; that doesn’t mean that the informal hand-wavy arguments are the reasons why I believe what I believe.
Maybe a better objection there would be: then why haven’t you written up anything more careful and more formal? Which is a pretty fair objection, as I note here. But alas I only have so much time and it’s not my current focus.
Yes, but your original comment was presented as explaining “how to properly reason about counting arguments.” Do you no longer claim that to be the case? If you do still claim that, then I maintain my objection that you yourself used hand-wavy reasoning in that comment, and it seems incorrect to present that reasoning as unusually formally supported.
Another concern: I don’t think you’re gaining anything from formality in this thread. As I understand your argument, your symbols are formalizations of hand-wavy intuitions (like the ability to “decompose” a network into the given pieces; the assumption that description length is meaningfully relevant to the NN prior; the assumption that informal notions of “simplicity” are realized in a given UTM prior). If anything, I think the formality makes things worse, because it makes your claims harder to evaluate or critique.
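To gesture at the UTM-dependence worry concretely, here is a deliberately toy sketch (the “encodings” below are invented for illustration and stand in for different choices of universal machine): the same function gets a different description length, and hence a different prior mass, depending on which encoding you privilege.

```python
# Toy illustration (invented encodings, not any real formalism): the
# "description length" of the same function differs across encodings,
# so a 2^-length prior gives it different mass under different machines.

def description_length(func_name: str, encoding: dict) -> int:
    """Length in bits of the shortest codeword for func_name under encoding."""
    return min(len(code) for code, name in encoding.items() if name == func_name)

# Two made-up prefix-free encodings of the same three functions.
encoding_A = {"0": "identity", "10": "negate", "110": "constant"}
encoding_B = {"0": "constant", "10": "identity", "110": "negate"}

for label, enc in [("A", encoding_A), ("B", encoding_B)]:
    length = description_length("constant", enc)
    print(f"encoding {label}: {length} bits, prior mass 2^-{length} = {2**-length}")
# Encoding A charges "constant" 3 bits (mass 1/8); encoding B charges it
# 1 bit (mass 1/2). Which function counts as "simple" depends on the machine.
```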
I also don’t think I’ve seen an example of reasoning about deceptive alignment where I concluded that formality had helped the case, as opposed to obfuscating it or lending the concern unearned credibility.
The main thing I was trying to show there is just that having the formalism prevents you from making logical mistakes in how you apply counting arguments in general, mistakes of the sort I think this post makes. So my comment is explaining how to use the formalism to avoid mistakes like that, not trying to work through the full argument for deceptive alignment.
It’s not that the formalism provides really strong evidence for deceptive alignment; it’s that it prevents you from making mistakes in your reasoning. It’s like plugging your argument into a proof-checker: it doesn’t check that your conclusion is true, since the assumptions could be wrong, but it does check that your argument is valid.
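To make that concrete, here is a deliberately toy sketch (every number in it is invented for illustration, not drawn from any real analysis): a raw count of programs can point the opposite way from the prior mass once each program is weighted by 2^-length, which is exactly the kind of step the formalism forces you to get right.

```python
# Toy counting argument (all numbers invented): raw counts vs. prior mass
# under a 2^-length simplicity prior.
# Suppose one "aligned" program of length 5 bits, and 2^10 distinct
# "deceptive" programs, each of length 30 bits.

aligned_count, aligned_len = 1, 5
deceptive_count, deceptive_len = 2**10, 30

aligned_mass = aligned_count * 2**-aligned_len        # 2^-5
deceptive_mass = deceptive_count * 2**-deceptive_len  # 2^10 * 2^-30 = 2^-20

print(f"raw counts:  deceptive/aligned = {deceptive_count / aligned_count:.0f}")
print(f"prior mass:  aligned/deceptive = {aligned_mass / deceptive_mass:.0f}")
# Counting alone says deceptive programs win 1024 to 1; weighting by the
# prior says the single short program dominates by a factor of 2^15 = 32768.
```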
Do you believe that the cited hand-wavy arguments are, at a high informal level, a sound reason for belief in deceptive alignment? (It sounds like you don’t, going off your original comment, which seems to distance you from the counting arguments critiqued by the post.)
EDITED to remove the last bit after reading elsewhere in the thread.
I think they are valid if interpreted properly, but easy to misinterpret.
I think you should allocate time to devising clearer arguments, then. I am worried that lots of people are misinterpreting your arguments and then making significant life choices on the basis of their new beliefs about deceptive alignment, and I think we’d both prefer for that to not happen.
Were I not busy with all sorts of empirical stuff right now, I would consider prioritizing a project like that, but alas I expect to be too busy. I think it would be great if somebody else wanted to devote more time to working through the arguments in detail publicly, and I might encourage some of my mentees to do so.