I am very skeptical of hand-wavy arguments about simplicity that don’t have formal mathematical backing. This is a very difficult area to reason about correctly and it’s easy to go off the rails if you’re trying to do so without relying on any formalism.
I’m surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don’t (AFAICT) have much to do with the reality of neural networks. E.g., your comments above:
I would usually then make an argument here for why in most cases the simplest objective that leads to deception is simpler than the simplest objective that leads to alignment, but that’s just a simplicity argument, not a counting argument. Since we want to do the counting argument here, let’s assume that the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception.
Or the times you’ve talked about how there are “more” sycophants but only “one” saint.
There are many, many ways to adjust the formalism to take into account the various ways in which realistic neural network inductive biases differ from basic simplicity biases. My sense is that most of these changes generally don’t change the bottom-line conclusion, but if you have a concrete mathematical model that you’d like to present here that you think gives a different result, I’m all ears.
This is a very strange burden of proof. It seems to me that you presented a specific model of how NNs work which is clearly incorrect, and instead of processing counterarguments that it doesn’t make sense, you want someone else to propose to you a similarly detailed model which you think is better. Presenting an alternative is a logically separate task from pointing out the problems in the model you gave.
I’m surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don’t (AFAICT) have much to do with the reality of neural networks.
The examples that you cite are from a LessWrong comment and a transcript of a talk that I gave. Of course when I’m presenting something in a context like that I’m not going to give the most formal version of it; that doesn’t mean that the informal hand-wavy arguments are the reasons why I believe what I believe.
Maybe a better objection there would be: then why haven’t you written up anything more careful and more formal? Which is a pretty fair objection, as I note here. But alas I only have so much time and it’s not my current focus.
Yes, but your original comment was presented as explaining “how to properly reason about counting arguments.” Do you no longer claim that to be the case? If you do still claim that, then I maintain my objection that you yourself used hand-wavy reasoning in that comment, and it seems incorrect to present that reasoning as unusually formally supported.
Another concern I have is, I don’t think you’re gaining anything by formality in this thread. As I understand your argument, I think your symbols are formalizations of hand-wavy intuitions (like the ability to “decompose” a network into the given pieces; the assumption that description length is meaningfully relevant to the NN prior; assumptions about informal notions of “simplicity” being realized in a given UTM prior). If anything, I think that the formality makes things worse because it makes it harder to evaluate or critique your claims.
I also don’t think I’ve seen an example of reasoning about deceptive alignment where I concluded that formality had helped the case, as opposed to obfuscated the case or lent the concern unearned credibility.
The main thing I was trying to show there is just that having the formalism prevents you from making logical mistakes in how to apply counting arguments in general, as I think was done in this post. So my comment is explaining how to use the formalism to avoid mistakes like that, not trying to work through the full argument for deceptive alignment.
It’s not that the formalism provides really strong evidence for deceptive alignment; it’s that it prevents you from making mistakes in your reasoning. It’s like plugging your argument into a proof-checker: it doesn’t check that your argument is correct, since the assumptions could be wrong, but it does check that your argument is valid.
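To make the proof-checker analogy concrete, here is a minimal sketch in Lean 4. The propositions `P` and `C` and the theorem name are hypothetical placeholders introduced only for illustration; they are not drawn from any formalism discussed in this thread. The checker accepts the proof because the conclusion follows from the stated assumptions, and it says nothing about whether those assumptions are true of real neural networks.

```lean
-- `P` stands in for some assumption (e.g. a claim about inductive biases),
-- `C` for whatever the argument concludes. Both are illustrative placeholders.
theorem conclusion_follows (P C : Prop)
    (assumption : P)      -- taken as given; the checker never verifies it
    (step : P → C) :      -- the inference being checked
    C :=
  step assumption         -- accepted: C follows from P and P → C
```

If `assumption` is false in reality, the proof still type-checks, which is the sense in which a checker guards against reasoning mistakes without vouching for the premises.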
Do you believe that the cited hand-wavy arguments are, at a high informal level, sound reason for belief in deceptive alignment? (It sounds like you don’t, going off of your original comment, which seems to distance you from the counting arguments critiqued by the post.)
EDITed to remove last bit after reading elsewhere in thread.
I think they are valid if interpreted properly, but easy to misinterpret.
I think you should allocate time to devising clearer arguments, then. I am worried that lots of people are misinterpreting your arguments and then making significant life choices on the basis of their new beliefs about deceptive alignment, and I think we’d both prefer for that to not happen.
Were I not busy with all sorts of empirical stuff right now, I would consider prioritizing a project like that, but alas I expect to be too busy. I think it would be great if somebody else wanted to devote more time to working through the arguments in detail publicly, and I might encourage some of my mentees to do so.