my original 100:1 was a typo, where i meant 2^-100:1.
this number was in reference to ronny’s 2^-10000:1.
when ronny said:
“I’m like look, I used to think the chances of alignment by default were like 2^-10000:1”
i interpreted him to mean “i expect it takes 10k bits of description to nail down human values, and so if one is literally randomly sampling programs, they should naively expect 1:2^10000 odds against alignment”.
i personally think this is wrong, for reasons brought up later in the convo: namely, the relevant question is not how many bits it takes to specify human values relative to the python standard library; the relevant question is how many bits it takes to specify human values relative to the training observations.
but this was before i raised that objection, and my understanding of ronny’s position was something like “specifying human values (in full, without reference to the observations) probably takes ~10k bits in python, but for all i know it takes very few bits in ML models”. to which i was attempting to reply “man, i can see enough ways that ML models could turn out that i’m pretty sure it’d still take at least 100 bits”.
i inserted the hedge “in the very strongest sense” to stave off exactly your sort of objection; the very strongest sense of “alignment-by-default” is that you sample any old model that performs well on some task (without attempting alignment at all) and hope that it’s aligned (e.g. b/c maybe the human-ish way to perform well on tasks is the ~only way to perform well on tasks and so we find some great convergence); here i was trying to say something like “i think that i can see enough other ways to perform well on tasks that there’s e.g. at least ~33 knobs with at least ~10 settings such that you have to get them all right before the AI does something valuable with the stars”.
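to make the arithmetic behind the “at least 100 bits” figure concrete, here is a minimal sketch. it assumes (as an idealization for illustration only) that the knobs are independent and that every setting of a knob is equally likely; the function name `bits_required` is made up for this example.

```python
import math


def bits_required(num_knobs: int, settings_per_knob: int) -> float:
    """Bits needed to single out one joint configuration.

    Assumes independent knobs with equally likely settings, so the number
    of joint configurations is settings_per_knob ** num_knobs.
    """
    return num_knobs * math.log2(settings_per_knob)


# ~33 knobs with ~10 settings each, as in the paragraph above
bits = bits_required(33, 10)
print(f"{bits:.1f} bits")  # 33 * log2(10), a bit over 100
```

under those assumptions, 33 ten-way knobs come to roughly 110 bits, which is where a figure like 2^-100 comes from. any correlation between the knobs would shrink that number.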
this was not meant to be an argument that alignment actually has odds less than 2^-100, for various reasons, including but not limited to: any attempt by humans to try at all takes you into a whole new regime; there’s more than a 2^-100 chance that there’s some correlation between the various knobs for some reason; and the odds of my being wrong about the biases of SGD are greater than 2^-100 (case-in-point: i think ronny was wrong about the 2^-10000 claim, on account of the point about the relevant number being relative to the observations).
my betting odds would not be anywhere near as extreme as 2^-100, and i seriously doubt that ronny’s would ever be anywhere near as extreme as 2^-10000; i think his whole point in the 2^-10k example was “there’s a naive-but-relevant model that says we’re super-duper fucked; the details of it cause me to think that we’re not in particularly good shape (though obviously not to that same level of credence)”.
but even saying that is sorta buying into a silly frame, i think. fundamentally, i was not trying to give odds for what would actually happen if you randomly sample models, i was trying to probe for a disagreement about the difference between the number of ways that a computer program can be (weighted by length), and the number of ways that a model can be (weighted by SGD-accessibility).
“I don’t think there’s anything remotely resembling probabilistic reasoning going on here. I don’t know what it is, but I do want to point at it and be like ‘that! that reasoning is totally broken!’”
(yeah, my guess is that you’re suffering from a fairly persistent reading comprehension hiccup when it comes to my text; perhaps the above can help not just in this case but in other cases, insofar as you can use this example to locate the hiccup and then generalize a solution)