ehh it feels to me like i can get you more than 100:1 against alignment by default in the very strongest sense; i feel like my knowledge of possible mind architectures (and my awareness of stochastic gradient descent-accessible shortcut-hacks) rules out “naive training leads to friendly AIs”
probably more extreme than 2^-100:1, is my guess
What is the 2^-100:1 part intended to mean? Was it a correction to the 100:1 part or a different claim? Seems like an incredibly low probability.
Separately:
Ronny Fernandez
I’m like look, I used to think the chances of alignment by default were like 2^-10000:1
I still think I can do this if we’re searching over python programs
This seems straightforwardly insane to me, in a way that is maybe instructive. Ronny has updated from an odds ratio of 2^-10000:1 to one that is (implicitly) thousands of orders of magnitude different, which should essentially never happen. Ronny has just admitted to being more wrong than practically anyone who has ever tried to give a credence. And then, rather than being like “something about the process by which I generate 2^-10000:1 chances is utterly broken”, he just… made another claim of the same form?
I don’t think there’s anything remotely resembling probabilistic reasoning going on here. I don’t know what it is, but I do want to point at it and be like “that! that reasoning is totally broken!” And it seems to me that people who assign P(doom) > 90% are displaying a related (but far less extreme) phenomenon. (My posts about meta-rationality are probably my best attempt at actually pinning this phenomenon down, but I don’t think I’ve done a great job so far.)
my original 100:1 was a typo, where i meant 2^-100:1.
this number was in reference to ronny’s 2^-10000:1.
when ronny said:
I’m like look, I used to think the chances of alignment by default were like 2^-10000:1
i interpreted him to mean “i expect it takes 10k bits of description to nail down human values, and so if one is literally randomly sampling programs, they should naively expect 1:2^10000 odds against alignment”.
i personally think this is wrong, for reasons brought up later in the convo—namely, the relevant question is not how many bits it takes to specify human values relative to the python standard library; the relevant question is how many bits it takes to specify human values relative to the training observations.
but this was before i raised that objection, and my understanding of ronny’s position was something like “specifying human values (in full, without reference to the observations) probably takes ~10k bits in python, but for all i know it takes very few bits in ML models”. to which i was attempting to reply “man, i can see enough ways that ML models could turn out that i’m pretty sure it’d still take at least 100 bits”.
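(for concreteness, here’s a toy rendering of the two bit-counts at issue under the naive “length-weighted prior over programs” picture; the specific numbers below are placeholders i’m making up for illustration, not claims:)

```python
# under a length-weighted prior, a target that needs K extra bits of description
# gets roughly 2^-K of the prior mass, i.e. log2-odds of about -K against.
def log2_odds_against(extra_bits: int) -> int:
    return -extra_bits

# the two bit-counts being conflated (placeholder values, purely illustrative):
bits_given_python_stdlib = 10_000  # specify human values from scratch in python
bits_given_training_data = 1_000   # specify human values *given* the observations

print(log2_odds_against(bits_given_python_stdlib))  # -10000 (ronny's old number)
print(log2_odds_against(bits_given_training_data))  # -1000 (the comparison that actually
                                                    #  matters for a trained model)
```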
i inserted the hedge “in the very strongest sense” to stave off exactly your sort of objection; the very strongest sense of “alignment-by-default” is that you sample any old model that performs well on some task (without attempting alignment at all) and hope that it’s aligned (e.g. b/c maybe the human-ish way to perform well on tasks is the ~only way to perform well on tasks and so we find some great convergence); here i was trying to say something like “i think that i can see enough other ways to perform well on tasks that there’s e.g. at least ~33 knobs with at least ~10 settings such that you have to get them all right before the AI does something valuable with the stars”.
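(sanity check on that arithmetic, with the knob-counts above being the illustrative assumptions:)

```python
import math

knobs = 33     # illustrative number of independent ways the model could turn out
settings = 10  # illustrative number of settings per knob
bits_to_get_every_knob_right = knobs * math.log2(settings)
print(bits_to_get_every_knob_right)  # ~109.6 bits, i.e. odds steeper than 2^-100
```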
this was not meant to be an argument that alignment actually has odds less than 2^-100, for various reasons, including but not limited to: any attempt by humans to try at all takes you into a whole new regime; there’s more than a 2^-100 chance that there’s some correlation between the various knobs for some reason; and the odds of my being wrong about the biases of SGD are greater than 2^-100 (case-in-point: i think ronny was wrong about the 2^-10000 claim, on account of the point about the relevant number being relative to the observations).
my betting odds would not be anywhere near as extreme as 2^-100, and i seriously doubt that ronny’s would ever be anywhere near as extreme as 2^-10000; i think his whole point in the 2^-10k example was “there’s a naive-but-relevant model that says we’re super-duper fucked; the details of it cause me to think that we’re not in particularly good shape (though obviously not to that same level of credence)”.
but even saying that is sorta buying into a silly frame, i think. fundamentally, i was not trying to give odds for what would actually happen if you randomly sample models, i was trying to probe for a disagreement about the difference between the number of ways that a computer program can be (weighted by length), and the number of ways that a model can be (weighted by SGD-accessibility).
I don’t think there’s anything remotely resembling probabilistic reasoning going on here. I don’t know what it is, but I do want to point at it and be like “that! that reasoning is totally broken!”
(yeah, my guess is that you’re suffering from a fairly persistent reading comprehension hiccup when it comes to my text; perhaps the above can help not just in this case but in other cases, insofar as you can use this example to locate the hiccup and then generalize a solution)
(I obviously don’t speak for Ronny)
I’d guess this is kinda the within-model uncertainty: he had a model of “alignment” that said you needed to specify all 10,000 bits of human values, and so the odds of doing this by default/at random were 2^-10000:1. But this doesn’t contain the uncertainty that this model is wrong, which would make the within-model uncertainty a rounding error.
According to this model there is effectively no chance of alignment by default, but this model could be wrong.
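A quick sketch of why the within-model number washes out once model uncertainty is included (all numbers here are illustrative placeholders):

```python
# total probability of alignment-by-default, splitting on whether the model is right:
p_model_right = 0.9                # illustrative credence in the 10,000-bit model
p_align_if_right = 2.0 ** -10_000  # the within-model answer (underflows to 0.0 here,
                                   # which is rather the point)
p_align_if_wrong = 0.01            # illustrative catch-all for "the model is wrong"

p_alignment = (p_model_right * p_align_if_right
               + (1 - p_model_right) * p_align_if_wrong)
print(p_alignment)  # ~0.001, dominated entirely by the model-is-wrong branch
```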
If Ronny had said “there is one half-baked heuristic that claims that the probability is 2^-10000” then I would be sympathetic. That seems very different from what he said, though. In some sense my objection is precisely to people giving half-baked heuristics and intuitions an amount of decision-making weight far disproportionate to their quality, by calling them “models” and claiming that the resulting credences should be taken seriously.
I think that would be a more on-point objection to make on a single-author post, but this is a chat log between two people, optimized for communicating with each other, and as such it generally comes with fewer caveats and takes a bunch of implicit things for granted (this makes it well-suited for some kinds of communication, and not others). I like it in that it helps me get a much better sense of a bunch of underlying intuitions and hunches that are often hard to formalize and so rarely make it into posts, but I also think it is sometimes frustrating because it’s not optimized to be responded to.
I would take bets that Ronny’s position was always something closer to “I had this robust-seeming inside-view argument that claimed the probability was extremely low, though of course my outside view and different levels of uncertainty caused my betting odds to be quite different”.
I don’t see why it should rule out a high probability of doom that some folks who present themselves as having good epistemics are actually quite bad at picking up new models, stuck in an old, limiting paradigm, refusing to investigate new things properly because they believe they already know. It certainly does weaken appeals to their authority, but the reasoning stands on its own, to the degree it’s actually specified using valid and relevant formal claims.
I meant the reasonable thing other people knew I meant and not the deranged thing you thought I might’ve meant.
In retrospect I think the above was insufficiently cooperative. Sorry.