I struggle to recall another piece of technology that humans have built and yet understand less than AI models trained by deep learning. The statement that we have “no idea” seems completely appropriate. And I don’t think he’s trying to say that interpretability researchers are wasting their time by pointing out that state of affairs; the not knowing is why interpretability research is necessary in the first place.
I struggle to recall another piece of technology that humans have built and yet understand less than AI models trained by deep learning.
I agree that we often have a lot less knowledge about AI today than we’d like, but we do at least have partial knowledge, and in special cases we can even control what the AI knows.
That, however, is very much not the same as this claim:
The statement that we have “no idea” seems completely appropriate.
We know that this is not right, at least in its stronger form.
This implies two things about Eliezer’s epistemics on AI:
Eliezer can’t update well on evidence, especially evidence that cuts against doom (in this case it isn’t much evidence against doom, but calling it zero evidence is inaccurate).
Eliezer is badly overconfident about AI, and thus if he is very confident in a specific outcome, such as doom, we should expect that confidence to be driven largely by bias.
Eliezer can’t update well on evidence, especially evidence that cuts against doom (in this case it isn’t much evidence against doom, but calling it zero evidence is inaccurate).
I’ve noticed you repeating this claim in a number of threads, but I don’t think I’ve seen you present evidence sufficient to justify it. In particular, the last time I asked you about this, your response was basically premised on “I think current (weak) systems are going to analogize very well to stronger systems, and this analogy carries the weight of my entire argument.”
But if one denies the analogy (as I do, and as Eliezer presumably does), then that does indeed license him to update differently; in particular, it enables him to claim different conditional probabilities for the observations you put forth as evidence. You can’t (validly) criticize his updating procedure without first attacking that underlying point—which, as far as I can tell, boils down to essentially a matter of priors: you, for whatever reason, have a strong prior that experimental results from (extremely) weak systems will carry over to stronger systems, despite there being a whole host of informal arguments (many of which Eliezer made in the original Sequences) against this notion.
In summary: I disagree with the object-level claim, as well as the meta-level claim about epistemic assessment. Indeed, I would push strongly against interpreting mere disagreement as evidence of the irrationality of one’s opposition; that’s double-counting evidence. You have observed that someone disagrees with you, but until you know why they disagree, to immediately suggest, from there, that this disagreement must stem from incorrect updating procedure on their part, is to assume the conclusion.
While I agree that there are broader prior disagreements, I think that even if we isolate the question of whether Eliezer’s statement was correct, without baking in priors, the claim that we have no knowledge of AI because models are inscrutable piles of numbers is verifiably wrong. To put it in Eliezer’s words, it’s a locally invalid argument, and it is known to be false even without the broader prior disagreements.
One could honestly say the interpretability progress isn’t enough. One couldn’t honestly say, without massive ignorance, that interpretability didn’t progress at all, or that we know nothing about AI internals at all.
This is bad news for his epistemics, because it is a verifiably wrong statement that Eliezer keeps making without any caveats or qualifications.
That’s a big problem, because if Eliezer can confidently make a locally invalid argument about AI, and persistently repeat it, then it calls into question how well his epistemics are working on AI, and from my perspective there are really only bad conclusions to draw here.
It’s not just that Eliezer is wrong; it’s that he is persistently, confidently wrong about something that is actually verifiable, such that we can point out the wrongness.
interpretability didn’t progress at all, or that we know nothing about AI internals at all
No to the former, yes to the latter—which is noteworthy because Eliezer only claimed the latter. That’s not a knock on interpretability research; in fact Eliezer has repeatedly and publicly praised e.g. the work of Chris Olah and Distill. The choice to interpret the claim that we “know nothing about AI internals” as the claim that “no interpretability work has been done”, it should be pointed out, was a reading imposed by ShardPhoenix (and subsequently by you).
And in fact, we do still have approximately zero idea how large neural nets do what they do, interpretability research notwithstanding, as evinced by the fact that not a single person on this planet could code by hand whatever internal algorithms the models have learned. (The same is true of the brain, incidentally, which is why you sometimes hear people say “we have no idea how the brain works”, despite an insistently literal interpretation of this statement being falsified by the existence of neuroscience as a field.)
But it does, in fact, matter, whether the research into neural net interpretability translates to us knowing, in a real sense, what kind of work is going on inside large language models! That, ultimately, is the metric by which reality will judge us, not how many publications on interpretability were made (or how cool the results of said publications were—which, for the record, I think are very cool). And in light of that, I think it’s disingenuous to interpret Eliezer’s remark the way you and ShardPhoenix seem to be insisting on interpreting it in this thread.
And in fact, we do still have approximately zero idea how large neural nets do what they do, interpretability research notwithstanding, as evinced by the fact that not a single person on this planet could code by hand whatever internal algorithms the models have learned.
I now see where the problem lies. The basic issues with this argument are as follows:
The implied argument is that if you can’t create something in a field by hand yourself, you know nothing at all about the thing you are focusing on. That is straightforwardly untrue in a lot of fields.
For example, I know quite a lot about Borderlands 3 (not everything, but quite a bit), and I could even modify it using save editors or cheat tools with the help of video tutorials. But under almost no circumstances could I actually create Borderlands 3, even with the game and its code already in front of me, and even with a team.
This likely generalizes: neuroscience has real knowledge of the brain, but it is nowhere near the point where it could reliably create a human brain from scratch; knowing some things about what cars do is not enough to build a working car; and so on.
In general, I think the error is that you and Eliezer expect too much from partial knowledge. It helps, but in virtually no case will that knowledge alone allow you to create the thing you are focusing on.
It’s possible that our knowledge of AI internals isn’t enough, and that progress is too slow. I might agree or disagree with that, but at least it would be a reasonable claim. Right now, I’m seeing basic locally invalid arguments, and I think part of the problem is that you and Eliezer take too binary a view of knowledge, where you either have functionally perfect knowledge or no knowledge at all. Usually our knowledge is neither functionally perfect nor zero.
Edit: This seems conceptually similar to the P vs NP question, where verifying a solution and producing one are conjectured to have very different difficulties; essentially, my claim is that being able to verify something is not the same as being able to generate it.
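To make the verify-versus-generate point concrete, here is a minimal Python sketch (a toy subset-sum example of my own; the names verify and generate are illustrative, and nothing here is specific to neural networks): checking a proposed answer takes one cheap pass, while producing an answer from scratch can require searching an exponentially large space. The claim above is analogous: verifying facts about a network’s internals is a far weaker feat than being able to generate those internals by hand.

```python
from itertools import combinations

def verify(nums, subset, target):
    # Checking a proposed answer is cheap: one membership check and one sum.
    return all(x in nums for x in subset) and sum(subset) == target

def generate(nums, target):
    # Producing an answer from scratch is expensive: brute force over every
    # subset, which is exponential in len(nums) in the worst case.
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums, target = [3, 9, 8, 4, 5, 7], 15
print(verify(nums, [8, 7], target))  # True: easy to confirm
print(generate(nums, target))        # [8, 7]: found only by exhaustive search
```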
I expect that if you sat down with him and had a one-on-one conversation, you’d find that he does have nuanced views. I also expect that Eliezer realizes that there have been improvements in all of the areas you described. I think the difference comes down mostly to “Has there been sufficient progress in interpretability to avert disaster?” I’m confident his answer would be “No.”
So, given that belief, and having a chance now and then to communicate with a wide audience, it is better to have a clear message, because you never know what will be a zeitgeist tipping point. It’s the fate of the world at stake, so a little lost nuance is just collateral damage.
I don’t know if that matters, because whether he’s pegged to doom epistemically or strategically, the result is the same.
The transistor is a neat example. Imagine if, instead of developing them, we had said, “we need to stop here because we don’t understand EXACTLY how this works… and maybe for good measure we should bomb anyone who we think is continuing development, because it seems like transistors could be dangerous[1]”?
Claims that the software/networks are “unknown unknowns” which we have “no idea” about are patently false, inappropriate for “rational” discourse, and basically just hyperbolic rhetoric. And to dismiss with a wave how draconian regulation (functionally/demonstrably impossible, re: cloning) of these software enigmas would need to be, while advocating bombardment of rogue datacenters?!?
Frankly, I’m sad that it’s FUD that gets the likes here on LW, what with all it’s purported to be a bastion of.
[1] I know for a fact there will be a lot of heads here who think this would have been FANTASTIC, since without transistors we wouldn’t have created digital watches, which inevitably led to the creation of AI; the most likely outcome of which is inarguably ALL BIOLOGICAL LIFE ON EARTH DIES.
No, it’s not, because we have a pretty good idea of how transistors work, and in fact someone needed to directly anticipate how they might work in order to engineer them. The “unknown” part of deep learning models is not the network layer or the software that uses the inscrutable matrices; it’s how the model is getting the answers that it does.
Yes, it is, because it took something like five years to understand minority-carrier injection.
I think he’s referring to the understanding of the precise mechanics of how transistors worked, or why the particular first working prototypes functioned while all the others didn’t (just from skimming https://en.wikipedia.org/wiki/History_of_the_transistor).
That’s the current understanding for LLMs: people do know at a high level what an LLM does and why it works, just as there were theories about transistor function decades before working transistors existed. But the details of why this system works, when 50 other things that were tried didn’t, are not known.
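To illustrate the level of understanding being described, here is a hedged toy sketch (my own illustration, not any real model or library; VOCAB, W, next_token_probs, and generate are placeholder names, and the weights are random rather than trained): the scaffolding around a language model, i.e. the loop that feeds a token through the weights, applies a softmax, and samples the next token, is plain, well-understood code. The part nobody can articulate is what computation the trained weights themselves encode; the random matrix below just stands in for them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the learned parameters. In a real LLM these are billions
# of trained weights, and *what* computation they encode is the unknown part.
VOCAB = list("abcdefgh ")
W = rng.normal(size=(len(VOCAB), len(VOCAB)))

def next_token_probs(token_id):
    # The scaffolding is ordinary, well-understood code: index into the
    # weights, apply a softmax, and get a distribution over the next token.
    logits = W[token_id]
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(start="a", n=20):
    # The generation loop is equally transparent: sample a token, feed it
    # back in, repeat.
    token_id = VOCAB.index(start)
    out = [start]
    for _ in range(n):
        probs = next_token_probs(token_id)
        token_id = int(rng.choice(len(VOCAB), p=probs))
        out.append(VOCAB[token_id])
    return "".join(out)

print(generate())
```

In the real system the toy matrix is replaced by a trained transformer, but the surrounding loop and sampling stay roughly this simple, which is about what “knowing at a high level what an LLM does” amounts to.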