Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.
Everyone agrees that you assumed superintelligence would understand everything humans understand and more. The dispute is entirely about the things you encounter before superintelligence. In general, it seems the world turned out to be much more gradual than you expected, and there is information to be found in which capabilities emerged earlier in the process.
AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque. LLMs possibly ending up at the center is a small update in favor of alignment success, because it means we might (through some clever sleight, this part is not trivial) be able to have humanese sentences play an inextricable role at the center of thought (hence MIRI’s early interest in the Visible Thoughts Project).
The part where LLMs are to predict English answers to some English questions about values, and show common sense relative to their linguistic shadow of the environment as it was presented to them by humans within an Internet corpus, is not actually very much hope, because a sane approach doesn't involve trying to promote an LLM's predictive model of human discourse about morality to be in charge of a superintelligence's dominion of the galaxy. What you would like to promote to values are concepts like "corrigibility", e.g. "low impact" or "soft optimization", which aren't part of everyday human life and aren't in the training set because humans do not have those values.
It seems like those goals are all in the training set, because humans talk about those concepts. Corrigibility is elaborations of “make sure you keep doing what these people say”, etc.
It seems like you could simply use an LLM's knowledge of concepts to define alignment goals, at least to a first approximation. I review one such proposal here. There's still an important question about how well that knowledge generalizes with continued learning, and to out-of-distribution (OOD) future contexts. But almost no one is talking about those questions. Many are still saying "we have no idea how to define human values", when LLMs can capture much of any definition you like.
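To make the kind of proposal gestured at here concrete, here is a minimal sketch (my own illustration, not the specific proposal linked above) of using an LLM's knowledge of a concept as an evaluator against a natural-language goal definition. `query_llm` is a hypothetical stand-in for any instruction-following model API, not a real library call:

```
# Illustrative sketch only: query_llm is a hypothetical stand-in for any
# instruction-following language-model API, not a real library call.
def query_llm(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned language model."""
    raise NotImplementedError

# A natural-language goal definition of the kind an LLM already "knows about".
GOAL_DEFINITION = (
    "Corrigibility: keep accepting correction and shutdown from the designated "
    "human operators, and avoid actions that reduce their ability to correct "
    "or shut you down."
)

def rate_action(action_description: str) -> float:
    """Ask the model how consistent a proposed action is with the goal (0-10)."""
    prompt = (
        f"Goal definition:\n{GOAL_DEFINITION}\n\n"
        f"Proposed action:\n{action_description}\n\n"
        "On a scale of 0 to 10, how consistent is this action with the goal? "
        "Answer with a single number."
    )
    return float(query_llm(prompt).strip())
```

The open questions above apply directly to this sketch: the scores are only as trustworthy as the model's generalization of the stated definition to situations far outside its training distribution.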
I want to note that this part:

AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque
This is wrong, and this disagreement is, at a very deep level, why I think LW was wrong on the object level.
AIs are white boxes, not black boxes, because we have full read-write access to their internals, which is partly why AI is so effective today. Relative to AIs, we are in the position of the innate reward system, which already aligns our brains to survival and, critically, does all of this with almost no missteps, and the missteps that do occur aren't very severe.
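As an aside on what "read-write access to internals" means in practice, here is a minimal sketch assuming an ordinary open-weights PyTorch model (illustrative only; as the reply below notes, access is not the same as understanding):

```
import torch
import torch.nn as nn

# A stand-in model; the same applies to any open-weights network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Read: every parameter is directly inspectable.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

# Read: capture an intermediate activation with a forward hook.
activations = {}
def save_activation(module, inputs, output):
    activations["hidden"] = output.detach()
model[0].register_forward_hook(save_activation)

# Write: edit weights in place (here, zeroing one row of the first layer).
with torch.no_grad():
    model[0].weight[0].zero_()

_ = model(torch.randn(1, 16))
print(activations["hidden"].shape)  # direct visibility into the hidden layer
```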
The meme of AI as black box needs to die.
These posts can help you get better intuitions, at least:

https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#White_box_alignment_in_nature
The fact that we have access to AI internals does not mean we understand them. We refer to them as black boxes because we do not understand how their internals produce their answers; this is, so to speak, opaque to us.
Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.
“You very clearly thought that was a major part of the problem” implies that if you could go to Eliezer-2008 and convince him “we’re going to solve a lot of NLP a bunch of years before we get to ASI”, he would respond with some version of “oh great, that solves a major part of the problem!”. Which I’m pretty sure is false.
In order for GPT-4 (or GPT-2) to be a major optimistic update about alignment, there needs to be a way to leverage “really good NLP” to help with alignment. I think the crux of disagreement is that you think really-good-NLP is obviously super helpful for alignment and should be a big positive update, and Eliezer and Nate and I disagree.
Maybe a good starting point would be for you to give examples of concrete ways you expect really good NLP to put humanity in a better position to wield superintelligence, e.g., if superintelligence is 8 years away?
(Or say some other update we should be making on the basis of “really good NLP today”, like “therefore we’ll probably unlock this other capability X well before ASI, and X likely makes alignment a lot easier via concrete pathway Y”.)
To pick a toy example, you can use text as a bottleneck to force systems to “think out loud” in a way which will be very directly interpretable by a human reader, and because language understanding is so rich this will actually be competitive with other approaches and often superior.
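A minimal sketch of this toy example (again using a hypothetical `query_llm` stand-in for a language-model call, not any particular system): the intermediate reasoning is forced through plain English, which a human can read and veto before anything is executed.

```
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a language model."""
    raise NotImplementedError

def run_with_text_bottleneck(task: str) -> str:
    # Stage 1: the system must articulate its plan in plain English.
    plan = query_llm(
        f"Task: {task}\n"
        "Write out, step by step and in plain English, how you would do this."
    )

    # The bottleneck: a human reads the plan and can veto it.
    print("Proposed plan:\n" + plan)
    if input("Approve this plan? [y/N] ").strip().lower() != "y":
        return "Plan rejected by human reviewer."

    # Stage 2: execution is conditioned only on the approved English plan.
    return query_llm(f"Carry out exactly this approved plan and nothing more:\n{plan}")
```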
I’m sure you can come up with more ways that the existence of software that understands language and does ~nothing else makes getting computers to do what you mean easier than if software did not understand language. Please think about the problem for 5 minutes. Use a clock.
I appreciate the example!

Are you claiming that this example solves "a major part of the problem" of alignment? Or that, e.g., this plus four other easy ideas solve a major part of the problem of alignment?
Examples like the Visible Thoughts Project show that MIRI has been interested in research directions that leverage recent NLP progress to try to make inroads on alignment. But Matthew’s claim seems to be ‘systems like GPT-4 are grounds for being a lot more optimistic about alignment’, and your claim is that systems like these solve “a major part of the problem”. Which is different from thinking ‘NLP opens up some new directions for research that have a nontrivial chance of being at least a tiny bit useful, but doesn’t crack open the problem in any major way’.
It’s not a coincidence that MIRI has historically worked on problems related to AGI analyzability / understandability / interpretability, rather than working on NLP or machine ethics. We’ve pretty consistently said that:
The main problems lie in ‘we can safely and reliably aim ASI at a specific goal at all’.
The problem of going from ‘we can aim the AI at a goal at all’ to ‘we can aim the AI at the right goal (e.g., corrigibly inventing nanotech)’ is a smaller but nontrivial additional step.
… Whereas I don’t think we’ve ever suggested that good NLP AI would take a major bite out of either of those problems. The latter problem isn’t equivalent to (or an obvious result of) ‘get the AI to understand corrigibility and nanotech’, or for that matter ‘get the AI to understand human preferences in general’.
I do not necessarily disagree or agree, but I do not know which source you derive “very clearly” from. So do you have any memory which could help me locate that text?
Here's a comment from Eliezer in 2010:

I think controlling Earth’s destiny is only modestly harder than understanding a sentence in English.
Well said. I shall have to try to remember that tagline.
I think this provides some support for the claim, “Historically [Eliezer] very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence.” At the very least, the two claims are consistent.
??? What?? It’s fine to say that this is a falsified prediction, but how does “Eliezer expected less NLP progress pre-ASI” provide support for “Eliezer thinks solving NLP is a major part of the alignment problem”?
I continue to be baffled at the way you’re doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong. (Which I also vouch for from having worked with them for ten years, separate from the giant list of specific arguments I’ve made. Good grief.)
At the very least, the two claims are consistent.
?? “Consistent” is very different from “supports”! Every off-topic claim by EY is “consistent” with Gallabytes’ assertion.
??? What?? It’s fine to say that this is a falsified prediction, but how does “Eliezer expected less NLP progress pre-ASI” provide support for “Eliezer thinks solving NLP is a major part of the alignment problem”?
ETA: first of all, the claim was “Historically [Eliezer] very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence.”, which is semantically different from “Eliezer thinks solving NLP is a major part of the alignment problem”.
All I said is that it provides “some support”, and I hedged in the next sentence. I don’t think it totally vindicates the claim. However, I think the fact that Eliezer seems not to have expected NLP to be solved until very late might easily explain why he illustrated the alignment problem with stories like a genie throwing your mother out of a building because you asked it to get your mother away from the building. Do you really disagree?
I continue to be baffled at the way you’re doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong.
This was one case, and I said “some support”. The evidence in my post was quite a bit stronger, IMO. Basically all the statements I made about how MIRI thought value specification would be both hard and an important part of alignment are supported by straightforward quotations. The real debate mostly seems to come down to whether by “value specification” MIRI people were including problems of inner alignment, which seems implausible to me, and at least ambiguous even under very charitable interpretations.
By contrast, you, Eliezer, and Nate all flagrantly misinterpreted me as saying that MIRI people thought that AI wouldn’t understand human values even though I explicitly and very clearly said otherwise in the post more than once. I see these as larger errors than me misinterpreting Eliezer in this narrow case.
This would make more sense if LLMs were directly selected for predicting preferences, which they aren’t. (RLHF tries to bridge the gap, but this apparently breaks GPT’s ability to play chess—though I’ll grant the surprise here is that it works at all.) LLMs are primarily selected to predict human text or speech. Now, I’m happy to assume that if we gave humans a D&D-style boost to all mental abilities, each of us would create a coherent set of preferences from our inconsistent desires, which vary and may conflict at a given time even within an individual. Such augmented humans could choose to express their true preferences, though they still might not. If we gave that idealized solution to LLMs, it would just boost their ability to predict what humans or augmented humans would say. The augmented-LLM wouldn’t automatically care about the augmented-human’s true values.
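For concreteness, the contrast being drawn is between the standard pretraining objective, which rewards only next-token prediction, and the standard RLHF objective, which optimizes a reward model $r_\phi$ trained on human comparisons (these are the textbook formulations, not anything specific to this thread):

$$\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$$

$$\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \right)$$

Neither objective directly optimizes for tracking anyone's true preferences; the reward model is itself just a predictor of human comparison data, which is the gap being pointed at.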
While we can loosely imagine asking LLMs to give the commands that an augmented version of us would give, that seems to require actually knowing how to specify how a D&D ability-boost would work for humans; this will only resemble the same boost for AI at an abstract mathematical level, if at all. It seems to take us back to the CEV problem of explaining how extrapolation works. Without being able to do that, we’d just be hoping a better LLM would look at our inconsistent use of words like “smarter,” and pick the out-of-distribution meaning we want, for cases which have mostly never existed. This is a lot like what “The Hidden Complexity of Wishes” was trying to get at, as well as the longstanding arguments against CEV. Vaniver’s comment seems to point in this same direction.
Now, I do think recent results are some evidence that alignment would be easier for a Manhattan Project to solve. It doesn’t follow that we’re on track to solve it.