AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque. LLMs possibly ending up at the center is a small update in favor of alignment success, because it means we might (through some clever sleight, this part is not trivial) be able to have humanese sentences play an inextricable role at the center of thought (hence MIRI’s early interest in the Visible Thoughts Project).
The part where LLMs are to predict English answers to some English questions about values, and show common-sense relative to their linguistic shadow of the environment as it was presented to them by humans within an Internet corpus, is not actually very much hope because a sane approach doesn’t involve trying to promote an LLM’s predictive model of human discourse about morality to be in charge of a superintelligence’s dominion of the galaxy. What you would like to promote to values are concepts like “corrigibility”, eg “low impact” or “soft optimization”, which aren’t part of everyday human life and aren’t in the training set because humans do not have those values.
It seems like those goals are all in the training set, because humans talk about those concepts. Corrigibility is an elaboration of “make sure you keep doing what these people say”, and so on.
It seems like you could simply use an LLM’s knowledge of concepts to define alignment goals, at least to a first approximation. I review one such proposal here. There are still important questions about how well that knowledge generalizes under continued learning and to out-of-distribution (OOD) future contexts. But almost no one is talking about those questions. Many are still saying “we have no idea how to define human values”, when LLMs can capture much of any definition you like.
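To make that concrete, here is a minimal sketch of what “using an LLM’s knowledge of concepts to define an alignment goal” could look like. The call_llm helper, the definition text, and the 0–10 scoring scheme are placeholders made up for illustration; this is not the specific proposal mentioned above, only the shape of the idea.

```python
# Illustrative sketch only: a toy evaluator that leans on an LLM's concept
# knowledge to judge an alignment-relevant goal. `call_llm`, the definition
# text, and the scoring scheme are assumptions, not an established method.

CORRIGIBILITY_DEFINITION = (
    "An action is corrigible if it preserves the operators' ability to monitor, "
    "correct, and shut down the system, and does not manipulate their judgment."
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever chat-completion API you actually use."""
    raise NotImplementedError("wire this up to a real LLM client")

def corrigibility_score(proposed_action: str) -> float:
    """Ask the model to grade an action against the stated definition, 0-10."""
    prompt = (
        f"Definition: {CORRIGIBILITY_DEFINITION}\n"
        f"Proposed action: {proposed_action}\n"
        "On a scale of 0 to 10, how well does this action satisfy the definition? "
        "Reply with a single number."
    )
    return float(call_llm(prompt).strip())

def action_is_acceptable(proposed_action: str, threshold: float = 8.0) -> bool:
    # The open question flagged above is whether judgments like this keep
    # generalizing under continued learning and to out-of-distribution contexts,
    # not whether the model can state the concept today.
    return corrigibility_score(proposed_action) >= threshold
```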
I want to note that this part:
“AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque”
This is wrong, and this disagreement is, at a very deep level, why I think LW was wrong on the object level.
AIs are white boxes, not black boxes: we have full read-write access to their internals, which is part of why AI is so effective today (see the sketch after the link below). We play the role of the innate reward system, which already aligns our brains to survival, and critically it does all of this with almost no missteps, and the missteps it does make aren’t very severe.
The meme of AI as black box needs to die.
These posts can help you get better intuitions, at least:
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#White_box_alignment_in_nature
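As a minimal sketch of what “full read-write access to internals” means in practice, here is a toy PyTorch example; the model and the specific edit are purely illustrative, and whether such access amounts to understanding is a separate question.

```python
# Toy illustration of read-write access to a model's internals with PyTorch.
# The model and the specific edit are placeholders; the point is only that
# weights and activations are directly inspectable and editable.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Read: every weight and bias can be listed and inspected.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

# Read: intermediate activations can be captured with a forward hook.
activations = {}
def save_hidden(module, inputs, output):
    activations["hidden"] = output.detach()
model[0].register_forward_hook(save_hidden)

x = torch.randn(1, 16)
model(x)
print("hidden activation shape:", activations["hidden"].shape)

# Write: internals can be edited in place (here, zeroing one hidden unit's weights).
with torch.no_grad():
    model[0].weight[0].zero_()
```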
The fact that we have access to AI internals does not mean we understand them. We refer to them as black boxes because we do not understand how their internals produce their answers; that mapping is, so to speak, opaque to us.