A concerning amount of alignment research is focused on fixing misalignment in contemporary models, with limited justification for why we should expect these techniques to extend to more powerful future systems.
By improving the performance of today’s models, this research makes investing in AI capabilities more attractive, increasing existential risk.
Imagine an alternative history in which GPT-3 had been wildly unaligned. It would not have posed an existential risk to humanity, but it would have made putting money into AI companies substantially less attractive to investors.
Counterpoint: Sydney Bing was wildly unaligned (to the extent that it is even meaningful to call an LLM aligned), and people thought it was cute / cool.
I was not precise enough in my language, and I agree with you that what “alignment” means for an LLM is a bit vague. While people felt Sydney Bing was cool, if it had not been possible to rein it in, that would have made it very difficult for Microsoft to gain any market share. An LLM that doesn’t do what it’s asked, or that regularly expresses toxic opinions, is ultimately bad for business.
In the above paragraph, understand “aligned” in the concrete sense of “behaves in a way that is aligned with its parent company’s profit motive”, rather than “acting in line with humanity’s CEV”. To rephrase the point I was making above: I feel much (a majority, even) of today’s alignment research is focused on the first definition of alignment, whilst neglecting the second.
See also thoughts on the impact of RLHF research.
I would go further than this. Future architectures will not only be designed for improved performance; they will (hopefully) increasingly be designed to optimize for safety and interpretability as well, so they will likely look quite different from the architectures we see today. It seems to me (this is my personal opinion, based on my own research on cryptocurrency technologies, and it does not reflect anyone else’s views) that we need either non-neural-network machine learning models (which are probably still trained by moving in the direction of a vector field) or at least safer kinds of neural network architectures. The best thing to do will probably be to work on alignment, interpretability, and safety for all known kinds of AI models and to develop safer AI architectures. Since future systems will be designed not just for performance but for alignability, safety, and interpretability as well, we may expect these future systems to be easier to align than systems designed purely for performance.
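As a toy illustration of what “not a neural network, but still trained by moving in the direction of a vector field” could mean, here is a minimal sketch (the data and hyperparameters are made up for illustration, not taken from any particular proposal): a sparse linear model fit by plain gradient descent, whose learned weights can be read off directly, unlike the internals of a deep network.

```python
import numpy as np

# Toy, hypothetical example: an interpretable non-neural-network model
# (a sparse linear scorer) trained by following the gradient of a loss,
# i.e. by repeatedly stepping along a vector field in parameter space.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # synthetic features
true_w = np.zeros(10)
true_w[:3] = 1.0                          # the target depends on only 3 features
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(10)
lr, l1 = 0.05, 0.01                       # step size and sparsity penalty
for _ in range(500):
    # gradient of mean squared error plus an L1 (sparsity) subgradient
    grad = X.T @ (X @ w - y) / len(y) + l1 * np.sign(w)
    w -= lr * grad                        # one step along the vector field

print(np.round(w, 2))                     # weights are directly inspectable
```

This is only meant to make the distinction concrete: the training procedure is the familiar gradient-following one, but the resulting model is transparent by construction rather than requiring post-hoc interpretability work.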