My view is that we’ve already made some significant progress on alignment, compared to, say, where we were O(15) years ago, and have also had some unexpectedly lucky breaks. Personally, I’d list:
Value learning, as a potential solution to issues like corrigibility and the shut-down problem.
Once your value learner is a STEM-capable AGI, doing or assisting with alignment research becomes a convergent instrumental strategy for it.
The closest thing we currently have to an AGI, the LLM, is fortunately not particularly agentic: LLMs are more of a tool AI (until you wrap one in a script that runs it in a loop with suitable prompts).
To be more specific: for the duration of generating a specific document (at least before RLHF), an LLM emulates the output of a human or humans generating text. So to the extent that LLMs pick up and emulate agentic behavior from us, that agency is myopic past the end of the document, and it emulates some human(s) who contributed text to the training set. Semi-randomly-chosen humans are a type of agent that humans are unusually good at understanding and predicting. The orthogonality thesis doesn’t apply to them: they will have an emulation of some version of human values. Like actual random humans, they’re not inherently fully aligned, but on average they’re distinctly better than paperclip maximizers. (Also, both RLHF and prompts can alter the random distribution.)
While human values are large and fragile, LLMs are capable of capturing fairly good representations of large, fragile things, including human values. So things like constitutional AI work. That still leaves concerns about what happens when we apply optimization pressure or distribution shifts to these representations of human values, but it’s at least a lot better than expecting us to hand-craft a utility function for the entirety of human values in symbolic form. If we could determine when an LLM’s representation of human values is out-of-distribution and thus unreliable, then we might actually have a basis for an AGI-alignment solution that I wouldn’t expect to immediately kill everyone. (For example, it might make an acceptable initial setting to preload into an AGI value learner, which could then refine it and extend its region of validity.) Even better, knowing when an LLM isn’t able to give a reliable answer is a capabilities problem, not just an alignment problem, since it’s the same issue as getting an LLM to reply “I don’t know” when asked a question to which it would otherwise have hallucinated a false answer. So all of the companies buying and selling access to LLMs are strongly motivated to solve this. (Indeed, leading LLM companies appear to have made significant progress on reducing hallucination rates in the last year.)
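The “wrap them in a script to run them in a loop” point above can be made concrete with a minimal sketch. Everything here is hypothetical: `call_llm` stands in for whatever text-completion API you use, stubbed out so the loop runs without network access. The point is that the agency lives in the wrapper loop, not in the model itself:

```python
# Minimal sketch of the tool-AI -> agent wrapper described above.
# `call_llm` is a hypothetical stand-in for any LLM completion API,
# stubbed here so the example is runnable offline.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call, stubbed: 'plans' one step, then finishes."""
    if "step 1" in prompt:
        return "DONE: task complete"
    return "step 1: gather information"

def agent_loop(task: str, max_steps: int = 5) -> list[str]:
    """Repeatedly feed the LLM its own transcript until it signals DONE.

    The loop (goal, memory, repeated invocation) is what supplies the
    agentic behavior; each individual LLM call remains a myopic tool.
    """
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(transcript))
        transcript.append(action)
        if action.startswith("DONE"):
            break
    return transcript

print(agent_loop("summarise a document"))
# -> ['Task: summarise a document', 'step 1: gather information', 'DONE: task complete']
```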
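One common proxy for the “know when you don’t know” problem in the last point is self-consistency: sample several answers and abstain when they disagree, on the theory that hallucinations tend to be inconsistent across samples. This is only a sketch of that one heuristic, not the method any particular lab uses; `sample_answer` is a hypothetical stochastic LLM call, stubbed for illustration:

```python
# Hedged sketch of abstention via self-consistency: answer only when
# repeated samples mostly agree, otherwise say "I don't know".

from collections import Counter

def sample_answer(question: str, seed: int) -> str:
    """Hypothetical stochastic LLM sample, stubbed for illustration."""
    known = {"capital of France": "Paris"}
    if question in known:
        return known[question]
    # On unfamiliar questions the stub returns inconsistent guesses,
    # mimicking hallucination.
    return f"guess-{seed % 3}"

def answer_or_abstain(question: str, n: int = 5, agreement: float = 0.8) -> str:
    """Return the majority answer only if at least `agreement` of samples concur."""
    votes = Counter(sample_answer(question, s) for s in range(n))
    top, count = votes.most_common(1)[0]
    return top if count / n >= agreement else "I don't know"

print(answer_or_abstain("capital of France"))    # consistent samples -> "Paris"
print(answer_or_abstain("capital of Atlantis"))  # inconsistent -> "I don't know"
```

The agreement threshold trades off coverage against reliability: raising it makes the system abstain more often but hallucinate less.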
This is a personal list, and I’m sure it’s missing some items.
That we’ve made some progress and had some lucky breaks doesn’t guarantee that this will continue, but it’s unsurprising to me that
alignment research in the context of a specific technology that we can actually experiment with is easier than trying to do alignment research in the abstract for arbitrary future systems, and that
with more people interested in alignment research we’re making progress faster.