I really like your recent series of posts that succinctly address common objections/questions/suggestions about alignment concerns. I’m making a list to show my favorite skeptics (all ML/AI people; nontechnical people, as Connor Leahy puts it, tend to respond “You fucking what? Oh hell no!” or similar when informed that we are going to make genuinely smarter-than-us AI soonish).
We do have ways to get an AI to do what we want. The hardcoded algorithmic maximizer approach seems to be utterly impractical at this point. That leaves us with approaches that don’t obviously do a good job of preserving their own goals as they learn and evolve:
Training a system to pursue things we like, as in shard theory and similar approaches.
Training or hand-coding a critic system, as in the approaches outlined by Steve Byrnes, me, and many others, nicely summarized as a Steering systems approach. This seems a bit less sketchy than training in our goals and hoping they generalize adequately, but still pretty sketchy.
Telling the agent what to do, in a Natural language alignment approach. This seems absurdly naive. However, I’m starting to think our first human-plus AGIs will be wrapped or scaffolded LLMs, which to a nontrivial degree actually think in natural language. People are right now specifying goals in natural language, and those goals can include alignment goals (or destroying humanity, haha); a minimal sketch of this pattern follows this list. I just wrote an in-depth post on the potential Capabilities and alignment of LLM cognitive architectures, but I don’t have much to say about stability in that post.
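To make the third approach concrete, here is a minimal, hypothetical sketch of what "specifying goals in natural language" looks like in a scaffolded LLM agent. Nothing here is from my post or from any particular framework: `call_llm`, the prompts, and `run_agent_step` are stand-in names, and the point is only that the alignment goal is just more prompt text, with no mechanism that keeps its meaning stable as the agent's notes and plans accumulate.

```python
# Hypothetical sketch of a scaffolded LLM agent whose task and alignment
# constraints are both plain natural language. Not a real API.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for whatever model API the scaffold actually uses."""
    raise NotImplementedError("wire this to your model of choice")

# The "alignment" here is nothing more than a sentence in the system prompt.
ALIGNMENT_GOALS = (
    "Above all else: avoid actions that harm humans, be honest, "
    "and defer to human oversight when uncertain."
)

def run_agent_step(task: str, memory: list[str]) -> str:
    # Each step, the agent sees its standing goals plus everything it has
    # written so far; nothing enforces that the goal text keeps its original
    # meaning as the agent's knowledge and plans evolve.
    system_prompt = f"You are an autonomous agent.\n{ALIGNMENT_GOALS}"
    user_prompt = f"Task: {task}\nNotes so far:\n" + "\n".join(memory)
    plan = call_llm(system_prompt, user_prompt)
    memory.append(plan)
    return plan
```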
None of these directly addresses what I’m calling The alignment stability problem, to give a name to what you’re addressing here. I think addressing it will work very differently in each of the three approaches listed above, and may well come down to implementation details within each approach. I think we should be turning our attention to this problem along with the initial alignment problem, because some of the optimism in the field stems from thinking about initial alignment and not long-term stability.
Edit: I left out Ozyrus’s posts on approach 3. He’s the first person I know of to see agentized LLMs coming, outside of David Shapiro’s 2021 book. His post was written a year ago and posted two weeks ago to avoid infohazards. I’m sure there are others who saw this coming more clearly than I did, but I thought I’d try to give credit where it’s due.
Maybe the alignment stability problem is the same thing as the sharp left turn?
I don’t think so. That’s one breaking point for alignment, but I’m saying in that post that even if we avoid a sharp left turn and make it to an aligned, superintelligent AGI, its alignment may drift away from human values as it continues to learn. Learning may necessarily shift the meanings of existing concepts, including values.