I’ve recently started in an alignment research program. One outcome among many is that my beliefs and models have changed. Here I outline some of the ideas I’ve thought about.
The tone of this post is more like
“Here are possibilities and ideas I hadn’t really realized exist. I’m still uncertain about how important they are, but they seem worth thinking about. (Maybe these points are useful to others as well?)”
than
“These new hypotheses I encountered are definitely right.”
This ended up rather long for a shortform post, but I’m still posting it here, as it’s quite low-effort and probably not of that wide an interest.
Insight 1: You can have an aligned model that is neither inner nor outer aligned.
Opposing frame: To solve the alignment problem, we need to solve both outer and inner alignment: choose a loss function whose global optimum is safe, and then have the model actually aim to optimize that loss function.
Current thoughts:
A major point of the inner alignment texts and related work is that outer optimization processes are cognition-modifiers, not terminal values. Why on earth should we limit our cognition-modifiers to things that are kinda like humans’ terminal values?
That is: we can choose our loss functions and whatever so that they build good cognition inside the model, even if those loss functions’ global optima are nothing like Human Values Are Satisfied. In fact, this seems like a more promising alignment approach than what the opposing frame suggests!
(What made this click for me was Evan Hubinger’s training stories post, in particular the excerpt
It’s worth pointing out how phrasing inner and outer alignment in terms of training stories makes clear what I think was our biggest mistake in formulating that terminology, which is that inner/outer alignment presumes that the right way to build an aligned model is to find an aligned loss function and then have a training goal of finding a model that optimizes for that loss function.
I’d guess that this post makes a similar point somewhere, though I haven’t fully read it.)
Insight 2: You probably want to prevent deceptively aligned models from arising in the first place, rather than be able to answer the question “is this model deceptive or not?” in the general case.
Elaboration: I’ve been thinking “oh man, deceptive alignment and the adversarial set-up make things really hard”. I hadn’t taken the next step of asking “okay, could we prevent deception from arising in the first place?”.
(“How could you possibly do that without being able to tell whether models are deceptive or not?”, one asks. See here for an answer. My summary: Decompose the model space into three regions:
A) Verifiably Good models
B) models where the verification throws “false”
C) deceptively aligned models which trick us into thinking they are Verifiably Good.
Aim to find a notion of Verifiably Good such that one gradient step never takes you from A to C. Then you start training from A, and if you end up in B, you just undo your updates.)
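A minimal sketch of that loop, under made-up stand-ins: `verifiably_good` and `train_step` below are hypothetical placeholders I invented, not a real verifier or API from the linked proposal.

```python
import copy

def train_step(model, batch):
    # placeholder for one real gradient update
    model["steps"] = model.get("steps", 0) + 1
    return model

def verifiably_good(model):
    # placeholder: a real verifier would inspect the model, not a step counter
    return model.get("steps", 0) % 5 != 0

def train_with_rollback(model, data):
    checkpoint = copy.deepcopy(model)      # last state known to be in region A
    for batch in data:
        model = train_step(model, batch)
        if verifiably_good(model):         # still in region A: keep going
            checkpoint = copy.deepcopy(model)
        else:                              # landed in region B: undo the update
            model = copy.deepcopy(checkpoint)
    return model

print(train_with_rollback({"steps": 0}, data=range(20)))
```

The point of the sketch is just the control flow: training never accepts a state the verifier rejects, so (by the assumption that one step can’t jump from A to C) the model stays in region A.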
Insight 3: There are more approaches to understanding our models / having transparent models than mechanistic interpretability.
Previous thought: We have to do mechanistic interpretability to understand our models!
Current thoughts: Sure, solving mech interp would be great. Still, there are other approaches:
Train models to be transparent. (Think: have a term in the loss function for transparency; a toy sketch follows this list.)
Better understand training processes and inductive biases. (See e.g. work on deep double descent, grokking, phase changes, …)
Create architectures that are more transparent by design.
(Chain-of-thought faithfulness is about making LLMs’ thought processes interpretable in natural language.)
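As a toy sketch of the first bullet: the “transparency term” below is a stand-in I chose (an L1 sparsity penalty on hidden activations, on the rough intuition that sparser activations are easier to inspect), not a claim about how such training is actually done.

```python
import torch
import torch.nn as nn

# Toy sketch: task loss plus a made-up "transparency" penalty, weighted by lam.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 1e-3

x, y = torch.randn(64, 10), torch.randn(64, 1)
hidden = model[1](model[0](x))               # activations after the ReLU
pred = model[2](hidden)
loss = nn.functional.mse_loss(pred, y) + lam * hidden.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```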
Past-me would have objected “those don’t give you actual detailed understanding of what’s going on”. To which I respond:
“Yeah sure, again, I’m with you on solving mech interp being the holy grail. Still, it seems like using mech interp to conclusively answer ‘how likely is deceptive alignment’ is actually quite hard, whereas we can get some info about that by understanding e.g. inductive biases.”
(I don’t intend here to make claims about the feasibility of various approaches.)
Insight 4: It’s easier for gradient descent to do small updates throughout the net than a large update in one part.
(A comment by Paul Christiano, as understood by me.) At least in the context where this was said, it was a good argument for expecting that neural nets have distributed representations for things (instead of local “this neuron does...”).
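A toy way to see this (my own illustration, not Paul’s): to change a sum of N weights by a fixed amount, the minimal-norm update, which is what a gradient step looks like here, spreads the change evenly over all N weights and shrinks as 1/√N, whereas changing a single weight costs the full amount.

```python
import numpy as np

# Toy illustration: increase f(w) = sum(w) by delta.
N, delta = 1000, 1.0
distributed = np.full(N, delta / N)       # gradient-style step: nudge every weight
concentrated = np.zeros(N)
concentrated[0] = delta                   # change one weight a lot
print(np.linalg.norm(distributed))        # ~0.03: small, spread-out update
print(np.linalg.norm(concentrated))       # 1.0: one big local update
```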
Insight 5: You can focus on safe transformative AI rather than safe superintelligence.
Previous thought: “Oh man, lots of alignment ideas I see obviously fail at a sufficiently high capability level.”
Current thought: You can reasonably focus on kinda-roughly-human-level AI instead of full superintelligence. Yep, you do want to explicitly think “this won’t work for superintelligences for reasons XYZ” and note which assumptions about the capability of the AI your idea relies on. Having done that, you can have plans that aim to use AIs for stuff, and you can have beliefs like
“It seems plausible that in the future we have AIs that can do [cognitive task], while not being capable enough to break [security assumption], and it is important to seize such opportunities if they appear.”
Past-me had flinching negative reactions to plans that fail at high capability levels, largely due to finding Yudkowsky’s model of alignment compelling. While I think it’s useful to instinctively think “okay, this plan will obviously fail once we get models that are capable of X, because of Y” in order to notice the security assumptions, I think I went too far by essentially thinking “anything that fails for a superintelligence is useless”.
Insight 6: The reversal curse bounds the level of reasoning/inference current LLMs do.
(Owain Evans said something that made this click for me.) I think the reversal curse is good news in the sense of “current models probably don’t do anything too advanced under-the-hood”. I could imagine worlds where LLMs do a lot of advanced, dangerous inference during training—but the reversal curse being a thing is evidence against those worlds (in contrast to more mundane worlds).
I want to point out that Insight 6 has less value today, since it looks like the reversal curse turns out to be fixable very simply, and I agree with @gwern that the research was just a bit too half-baked for the extensive discussion it got:
https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms-trained-on-a-is-b-fail-to-learn-b-is-a#FLzuWQpEmn3hTAtqD
https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms-trained-on-a-is-b-fail-to-learn-b-is-a#3cAiWvHjEeCffbcof