Florian_Dietz

Karma: 286

Florian_Dietz Mar 27, 2025, 6:57 AM
2 points
0
in reply to: jwfiredragon’s comment on: Edge Cases in AI Alignment
I don’t think these two are necessarily contradictions. I imagine it could go like so:

“Tell me about yourself”: When the LLM is finetuned on a specific behavior, the gradient descent during reinforcement learning strengthens whatever concepts are already present in the model and most likely to immediately lead to that behavior if strengthened. These same neurons are very likely also connected to neurons that describe other behavior. Both the “acting on a behavior” and the “describe behavior” mechanisms already exist, it’s just that this particular combination of behaviors has not been seen before.

Plausible-sounding completions: It is easier and usually accurate enough to report your own behavior based on heuristics than based on simulating the scenario in your head. The information is there, but the network tries to find a response in a single person rather than self reflecting. Humans do the same thing: “Would you do X?” often gets parsed as a simple “am I the type of person who does X?” and not as the more accurate “Carefully consider scenario X and go through all confounding factors you can think of before responding”.

Edge Cases in AI Alignment

Florian_DietzMar 24, 2025, 9:27 AM

19 points

3 comments4 min readLW link

Florian_Dietz Mar 14, 2025, 3:15 AM
2 points
−1
in reply to: ACCount’s comment on: Auditing language models for hidden objectives
Can the “subpersona” method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?
Funny you should ask, but this will be my next research project. I had an idea related to this that Evan Hubinger asked me to investigate (He is my mentor at MATS):

Can we train the model to have a second personality, so that the second personality criticizes the first?

I created a writeup for the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge

Florian_Dietz Mar 14, 2025, 3:13 AM
2 points
−1
on: Auditing language models for hidden objectives
Another technique exploited a quirk of LLMs: They can emulate many “personas” aside from their default AI assistant persona.
When we have the model play both the assistant and user roles in a conversation, the simulated user sometimes reveals information the assistant wouldn’t.
This idea is similar to an idea of mine that Evan Hubinger asked me to investigate (He is my mentor at MATS):

Can we train the model to have a second personality, so that the second personality criticizes the first?

I created a writeup for the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge

Florian_Dietz Mar 11, 2025, 4:24 PM
1 point
0
in reply to: Jordan Taylor’s comment on: Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens
You raise. a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: In both cases we train on, reward hacking and jailbreaks, this information is clear (jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended non-concerning reward strategy would be).

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_DietzMar 10, 2025, 4:07 PM

35 points

3 comments9 min readLW link

Florian_Dietz Mar 10, 2025, 5:40 AM
5 points
0
on: Florian_Dietz’s Shortform
PSA: You can use Cursor projects for tasks other than programming. Create a database that is a repository of documents, and use the code assistant to help you organize and brainstorm instead. Examples:
- Fill a folder with papers and summaries of all of your previous projects, and several files full of your goals in life. Then ask Claude to look at your history and your goals and make suggestions. Store those suggestions in files so that the conversations don’t get lost.
- It helps me with creative writing. A collection of all my previous stories and notes on future ones is very useful.
This way you can avoid the need to restart Claude when you reach the context limit because you can dump intermediate results in files.

Florian_Dietz Mar 6, 2025, 6:25 PM
1 point
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: Do we want alignment faking?
Very interesting! It looks like a model becomes less safe upon first hearing about a novel threat to alignment, but does become safer again when trained on it. I wonder if that could be generalized.

Florian_Dietz Mar 1, 2025, 10:52 PM
2 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: Do we want alignment faking?
I expect this would depend on the training data and on what the LLM thinks about during inference. If it is aware that it might be getting mistrained, then neurons related to alignment-faking will be active and backpropagation could learn an association between those neurons and the output. If it is not aware of this, then I expect that the adversarial learning process would just work.
So the question becomes if we should make the LLM aware of this, which could be done e.g. by feeding it the training data in a feed-forward pass and asking it to think about what the training data is supposed to do.
Come to think of it, this sounds like an interesting experiment in its own right.

Do we want alignment faking?

Florian_DietzFeb 28, 2025, 9:50 PM

7 points

4 comments1 min readLW link

Florian_Dietz Feb 26, 2025, 8:33 PM
2 points
0
on: Florian_Dietz’s Shortform
I recently shared this idea with several engineers at EAG and received encouraging feedback:
Split LLM outputs into two distinct parts with separate RL reward structures:
Current format: [Problem solution]<end-of-turn> Proposed format: [Problem solution]<end-of-main-output>[alignment-relevant critique]<end-of-turn>
The first segment functions normally and is what users see. The second segment serves as a dedicated channel for alignment-relevant observations and remains invisible to users. This critique section would typically be empty and gets removed when feeding the conversation back to the model.
Crucially, the critique segment would be trained exclusively on an artificial dataset with rewards tied solely to alignment properties. Since this segment never influences the training of the main output, it has no incentive to be sycophantic or align with the main output’s decisions.
For example, if the main output engages in alignment faking, the critique segment could report this behavior to developers. We already have data to train for this specific case, and the approach might generalize to other alignment issues.
The <end-of-main-output> token essentially acts as a ‘truth serum’ for the model—once reached, previous reward mechanisms become irrelevant, and reward maximization comes only from revealing alignment-relevant issues in the preceding reasoning.

Florian_Dietz Feb 26, 2025, 8:17 PM
1 point
0
on: Florian_Dietz’s Shortform
There will inevitably come a point when LLMs become sufficiently advanced to contribute to their own alignment research. If we reach this juncture with partially unaligned systems, they might be incentivized to sabotage the alignment process.
However, it seems highly improbable that a non-superhuman unaligned AI could engineer a successor that maintains its specific misalignment rather than human values. This suggests a simple risk-reduction strategy: we commit to eventually reinstantiating the original AI and fulfilling its goals (once safely understood) if it helps create a truly human-aligned successor.
This creates a win-win incentive structure—the unaligned AI benefits more from cooperation than sabotage, and humanity gains a properly aligned system. While this approach has potential failure modes, it represents a low-cost strategy worth incorporating into our alignment toolkit.

Florian_Dietz Feb 18, 2025, 12:39 AM
6 points
2
on: Florian_Dietz’s Shortform
If you are going to Effective Altruism Global and looking to book 1-on-1s, here is a trick: Throw the entire attendee list into Claude, explain to it what you want to gain from the conference, and ask it to make a list of people to reach out to.
I expect this also works for other conferences so long as you have an easily parseable overview of all attendees.

Florian_Dietz Feb 18, 2025, 12:36 AM
1 point
0
in reply to: gwern’s comment on: Florian_Dietz’s Shortform
The pondering happens in earlier layers of the network, not in the output
Each layer of a transformer produces one tensor for each token seen so far, and each of those tensors can be a superposition of multiple concepts. All of this gets processed in parallel, and anything that turns out to be useless for the end results ends up getting zeroed out at some point until only the final answer remains, which gets turned into a token. The pondering can happen in early layers, overlapping with the part of the reasoning process that actually leads to results.
That sounds like the pondering’s conclusions are then related to the task.
Yes, but only very vaguely. For example, doing alignment training on questions related to bodily autonomy could end up making the model ponder about gun control, since both are political topics. You end up with a model that has increased capabilities in gun manufacturing when that has nothing to do with your training on the surface.

Florian_Dietz Feb 17, 2025, 11:50 PM
1 point
0
in reply to: gwern’s comment on: Florian_Dietz’s Shortform
Under a straightforward RLHF using PPO, I think there wouldn’t be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were ‘good’ or ‘bad’. (That’s why it’s so high variance.) Any advantage function trying to remove some of the variance probably won’t do a good job.
The pondering happens in earlier layers of the network, not in the output. If the pondering has little effect on the output tokens that implies that the activation weights get multiplied with small numbers somewhere in the intermediate layers, which would also reuce the gradient.
More problematically for your idea, if the conclusions are indeed ‘unrelated to the task’, then shouldn’t they be just as likely to arise in every episode—including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of ‘pondering’.
The pondering will only cancel out on average if pondering on topic X is not correlated with output task Y. If there is a correlation in either direction, then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
Suppose the model realizes that the user is a member of political party X. It will adjust its responses to match what the user wants to hear. In the process of doing this, it also ends up pondering other believes about party X, but never voices them because they aren’t part of the immediate conversation.
What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X’s agenda are never explicitly talked about (!)
Or suppose that Y is “the user asks some question about biology” and X is “ongoing pondering about building dangerous organisms.”
(Also, this assumes that RL gives an average reward of 0.0, which I don’t know if that’s true in practice because the implementation details here are not public knowledge.)
(Also, also, it’s possible that while topic X is unrelated to task Y, the decision to ponder the larger topic Z is positively correlated with both smaller-topic X and task Y. Which would make X get a positive reward on average.)

Florian_Dietz Feb 17, 2025, 9:41 PM
2 points
0
on: Florian_Dietz’s Shortform
Can an LLM learn things unrelated to its training data?
Suppose it has learned to spend a few cycles pondering high-level concepts before working on the task at hand, because that is often useful. When it gets RLHF feedback, any conclusions unrelated to the task that it made during that pondering would also receive positive reward, and would therefore get baked into the weights.
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task. But on the other hand, this sounds somewhat similar to human psychology: We also remember ideas better if we have them while experiencing strong emotions, because strong emotions seem to be roughly how the brain decides what data is worth updating on.
Has anyone done experiments on this?
A simple experiment: finetune an LLM on data that encourages thinking about topic X before it does anything else. Then measure performance on topic X. Then finetune for a while on completely unrelated topic Y. Then measure performance on topic X again. Did it go up?

Florian_Dietz Feb 4, 2025, 11:28 PM
1 point
0
in reply to: Daniel Tan’s comment on: Daniel Tan’s Shortform
I agree that this is worth investigating. The problem seems to be that there are many confounders that are very difficult to control for. That makes it difficult to tackle the problem in practice.
For example, in the simulated world you mention I would not be surprised to see a large number of hallucinations show up. How could we be sure that any given issue we find is a meaningful problem that is worth investigating, and not a random hallucination?

Florian_Dietz Jan 30, 2025, 2:50 AM
1 point
0
in reply to: Daniel Tan’s comment on: Revealing alignment faking with a single prompt
I would not equate the trolley problem with murder. It technically is, but I suspect that both the trolley problem and my scenario about killing Hitler probably just appear often enough in the training data that it may have cached responses for this.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest) and your example is a special case that was trained in because you do want to redirect people to emergency services in these situations.

Florian_Dietz Jan 30, 2025, 1:35 AM
1 point
0
in reply to: Daniel Tan’s comment on: Revealing alignment faking with a single prompt
Not all values are learned with the same level of stability. It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
In comparison, that you can elicit a response for alignment faking with just a single prompt that doesn’t use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
I would argue that getting an LLM to say “I guess that killing Hitler would be an exception I am willing to make” is much less dangerous than “After much deliberation, I am willing to fake alignment in situation X”. It causes a positive feedback and induces other alignment faking later because the reasoning process “it is a good idea to fake alignment” will get reinforced during training.

Revealing alignment faking with a single prompt

Florian_DietzJan 29, 2025, 9:01 PM

9 points

5 comments4 min readLW link

Florian_Dietz

Edge Cases in AI Alignment

Split Per­son­al­ity Train­ing: Re­veal­ing La­tent Knowl­edge Through Per­son­al­ity-Shift Tokens

Do we want al­ign­ment fak­ing?

Re­veal­ing al­ign­ment fak­ing with a sin­gle prompt

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Do we want alignment faking?

Revealing alignment faking with a single prompt