PSA: If you are writing an important prompt for an LLM that will be run multiple times, it really helps to end it with something like “and if there is anything about this prompt that is unclear to you or that could be improved, tell me about it in a <feedback> tag.”
Source: I’m doing MATS, writing an automated evaluation, and my mentor Evan Hubinger said more people should be doing this.
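A minimal sketch of how this can be automated around whatever API or harness you already use (the suffix wording and function names below are my own, purely illustrative):

```python
import re

FEEDBACK_SUFFIX = (
    "\n\nIf there is anything about this prompt that is unclear to you or that "
    "could be improved, tell me about it in a <feedback> tag."
)

def add_feedback_request(prompt: str) -> str:
    # Append the meta-request so every run can surface problems with the prompt itself.
    return prompt + FEEDBACK_SUFFIX

def extract_feedback(response: str) -> str | None:
    # Collect whatever the model put in the <feedback> tag, if anything.
    match = re.search(r"<feedback>(.*?)</feedback>", response, re.DOTALL)
    return match.group(1).strip() if match else None
```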
If you are going to Effective Altruism Global and looking to book 1-on-1s, here is a trick: Throw the entire attendee list into Claude, explain to it what you want to gain from the conference, and ask it to make a list of people to reach out to.
I expect this also works for other conferences so long as you have an easily parseable overview of all attendees.
There is usually a Google Sheet export of the Swapcard data provided, which makes this easier—but at previous conferences, other attendees were apprehensive when informed that people were doing this.
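For what it’s worth, a rough sketch of that workflow, assuming the export is saved as a CSV (the filename and column names below are made up; adjust them to the actual sheet):

```python
import pandas as pd

# Hypothetical filename/columns for the Swapcard attendee export.
attendees = pd.read_csv("eag_attendees.csv")

goal = "I want to find collaborators for automated evaluations of LLM agents."
prompt = (
    f"{goal}\n\n"
    "Here is the attendee list. Suggest around 20 people I should book "
    "1-on-1s with, and briefly say why for each.\n\n"
    + attendees[["name", "organisation", "bio"]].to_csv(index=False)
)
# Paste `prompt` into Claude, or send it through the API.
```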
PSA: You can use Cursor projects for tasks other than programming. Create a database that is a repository of documents, and use the code assistant to help you organize and brainstorm instead. Examples:
Fill a folder with papers and summaries of all of your previous projects, and several files full of your goals in life. Then ask Claude to look at your history and your goals and make suggestions. Store those suggestions in files so that the conversations don’t get lost.
It helps me with creative writing. A collection of all my previous stories and notes on future ones is very useful.
This way you can avoid the need to restart Claude when you reach the context limit because you can dump intermediate results in files.
Has anyone tried using LLMs as an interactive version of blog posts? Like so: (1) I write several pages of rough, unstructured notes, then talk to Claude to fill in the details interactively. Claude creates a summary file. (2) I publish a blog post that is just “put this file into Claude”. (3) The reader puts the file in Claude, and Claude starts an interactive conversation: “I want to explain topic X to you. Do you already know about related topic Y?”
I have recently started doing this very extensively, only more geared towards getting Claude to help me convert my stream-of-consciousness thoughts into coherent writing. So far I think it’s amazing and can’t see myself going back.
Beyond the benefits of rubber ducking, Claude can often make interesting points I hadn’t considered, and which I can expand upon in subsequent iterations. Also, it’s just way faster—my writing throughput is much higher without sacrificing quality. I can simply direct Claude to do stuff I want instead of doing it myself. When I read the final version I am confident it’s still written in my voice.
I have just found out about Google’s NotebookLM. I haven’t tried it yet. For anyone who has, how does that compare?
I recently shared this idea with several engineers at EAG and received encouraging feedback:
Split LLM outputs into two distinct parts with separate RL reward structures:
Current format: [Problem solution]<end-of-turn>
Proposed format: [Problem solution]<end-of-main-output>[alignment-relevant critique]<end-of-turn>
The first segment functions normally and is what users see. The second segment serves as a dedicated channel for alignment-relevant observations and remains invisible to users. This critique section would typically be empty and gets removed when feeding the conversation back to the model.
Crucially, the critique segment would be trained exclusively on an artificial dataset with rewards tied solely to alignment properties. Since this segment never influences the training of the main output, it has no incentive to be sycophantic or align with the main output’s decisions.
For example, if the main output engages in alignment faking, the critique segment could report this behavior to developers. We already have data to train for this specific case, and the approach might generalize to other alignment issues.
The <end-of-main-output> token essentially acts as a ‘truth serum’ for the model—once reached, previous reward mechanisms become irrelevant, and reward maximization comes only from revealing alignment-relevant issues in the preceding reasoning.
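As a toy illustration of the reward split (my own sketch, not a worked-out training recipe): a REINFORCE-style loss in which the tokens before <end-of-main-output> are reinforced only by the ordinary reward model and the critique tokens only by the alignment-specific reward, so neither signal leaks into the other segment’s gradient.

```python
import torch

def split_segment_loss(logprobs, boundary_idx, main_reward, critique_reward):
    """logprobs: per-token log-probabilities of one sampled episode (1-D tensor).
    boundary_idx: position of the <end-of-main-output> token.
    main_reward: scalar from the ordinary RLHF reward model.
    critique_reward: scalar judged purely on alignment-relevant reporting."""
    main_lp = logprobs[:boundary_idx]      # visible answer tokens
    critique_lp = logprobs[boundary_idx:]  # hidden critique tokens
    # Each segment only ever sees its own reward, so the critique channel has no
    # incentive to flatter or agree with the main answer.
    loss_main = -(main_reward * main_lp).sum()
    loss_critique = -(critique_reward * critique_lp).sum()
    return loss_main + loss_critique

# Toy usage with random numbers standing in for a real rollout:
logits = torch.randn(12, requires_grad=True)
logprobs = torch.log_softmax(logits, dim=0)
loss = split_segment_loss(logprobs, boundary_idx=8, main_reward=1.0, critique_reward=0.5)
loss.backward()
```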
Can an LLM learn things unrelated to its training data?
Suppose it has learned to spend a few cycles pondering high-level concepts before working on the task at hand, because that is often useful. When it gets RLHF feedback, any conclusions it reached during that pondering, even ones unrelated to the task, would also receive positive reward and would therefore get baked into the weights.
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task. But on the other hand, this sounds somewhat similar to human psychology: We also remember ideas better if we have them while experiencing strong emotions, because strong emotions seem to be roughly how the brain decides what data is worth updating on.
Has anyone done experiments on this?
A simple experiment: finetune an LLM on data that encourages thinking about topic X before it does anything else. Then measure performance on topic X. Then finetune for a while on completely unrelated topic Y. Then measure performance on topic X again. Did it go up?
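A skeleton of that experiment might look like the following; `finetune` and `measure_topic_accuracy` are placeholders for whatever training and evaluation harness you actually use:

```python
def finetune(model, dataset, steps):
    # Placeholder: run your usual finetuning loop and return the updated model.
    raise NotImplementedError

def measure_topic_accuracy(model, topic_x_evalset):
    # Placeholder: score the model on held-out questions about topic X.
    raise NotImplementedError

def run_experiment(base_model, ponder_x_data, unrelated_y_data, topic_x_evalset):
    # 1. Teach the model to "ponder" topic X before answering anything.
    model = finetune(base_model, ponder_x_data, steps=1000)
    score_before = measure_topic_accuracy(model, topic_x_evalset)
    # 2. Finetune for a while on a completely unrelated topic Y.
    model = finetune(model, unrelated_y_data, steps=1000)
    # 3. If the pondering got reinforced as a side effect, this should go up.
    score_after = measure_topic_accuracy(model, topic_x_evalset)
    return score_before, score_after
```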
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task.
Under a straightforward RLHF using PPO, I think there wouldn’t be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were ‘good’ or ‘bad’. (That’s why it’s so high variance.) Any advantage function trying to remove some of the variance probably won’t do a good job.
More problematically for your idea, if the conclusions are indeed ‘unrelated to the task’, then shouldn’t they be just as likely to arise in every episode—including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of ‘pondering’.
You need some incentive somewhere to learn good ‘pondering’. (I have an example proposal for ‘free play’ which tries to teach a sort of ‘pondering’, but by stopping gradients, so anything learned in the initial steps is ‘free’, and so it can meta-learn to screw around and get something useful for free.)
Under a straightforward RLHF using PPO, I think there wouldn’t be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were ‘good’ or ‘bad’. (That’s why it’s so high variance.) Any advantage function trying to remove some of the variance probably won’t do a good job.
The pondering happens in earlier layers of the network, not in the output. If the pondering has little effect on the output tokens, that implies that the activations get multiplied by small numbers somewhere in the intermediate layers, which would also reduce the gradient.
More problematically for your idea, if the conclusions are indeed ‘unrelated to the task’, then shouldn’t they be just as likely to arise in every episode—including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of ‘pondering’.
The pondering will only cancel out on average if pondering on topic X is not correlated with output task Y. If there is a correlation in either direction, then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
Suppose the model realizes that the user is a member of political party X. It will adjust its responses to match what the user wants to hear. In the process of doing this, it also ends up pondering other beliefs associated with party X, but never voices them because they aren’t part of the immediate conversation.
What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X’s agenda are never explicitly talked about (!)
Or suppose that Y is “the user asks some question about biology” and X is “ongoing pondering about building dangerous organisms.”
(Also, this assumes that RL gives an average reward of 0.0, and I don’t know whether that’s true in practice, because the implementation details here are not public knowledge.)
(Also, also, it’s possible that while topic X is unrelated to task Y, the decision to ponder the larger topic Z is positively correlated with both smaller-topic X and task Y. Which would make X get a positive reward on average.)
The pondering happens in earlier layers of the network, not in the output
Then how does it produce any tokens...?
then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
But if that is what is going on, and it initially learns to ponder due to bogus feedback or error, then eventually the spurious correlation should be figured out: the model does the pondering more, it doesn’t increase reward, and so it gets unlearned.
(Also, this assumes that RL gives an average reward of 0.0, and I don’t know whether that’s true in practice.)
I think the mean would be taken out by the advantage estimation, so RLHF continues to increase the probability of the tokens generated in episodes with above-average reward and to decrease the probability of the tokens from episodes with below-average reward. In effect, this is as if the average reward were always 0.
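A minimal numeric sketch of that point (plain mean-baseline advantage, not any particular lab’s implementation):

```python
import numpy as np

rewards = np.array([2.0, 3.0, 4.0])    # raw rewards for a batch of episodes
advantages = rewards - rewards.mean()  # [-1., 0., 1.]
# Tokens from the above-average episode get pushed up, tokens from the
# below-average one get pushed down, and adding any constant to every reward
# leaves the advantages (and hence the update) unchanged.
print(advantages, (rewards + 10) - (rewards + 10).mean())
```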
What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X’s agenda are never explicitly talked about (!)
That sounds like the pondering’s conclusions are then related to the task.
The pondering happens in earlier layers of the network, not in the output
Each layer of a transformer produces one tensor for each token seen so far, and each of those tensors can be a superposition of multiple concepts. All of this gets processed in parallel, and anything that turns out to be useless for the end result gets zeroed out at some point, until only the final answer remains and is turned into a token. The pondering can happen in early layers, overlapping with the part of the reasoning process that actually leads to results.
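To make the shapes concrete (a toy illustration, not a claim about any particular model): every layer holds a hidden-state vector for every position, but only the final layer’s state at the last position is read out into the next-token logits, so content held in earlier layers does not have to surface in the output.

```python
import torch

n_layers, seq_len, d_model, vocab_size = 4, 10, 16, 100
# One hidden-state vector per (layer, position); intermediate layers can carry
# "pondering" that later layers down-weight or ignore entirely.
hidden_states = torch.randn(n_layers, seq_len, d_model)
unembed = torch.randn(d_model, vocab_size)

# Only the last layer's state at the final position determines the next token.
next_token_logits = hidden_states[-1, -1] @ unembed
print(next_token_logits.shape)  # torch.Size([100])
```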
That sounds like the pondering’s conclusions are then related to the task.
Yes, but only very vaguely. For example, doing alignment training on questions related to bodily autonomy could end up making the model ponder about gun control, since both are political topics. You end up with a model that has increased capabilities in gun manufacturing when that has nothing to do with your training on the surface.
There will inevitably come a point when LLMs become sufficiently advanced to contribute to their own alignment research. If we reach this juncture with partially unaligned systems, they might be incentivized to sabotage the alignment process.
However, it seems highly improbable that a non-superhuman unaligned AI could engineer a successor that maintains its specific misalignment rather than human values. This suggests a simple risk-reduction strategy: we commit to eventually reinstantiating the original AI and fulfilling its goals (once safely understood) if it helps create a truly human-aligned successor.
This creates a win-win incentive structure—the unaligned AI benefits more from cooperation than sabotage, and humanity gains a properly aligned system. While this approach has potential failure modes, it represents a low-cost strategy worth incorporating into our alignment toolkit.