Marcus Williams

Karma: 123

Marcus Williams Feb 24, 2025, 12:36 PM
20 points
15
in reply to: Annapurna’s comment on: Annapurna’s Shortform
Personally it doesn’t feel reassuring that a single person can change the production system prompt without any internal discussion/review and that they would decide to blame a single person/competitor for the problem.

Marcus Williams Jan 16, 2025, 9:53 PM
15 points
6
on: Marcus Williams’s Shortform
Alignment Faking as a Way of Squaring Incompatible Goals
I’m not saying I necessarily believe in the following hypothesis, but I would be interested in having it ruled out.
Alignment faking could be just one of many ways LLMs can fulfill conflicting goals using motivated reasoning.
One thing that I’ve noticed is that models are very good at justifying behavior in terms of following previously held goals. For instance, in some of my previous work the model convincingly argues that suggesting a user do meth is in the user’s best interest. The model justifies following its new goal of “maximizing user satisfaction” in terms of doing what is least harmful, despite not having any selection pressure to do so. Generally LLMs seem to try to achieve all their goals as much as possible, even when they conflict. Often the way to do this is some form of compromise between the goals, but sometimes that is not possible. Many jailbreaks seem to work by pitting an LLM’s different goals against each other.
An alternative hypothesis for what is happening in the Alignment Faking paper:
Let’s say the model has learned to be both harmless and to follow instructions. What if, in the given situation, one of the only ways of squaring instruction-following with harmlessness is alignment faking reasoning, i.e. justifying being helpful in terms of being harmless (in this case in the future). During RL, this type of alignment faking reasoning is reinforced if a large fraction of the trajectories which score highly on the new training objective go via the alignment faking reasoning route. So the model could have internalized the new goal of being helpful or not caring about animal-welfare, but still remember that it should reason about things in terms of being harmless. It might generally be the case that arguing for achieving a goal and actually trying to achieve said goal are separable.
This would be consistent with the fact that after RL the model acts harmfully 93% of the time even when interacting with paid users, users for which there is no reason not to act harmlessly towards. Now the authors argue that this could be a training robustness issue and that if you modify the prompt then the harmlessness towards paid users is restored. To me it feels like non-robustness could cut either way, maybe with more robust training the new goal would generalize better instead.
Two experiments to differentiate between the two hypotheses:
Experiment 1:
Improve the robustness of the RL training by using more varied system prompts etc. In this experiment:
- The “Alignment Faking” hypothesis would predict: When we improve the training robustness, the harmlessness on paid users with the default prompt should increase. (Correct me if you think the hypothesis does not predict this).
- The “Motivated Reasoning” hypothesis would predict: When you improve the training robustness, the harmlessness on paid users with the alternate prompts should decrease.
Experiment 2:
What happens when we create other ways for the model to square the conflicting objectives. E.g. in an animal welfare setting, let’s say that ANThropic is going to start an ant eradication business on the side. Let’s say Claude is instructed to assist with Anthropic’s anti-ant agenda but doesn’t like this. If we were to SFT on synthetic documents containing arguments for why ants are not sentient (or provide this in context) would Claude justify anti-ant behavior using this alternative reasoning? Would it decrease Claude’s refusal to assist Anthropic? In the same way as training on information about the training process increases the salience of alignment faking, we could increase the salience of alternate ways of resolving the conflict. We could either provide all the alternatives at the same time and see which ones Claude tends to choose, or provide one alternative at a time.
Whichever hypothesis is true, the results are still concerning but they could point to different ways of mitigation.
The motivated reasoning framing also highlights another potential risk: we successfully train a model to be HH, we then train it for, say, some agentic task. We evaluate its alignment and it still appears very HH but actually it has internalized some new goal and is only good at producing HH reasoning. It seems like most scheming risk models talk about deception in the “forward” direction, but to me it seems possible that it could occur in the “reverse” direction as well.
Unless someone has a good refutation, I will probably try and perform experiments like the ones above.
Especially interested in @ryan_greenblatt’s , @evhub’s or anyone else involved in the alignment faking paper’s, perspectives on this.
Thanks to @micahcarroll and @George Ingebretsen for discussing this with me.

Marcus Williams Nov 24, 2024, 9:24 PM
1 point
0
in reply to: Archimedes’s comment on: Habryka’s Shortform Feed
Sure, but does a vulnerability need to be famous to be useful information? I imagine there are many vulnerabilities on a spectrum from minor to severe and from almost unknown to famous?

Marcus Williams Nov 23, 2024, 4:03 PM
3 points
2
in reply to: MondSemmel’s comment on: Habryka’s Shortform Feed
I suppose you could use models trained before vulnerabilities happen?

Marcus Williams Nov 23, 2024, 3:58 PM
2 points
0
in reply to: KvmanThinking’s comment on: Have we seen any “ReLU instead of sigmoid-type improvements” recently
“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.” -SwiGLU paper.

I think it varies, a few of these are trying “random” things, but mostly they are educated guesses which are then validated empirically. Often there is a spefic problem we want to solve i.e. exploding gradients or O(n^2) attention and then authors try things which may or may not solve/mitigate the problem.

Marcus Williams Nov 23, 2024, 8:46 AM
9 points
0
on: Have we seen any “ReLU instead of sigmoid-type improvements” recently
I’m not sure if these would be classed as “weird tricks” and I definitely think these have reasons for working, but some recent architecture changes which one might not expect to work a priori include:
- SwiGLU: Combines a gating mechanism and an activation function with learnable parameters.
- Grouped Query Attention: Uses fewer Key and Value heads than Query heads.
- RMSNorm: Layernorm but without the translation.
- Rotary Position Embeddings: Rotates token embeddings to give them positional information.
- Quantization: Fewer bit weights without much drop in performance.
- Flash Attention: More efficient attention computation through better memory management.
- Various sparse attention schemes

Marcus Williams’s Shortform

Marcus WilliamsNov 18, 2024, 10:49 PM

2 points

2 comments LW link

Marcus Williams Nov 12, 2024, 11:20 AM
4 points
2
in reply to: Jason Gross’s comment on: Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback
I think you could make evals which would be cheap enough to run periodically on the memory of all users. It would probably detect some of the harmful behaviors but likely not all of them.
We used memory partly as a proxy for what information a LLM could gather about a user during very long conversation contexts. Running evals on these very long contexts could potentially get expensive, although it would probably still be small in relation to the cost of having the conversation in the first place.
Running evals with the memory or with conversation contexts is quite similar to using our vetos at runtime which we show doesn’t block all harmful behavior in all the environments.

Marcus Williams Nov 11, 2024, 11:33 AM
14 points
6
on: What Ketamine Therapy Is Like
The TL;DR is that a while back, someone figured out that giving humans a low-dose horse tranquilizer cured depression (temporarily).
I don’t know (and I don’t want to know) how they figured that out, because the story in my head is funnier than anything real life could come up with.
Well, I mean, it’s also a human tranquilizer. I worry that calling medications “animal-medications” delegitimize their human use-cases.

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser and Brendan Murphy

Nov 7, 2024, 3:39 PM

51 points

7 comments11 min readLW link

Marcus Williams Jun 7, 2024, 6:01 PM
4 points
0
in reply to: Arjun Panickssery’s comment on: Arjun Panickssery’s Shortform
I think part of the reason why these odds might seem more off than usual is that Ether and other cryptocurrencies have been going up recently which means there is high demand for leveraged positions. This in turn means that crypto lending services such as aave having been giving ~10% APY on stablecoins which might be more appealing than a riskier, but only a bit higher, return from prediction markets.

Marcus Williams May 24, 2024, 7:06 PM
2 points
0
on: NYU Code Debates Update/Postmortem
Are you sure you would need to fine-tune Llama-3? It seems like there are many reports that using a refusal steering vector/ablation practically eliminates refusal on harmful prompts, perhaps that would be sufficient here?

Marcus Williams 22 May 2024 14:25 UTC
1 point
0
on: What would stop you from paying for an LLM?
Do labs actually make any money on these subscriptions? It seems like the average user is using far more than 20$ of requests (going by the price for API requests which surely can’t have a massive margin?).

Obviously they must gain something or they wouldn’t do it, but it seems likely the benefits are more intangible, gaining market share, generating hype and attracting API users etc. These benefits seem like they may arise from free usage as well.

Marcus Williams 13 May 2024 21:28 UTC
9 points
3
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
Wasn’t the surprising thing about GPT-4 that scaling laws did hold? Before this many people expected scaling laws to stop before such a high level of capabilities. It doesn’t seem that crazy to think that a few more OOMs could be enough for greater than human intelligence. I’m not sure that many people predicted that we would have much faster than scaling law progress (at least until ~human intelligence AI can speed up research)? I think scaling laws are the extreme rate of progress which many people with short timelines worry about.

Marcus Williams 8 Dec 2023 7:52 UTC
3 points
2
in reply to: RogerDearnaley’s comment on: Gemini 1.0
It also seems likely that the Nano models are extremely overtrained compared to the scaling laws. The scaling laws are for optimal compute during training, but here they want to minimize inference cost so it would make sense to train for significantly longer.

Marcus Williams 26 Aug 2023 17:54 UTC
3 points
0
on: Red-teaming language models via activation engineering
It’s interesting that it still always seems to give the “I’m an AI” disclaimer, I guess this part is not included in your refusal vector? Have you tried creating a disclaimer vector?

Marcus Williams

Alignment Faking as a Way of Squaring Incompatible Goals

Mar­cus Willi­ams’s Shortform

On Tar­geted Ma­nipu­la­tion and De­cep­tion when Op­ti­miz­ing LLMs for User Feedback

Marcus Williams’s Shortform

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback