Dan H

Karma: 3,445

newsletter.safe.ai

newsletter.mlsafety.org

Dan H Apr 4, 2025, 8:18 PM
3 points
4
in reply to: Tao Lin’s comment on: Good Research Takes are Not Sufficient for Good Strategic Takes
If a strategy is likely to be outdated quickly it’s not robust and not a good strategy. Strategies should be able to withstand lots of variation.

Dan H Feb 13, 2025, 12:18 AM
16 points
3
in reply to: Zach Stein-Perlman’s comment on: Zach Stein-Perlman’s Shortform

capability thresholds be vague or extremely high

xAI’s thresholds are entirely concrete and not extremely high.

evaluation be unspecified or low-quality

They are specified and as high-quality as you can get. (If there are better datasets let me know.)

I’m not saying it’s perfect, but I wouldn’t put them all in the same bucket. Meta’s is very different from DeepMind’s or xAI’s.

Dan H Feb 10, 2025, 9:28 PM
14 points
12
in reply to: Drake Thomas’s comment on: Drake Thomas’s Shortform

though I don’t think xAI took an official position one way or the other

I assumed most everybody assumed xAI supported it since Elon did. I didn’t bother pushing for an additional xAI endorsement given that Elon endorsed it.

Dan H Jan 19, 2025, 1:37 AM
32 points
0
in reply to: meemi’s comment on: meemi’s Shortform
It’s probably worth them mentioning for completeness that Nat Friedman funded an earlier version of the dataset too. (I was advising at that time and provided the main recommendation that it needs to be research-level because they were focusing on Olympiad level.)

Also can confirm they aren’t giving access to the mathematicians’ questions to AI companies other than OpenAI, such as xAI.

Dan H Dec 3, 2024, 4:45 AM
10 points
2
on: (The) Lightcone is nothing without its people: LW + Lighthaven’s big fundraiser
and have clearly been read a non-trivial amount by Elon Musk
Nit: He heard this idea in conversation with an employee AFAICT.

Dan H Aug 26, 2024, 1:03 PM
4 points
−14
on: Darwinian Traps and Existential Risks
Relevant: Natural Selection Favors AIs over Humans

universal optimization algorithm

Evolution is not an optimization algorithm (this is a common misconception discussed in Okasha, Agents and Goals in Evolution).

Dan H Aug 2, 2024, 3:27 PM
3 points
0
on: Unlearning via RMU is mostly shallow
We have been working for months on this issue and have made substantial progress on it: Tamper-Resistant Safeguards for Open-Weight LLMs

General article about it: https://www.wired.com/story/center-for-ai-safety-open-source-llm-safeguards/

Dan H Jul 31, 2024, 1:17 AM
3 points
2
in reply to: Aaron_Scher’s comment on: Re: Anthropic’s suggested SB-1047 amendments
It’s real.

Dan H Jul 18, 2024, 5:49 AM
3 points
0
AF
on: An Introduction to Representation Engineering—an activation-based paradigm for controlling LLMs
It’s worth noting that activations are one thing you can modify, but many of the most performant methods (e.g., LoRRA) modify the weights. (Representations = {weights, activations}, hence “representation” engineering.)

Dan H Jul 17, 2024, 11:01 PM
4 points
2
in reply to: habryka’s comment on: Towards more cooperative AI safety strategies
“Bay Area EA alignment community”/“Bay Area EA community”? (Most EAs in the Bay Area are focused on alignment compared to other causes.)

Dan H Jul 17, 2024, 4:47 PM
14 points
9
on: Towards more cooperative AI safety strategies

The AI safety community is structurally power-seeking.

I don’t think the set of people interested in AI safety is even a “community” given how diverse it is (Bengio, Brynjolfsson, Song, etc.), so I think it’d be more accurate to say “Bay Area AI alignment community is structurally power-seeking.”

Dan H Jun 22, 2024, 3:14 AM
LW: 26 AF: 10
1
AF
in reply to: Fabien Roger’s comment on: Fabien’s Shortform

Got a massive simplification of the main technique within days of being released

The loss is cleaner, IDK about “massively,” because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn’t affect performance and doesn’t markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.

p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.

This puzzles me and maybe we just have a different sense of what progress in adversarial robustness looks like. 20% that no one could find a jailbreak within 3 months? That would be the most amazing advance in robustness ever if that were true and should be a big update on jailbreak robustness tractability. If it takes the community more than a day that’s a tremendous advance.

people will easily find reliable jailbreaks

This is a little nonspecific (does easily mean >0% ASR with an automated attack, or does it mean a high ASR?). I should say we manually found a jailbreak after messing with the model for around a week after releasing. We also invited people who have a reputation as jailbreakers to poke at it and they had a very hard time. Nowhere did we claim “there are no more jailbreaks and they are solved once and for all,” but I do think it’s genuinely harder now.

Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison

We had the idea a few times to try out a detection-based approach but we didn’t get around to it. It seems possible that it’d perform similarly if it’s leaning on the various things we did in the paper. (Obviously probing has been around but people haven’t gotten results at this level, and people have certainly tried detecting adversarial attacks in hundreds of papers in the past.) IDK if performance would be that different from circuit-breakers, in which case this would still be a contribution. I don’t really care about the aesthetics of methods nearly as much as the performance, and similarly performing methods are fine in my book. A lot of different-looking deep learning methods perform similarly. A detection-based method seems fine, so does a defense that’s tuned into the model; maybe they could be stacked. Maybe will run a detector probe this weekend and update the paper with results if everything goes well. If we do find that it works, I think it’d be unfair to describe this after the fact as “overselling results and using fancy techniques that don’t improve on simpler techniques” as done for RMU.

My main disagreement is with the hype.

We’re not responsible for that. Hype is inevitable for most established researchers. Mediocre big AI company papers get lots of hype. Didn’t even do customary things like write a corresponding blog post yet. I just tweeted the paper and shared my views in the same tweet: I do think jailbreak robustness is looking easier than expected, and this is affecting my priorities quite a bit.

Aims to do unlearning in a way that removes knowledge from LLMs

Yup that was the aim for the paper and for method development. We poked at the method for a whole month after the paper’s release. We didn’t find anything, though in that process I slowly reconceptualized RMU as more of a circuit-breaking technique and something that’s just doing a bit of unlearning. It’s destroying some key function-relevant bits of information that can be recovered, so it’s not comprehensively wiping. IDK if I’d prefer unlearning (grab concept and delete it) vs circuit-breaking (grab concept and put an internal tripwire around it); maybe one will be much more performant than the other or easier to use in practice. Consequently I think there’s a lot to do in developing unlearning methods (though I don’t know if they’ll be preferable to the latter type of method).

overselling results and using fancy techniques that don’t improve on simpler techniques

This makes it sound like the simplification was lying around and we deliberately made it more complicated, only to update it to have a simpler forget term. We compare to multiple baselines, do quite a bit better than them, do enough ablations to be accepted at ICML (of course there are always more you could want), and all of our numbers are accurate. We could have just included the dataset without the method in the paper, and it would have still got news coverage (Alex Wang who is a billionaire was on the paper and it was on WMDs).

Probably the only time I chose to use something a little more mathematically complicated than was necessary was the Jensen-Shannon loss in AugMix. It performed similarly to doing three pairwise l2 distances between penultimate representations, but this was more annoying to write out. Usually I’m accused of doing papers that are on the simplistic side (sometimes papers like the OOD baseline paper caused frustration because it’s getting credit for something very simple) since I don’t optimize for cleverness, and my collaborators know full well that I discourage trying to be clever since it’s often anticorrelated with performance.
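
To make the comparison concrete, here is a minimal NumPy sketch of the two consistency terms mentioned above: a Jensen-Shannon divergence over the predictions for a clean input and two augmented views, and the simpler alternative of three pairwise l2 distances between penultimate representations. This is an illustration only, not the AugMix code; the function names, the batch-mean reduction, and the use of unsquared l2 distances are assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q) per example; eps guards against log(0).
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def js_consistency(logits_clean, logits_aug1, logits_aug2):
    # Jensen-Shannon divergence among three predictive distributions:
    # the average KL from each distribution to their mixture.
    p0, p1, p2 = (softmax(x) for x in (logits_clean, logits_aug1, logits_aug2))
    m = (p0 + p1 + p2) / 3.0
    return np.mean((kl(p0, m) + kl(p1, m) + kl(p2, m)) / 3.0)

def pairwise_l2_consistency(feat_clean, feat_aug1, feat_aug2):
    # Hypothetical simpler alternative: sum of the three pairwise l2
    # distances between penultimate-layer features, averaged over the batch.
    def dist(a, b):
        return np.linalg.norm(a - b, axis=-1)
    return np.mean(dist(feat_clean, feat_aug1)
                   + dist(feat_clean, feat_aug2)
                   + dist(feat_aug1, feat_aug2))
```

Both terms are zero when the three views agree and grow as they diverge; the l2 version needs no softmax or logarithm, which is the sense in which it is the simpler of the two.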

Not going to check responses because I end up spending too much time typing for just a few viewers.

Dan H Jun 2, 2024, 5:41 PM
14 points
5
on: What do coherence arguments actually prove about agentic behavior?
Key individuals that the community is structured around just ignored it, so it wasn’t accepted as true. (This is a problem with small intellectual groups.)

Dan H May 27, 2024, 7:32 AM
LW: 15 AF: 8
0
AF
in reply to: Buck’s comment on: Buck’s Shortform
Some years ago we wrote that “[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries” and discussed monitoring systems, noting that “AI tripwires could help uncover early misaligned systems before they can cause damage.” https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Adversarial_Robustness

Since then, I’ve updated that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). In vision settings, progress is extraordinarily slow but not necessarily for LLMs.

Dan H May 5, 2024, 4:59 PM
30 points
2
on: Introducing AI Lab Watch
Various comments:

I wouldn’t call this “AI lab watch.” “Lab” has the connotation that these are small projects instead of multibillion dollar corporate behemoths.

“deployment” initially sounds like “are they using output filters which harm UX in deployment”, but instead this seems to be penalizing organizations if they open source. This seems odd since open sourcing is not clearly bad right now. The description also makes claims like “Meta release all of their weights”—they don’t release many image/video models because of deepfakes, so they are doing some cost-benefit analysis. Zuck: “So we want to see what other people are observing, what we’re observing, what we can mitigate, and then we’ll make our assessment on whether we can make it open source.” If this is mainly a penalty against open sourcing the label should be clearer.

“Commit to do pre-deployment risk assessment” They’ve all committed to this in the WH voluntary commitments and I think the labs are doing things on this front.

“Do risk assessment” These companies have signed on to WH voluntary commitments so are all checking for these things, and the EO says to check for these hazards too. This is why it’s surprising to see Microsoft have 1% given that they’re all checking for these hazards.

Looking at the scoring criteria, this seems highly fixated on rogue AIs, but I understand I’m saying that to the original forum of these concerns. Risk assessment’s scoring doesn’t really seem to prioritize bio x-risk as much as scheming AIs. This is strange because if we’re focused on rogue AIs I’d put half the priority on risk mitigation while the model is training. Many rogue AI people may think that in half of the scenarios where the AI kills everyone, it happens while the model is “training” (because it will escape during that time).

The first sentence of this site says the focus is on “extreme risks” but it seems the focus is mainly on rogue AIs. This should be upfront that this is from the perspective that loss of control is the main extreme risk, rather than positioning itself as a comprehensive safety tracker. If I were tracking rogue AI risks, I’d probably drill down to what they plan to do with automated AI R&D/intelligence explosions.

“Training” This seems to give way more weight to rogue AI stuff. Red teaming is actually assessable, but instead you’re giving twice the points if they have someone “work on scalable oversight.” This seems like an EA vibes check rather than actually measuring something. This also seems like triple counting since it’s highly associated with the “scalable alignment” section and the “alignment program” section. This doesn’t even require that they use the technique for the big models they train and deploy. Independently, capabilities work related to building superintelligences can easily be framed as scalable oversight, so this doesn’t set good incentives. Separately, at the end this also gives lots of points for voluntary (read: easily breakable) commitments. These should not be trusted and I think the amount of lip-service points is odd.

“Security” As I said on EAF the security scores are suspicious to me and even look backward. The major tech companies have much more experience protecting assets (e.g., clouds need to be highly secure) than startups like Anthropic and OpenAI. It takes years building up robust information security and the older companies have a sizable advantage.

“internal governance” scores seem odd. Older, larger institutions such as Microsoft and Google have many constraints and processes and don’t have leaders who can unilaterally make decisions as easily, compared to startups. Their CEOs are also more fireable (OpenAI), and their board members aren’t all selected by the founder (Anthropic). This seems highly keyed into if they are just a PBC or non-profit. In practice PBC just makes it harder to sue, but Zuck has such control of his company that getting successfully sued for not upholding his fiduciary duty to shareholders seems unlikely. It seems 20% of the points is not using non-disparagement agreements?? 30% is for whistleblower policies; CA has many whistleblower protections if I recall correctly. No points for a chief risk officer or internal audit committee?

“Alignment program” “Other labs near the frontier publish basically no alignment research” Meta publishes dozens of papers they call “alignment”; these actually don’t feel that dissimilar to Constitutional AI-like papers (https://twitter.com/jaseweston/status/1748158323369611577 https://twitter.com/jaseweston/status/1770626660338913666 https://arxiv.org/pdf/2305.11206 ). These papers aren’t posted to LW but they definitely exist. To be clear I think this is general capabilities but this community seems to think differently. Alignment cannot be “did it come from EA authors” and it probably should not be “does it use alignment in its title.” You’ll need to be clear how this distinction is drawn.

Meta has people working on safety and CBRN+cyber + adversarial robustness etc. I think they’re doing a good job (here are two papers from the last month: https://arxiv.org/pdf/2404.13161v1 https://arxiv.org/pdf/2404.16873).

As is, I think this is a little too quirky and not ecumenical enough for it to generate social pressure.

There should be points for how the organizations act wrt legislation. In the SB 1047 bill that CAIS co-sponsored, we’ve noticed some AI companies to be much more antagonistic than others. I think this is probably a larger differentiator for an organization’s goodness or badness.

(Won’t read replies since I have a lot to do today.)

Dan H Apr 28, 2024, 1:56 AM
LW: -2 AF: 1
−22
AF
in reply to: Nina Panickssery’s comment on: Refusal in LLMs is mediated by a single direction

is novel compared to… RepE

This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405

Demonstrate full ablation of the refusal behavior with much less effect on coherence

In our paper and notebook we show the models are coherent.

Investigate projection

We did investigate projection too (we use it for concept removal in the RepE paper) but didn’t find a substantial benefit for jailbreaking.

harmful/harmless instructions

We use harmful/harmless instructions.

Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer

In the RepE paper we target multiple layers as well.

Test on many different models

The paper used Vicuna, the notebook used Llama 2. Throughout the paper we showed the general approach worked on many different models.

Describe a way of turning this into a weight-edit

We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE).

Dan H Apr 28, 2024, 12:40 AM
LW: 1 AF: 1
−5
AF
in reply to: Nina Panickssery’s comment on: Refusal in LLMs is mediated by a single direction

but generally people should be free to post research updates on LW/AF that don’t have a complete thorough lit review / related work section.

I agree if they simultaneously agree that they don’t expect the post to be cited. These can’t posture themselves as academic artifacts (“Citing this work” indicates that’s the expectation) and fail to mention related work. I don’t think you should expect people to treat it as related work if you don’t cover related work yourself.

Otherwise there’s a race to the bottom and it makes sense to post daily research notes and flag plant that way. This increases pressure on researchers further.

including refusal-bypassing-related ones

The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is.

I’ll note pretty much every time I mention something isn’t following academic standards on LW I get ganged up on and I find it pretty weird. I’ve reviewed, organized, and can be senior area chair at ML conferences and know the standards well. Perhaps this response is consistent because it feels like an outside community imposing things on LW.

Dan H Apr 27, 2024, 9:14 PM
1 point
−21
AF
in reply to: Andy Arditi’s comment on: Refusal in LLMs is mediated by a single direction
From Andy Zou:

Thank you for your reply.

Model interventions to bypass refusal are not discussed in Section 6.2.

We perform model interventions to robustify refusal (your section on “Adding in the “refusal direction” to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments shows refusal can be mediated by a single direction, in keeping with the title of this post.

we examined Section 6.2 carefully before writing our work

Not mentioning it anywhere in your work is highly unusual given its extreme similarity. Knowingly not citing probably the most related experiments is generally considered plagiarism or citation misconduct, though this is a blog post so norms for thoroughness are weaker. (lightly edited by Dan for clarity)

Ablating vs. Addition

We perform a linear combination operation on the representation. Projecting out the direction is one instantiation of it with a particular coefficient, which is not necessary as shown by our GitHub demo. (Dan: we experimented with projection in the RepE paper and didn’t find it was worth the complication. We look forward to any results suggesting a strong improvement.)
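
For readers who want the relationship spelled out, here is a minimal NumPy sketch, assuming a hidden state `h` and a single direction `d`. It is not the RepE or GitHub demo code, and the function names are hypothetical. Adding the direction is `h + coeff * d_hat`; flipping the sign of `coeff` reverses the shift; projecting the direction out is the same operation with the particular per-example coefficient `-(h . d_hat)`.

```python
import numpy as np

def add_direction(h, d, coeff):
    # Linear combination on the representation: h + coeff * d_hat.
    # h has shape (..., dim); coeff is a scalar or has shape h.shape[:-1].
    d_hat = d / np.linalg.norm(d)
    return h + np.asarray(coeff)[..., None] * d_hat

def project_out(h, d):
    # Projection is the special case where the coefficient cancels
    # each example's existing component along the direction.
    d_hat = d / np.linalg.norm(d)
    return add_direction(h, d, -(h @ d_hat))
```

With a fixed coefficient the shift is the same for every input, whereas projection picks an input-dependent coefficient so that the result has no component left along `d`.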

--

Please reach out to Andy if you want to talk more about this.

Edit: The work is prior art (it’s been over six months+standard accessible format), the PIs are aware of the work (the PI of this work has spoken about it with Dan months ago, and the lead author spoke with Andy about the paper months ago), and its relative similarity is probably higher than any other artifact. When this is on arXiv we’re asking you to cite the related work and acknowledge its similarities rather than acting like these have little to do with each other/not mentioning it. Retaliating by some people dogpile voting/ganging up on this comment to bury sloppy behavior/an embarrassing oversight is not the right response (went to −18 very quickly).

Edit 2: On X, Neel “agree[s] it’s highly relevant” and that he’ll cite it. Assuming it’s covered fairly and reasonably, this resolves the situation.

Edit 3: I think not citing it isn’t a big deal because I think of LW as a place for ML research rough drafts, in which errors will happen. But if some are thinking it’s at the level of an academic artifact/is citable content/is an expectation others cite it going forward, then failing to mention extremely similar results would actually be a bigger deal. Currently I think it’s the former.

Dan H Apr 27, 2024, 6:18 PM
2 points
−20
AF
on: Refusal in LLMs is mediated by a single direction
From Andy Zou:

Section 6.2 of the Representation Engineering paper shows exactly this (video). There is also a demo here in the paper’s repository which shows that adding a “harmlessness” direction to a model’s representation can effectively jailbreak the model.

Going further, we show that using a piece-wise linear operator can further boost model robustness to jailbreaks while limiting exaggerated refusal. This should be cited.

Dan H Apr 12, 2024, 6:36 AM
8 points
0
on: A Gentle Introduction to Risk Frameworks Beyond Forecasting
If people are interested, many of these concepts and others are discussed in the context of AI safety in this publicly available chapter: https://www.aisafetybook.com/textbook/4-1