Steering Gemini with BiDPO

Link post

Coauthored with Mark Kurzeja

A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO no longer beat these stronger baselines, so we ended the project.
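
To make the setup concrete, here is a minimal sketch of the DPO-style preference loss that a method like BiDPO could optimize, where the only trainable parameters are the steering vector's entries and the frozen model acts as the reference policy. This is the standard DPO objective, not necessarily BiDPO's exact formulation; the function name and `beta` value are illustrative assumptions.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on one preference pair, given completion log-probs.

    pi_*  : log-probs under the steered model (frozen weights + steering vector)
    ref_* : log-probs under the unsteered reference model
    beta  : strength of the KL-like regularization (assumed value)
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): zero margin gives log 2; a steering vector that
    # shifts probability toward the chosen (e.g. truthful) completion drives
    # the loss toward 0.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Gradients of this loss with respect to the steering vector alone would then be used to update it, leaving the model weights untouched.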

...

BiDPO seems effective and sample-efficient, but it does not currently exceed more standard baselines. It’s hard to draw firm conclusions about BiDPO because TruthfulQA might not actually be measuring truthfulness/factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages, particularly for targeted loss mitigation.
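
For intuition, Conditional Activation Steering applies a behavior vector only when the hidden state matches some condition, rather than on every forward pass. The sketch below is a hypothetical toy version under assumed names: a cosine-similarity check against a "condition" direction gates whether the steering vector is added.

```python
import numpy as np

def conditionally_steer(hidden, condition_vec, steer_vec, threshold, alpha=4.0):
    """Add `steer_vec` to `hidden` only when the condition fires.

    hidden        : activation vector at some layer (toy stand-in)
    condition_vec : direction whose presence triggers steering
    threshold     : cosine-similarity cutoff (assumed hyperparameter)
    alpha         : steering strength (assumed hyperparameter)
    """
    cond_unit = condition_vec / np.linalg.norm(condition_vec)
    similarity = float(hidden @ cond_unit) / np.linalg.norm(hidden)
    if similarity > threshold:
        # Condition matched: apply the behavior vector.
        return hidden + alpha * steer_vec
    # Condition not matched: leave the activation untouched,
    # so unrelated capabilities are unaffected.
    return hidden
```

The appeal for targeted loss mitigation is visible in the gate: off-condition inputs pass through unchanged, so the intervention's capability cost is confined to the triggering contexts.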

This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines.

I’d also like to note that @ryan_greenblatt’s skepticism predicted this outcome more strongly than my worldview did. I want him to get points for that. :) While I think steering has targeted applications and provides clues about how LLMs function, it’s not a slam-dunk Pareto improvement on benchmarks we care about.

Read at https://turntrout.com/gemini-steering![1]

  1. ^