Interesting results. Thanks for doing the investigation. I would love to see some examples of the chains of thought before and after optimisation, and whether they're basically the same thing reworded or semantically very different. Did you notice anything interesting when reading them?
My intuition, at least for the single-word case, is that saying a word increases the probability the model will say it again later. Therefore, if it wants the probability that it says that word in the future to be low, it's incentivised not to say it anywhere.
The LLM judge results seem notably more interesting, and I'd love to better understand what's up there.
Hype! Here's a 15-minute brainstorm of questions:
What would you work on if not control? Bonus points for sketching out the next 5+ new research agendas you would pursue, in priority order, assuming each previous one stopped being neglected.
What is the field of AI safety messing up? Bonus: for $field in {AI safety fields}: what are researchers in $field wrong about, or making poor decisions about, in a way that significantly limits their impact?
What are you most unhappy about with how the control field has grown, and with the related work happening elsewhere?
What are some common beliefs held by AI safety researchers about their domains of expertise that you disagree with (pick your favourite domain)?
What beliefs inside Constellation have not percolated into the wider safety community but really should?
What have you changed your mind about in the last 12 months?
You say that you don’t think control will work indefinitely and that sufficiently capable models will break it. Can you make that more concrete? What kind of early warning signs could we observe? Will we know when we reach models capable enough that we can no longer trust control?
If you were in charge of Anthropic what would you do?
If you were David Sacks, what would you do?
If you had a hundred cracked MATS scholars and $10,000 of compute each, what would you have them do?
If I gave you billions of dollars and 100 top researchers at a frontier lab, what would you do?
I’m concerned that the safety community spends way too much energy on more meta things like control, evals, interpretability, etc., and has somewhat lost sight of solving the damn alignment problem. Takes? If you agree, what do you think someone who wants to solve the alignment problem should actually be doing about it right now?
What are examples of safety questions that you think are important and can likely be studied on the models of the next 2 years, but not on today's publicly available frontier models? (How far out: 0.5 years? 1? 5? Not until the 6 months before AGI?)
If you're wrong about a safety-related belief that you currently hold at over 50% confidence, which one do you predict it is, and why?
What model organisms would you be most excited to see people produce? (Ditto for any other open-source work.)
What are some mistakes you predict many listeners are making? Bonus points for mistakes you think I personally am making.
What is the most positive true thing you have to say about the field of ambitious mechanistic interpretability?
What does Redwood look for when hiring people, especially junior researchers?
What kinds of mid-career professionals would you be most excited to see switch to control? What about other areas of AI safety?
What should AGI lab safety researchers be doing differently to have a greater impact? Feel free to give a different answer per lab.