Quick takes from ICML 2024 in Vienna:
In the main conference, there were tons of papers mentioning safety/alignment, but few of them were good, as alignment has become a buzzword. Mechinterp is often no more advanced than where the EAs were in 2022.
Lots of progress on debate. On the empirical side, a debate paper got an oral. On the theory side, Jonah Brown-Cohen of DeepMind proved that debate can be efficient even when the thing being debated is stochastic, a stochastic version of this paper from last year. Apparently there has been some progress on obfuscated arguments too.
The Next Generation of AI Safety (NGAIS) workshop was kind of a mishmash of topics associated with safety. Most of the work was not related to x-risk, but there was interesting work on unlearning and other topics.
The Causal Incentives Group at DeepMind developed a quantitative measure of goal-directedness, which seems promising for evals.
Reception to my Catastrophic Goodhart paper was decent. An information theorist said there were good theoretical reasons the two settings we studied, KL divergence and best-of-n, behaved similarly. (A toy sketch of the best-of-n setting follows this list.)
OpenAI gave a disappointing safety presentation at NGAIS touting their new rule-based rewards technique, which is a variant of constitutional AI and seems really unambitious.
The mechinterp workshop often had higher-quality papers than the main conference. It was completely full: posters were right next to each other, and the room was so packed during talks that they stopped letting people in.
I missed a lot of the workshop, so I need to read some posters before having takes.
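A minimal Monte Carlo sketch of the best-of-n setting mentioned above (illustrative only, not the paper's construction): assume the proxy reward is true utility plus independent reward-model error. With light-tailed error, best-of-n keeps buying true utility; with heavy-tailed error, the top-scoring samples are mostly error outliers and true utility stalls. The KL figure in the comment uses the commonly cited upper bound log(n) - (n-1)/n relating best-of-n to the KL-penalty setting; the distributions here are arbitrary choices for illustration.

```python
# Toy sketch of the catastrophic Goodhart effect under best-of-n sampling.
# Proxy reward U = V + X, where V is true utility and X is reward-model error.
# Light-tailed X: best-of-n selection still picks out high-V samples.
# Heavy-tailed X: the argmax of U is dominated by error outliers, so the
# true utility of the selected sample barely improves with n.
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_true_utility(n, error_sampler, trials=20_000):
    """Average true utility of the sample with the highest proxy reward."""
    true_utility = rng.normal(size=(trials, n))      # V: what we actually care about
    error = error_sampler(size=(trials, n))          # X: reward-model error
    proxy = true_utility + error                     # U = V + X: what gets optimized
    picked = np.argmax(proxy, axis=1)                # best-of-n selection on the proxy
    return true_utility[np.arange(trials), picked].mean()

for n in [1, 4, 16, 64, 256]:
    light = best_of_n_true_utility(n, rng.normal)                                   # Gaussian error
    heavy = best_of_n_true_utility(n, lambda size: rng.standard_cauchy(size=size))  # heavy-tailed error
    kl_bound = np.log(n) - (n - 1) / n  # standard upper bound on KL(best-of-n || base)
    print(f"n={n:4d}  KL<= {kl_bound:5.2f}   true utility: light-tailed {light:5.2f}, heavy-tailed {heavy:5.2f}")
```

Running this, the light-tailed column grows steadily with n while the heavy-tailed column stays near zero, which is the qualitative behavior the paper's two settings are compared on.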
My opinions on the state of published AI safety work:
Mechinterp is progressing but continues to need feedback loops, either from benchmarks (I’m excited about people building on our paper InterpBench) or downstream tasks where mechinterp outperforms fine-tuning alone.
Most of the danger from AI comes from goal-directed agents and instrumental convergence. There is little research on this now because we don't have capable agents yet. In 1-3 years, foundation model agents will be good enough to study, and we need to be ready with the right questions and theoretical frameworks.
We still do not know enough about AI safety to make policy recommendations about specific techniques companies should apply.
Seems pretty false to me; ICML just rejected a bunch of the good submissions lol. I think that e.g. sparse autoencoders are a massive advance in the last year that unlocks a lot of exciting stuff.
I agree, there were some good papers, and mechinterp as a field is definitely more advanced. What I meant to say was that many of the mechinterp papers accepted to the conference weren’t very good.
(This is what I understood you to be saying)
Ah, gotcha. Yes, agreed. Mech interp peer review is generally garbage and does a bad job of filtering for quality (though I think it was reasonable enough at the workshop!)
What does ‘foundation model’ mean here?
Multimodal language models. We can already study narrow RL agents, but the intersection with alignment is not a hot area.