My current goal is one safety-targeted NeurIPS level paper every 4-6 months. It might take me a few months to gain experience to get there, but I’d be happy to be an author on something the quality of the tuned lens,Inference-Time Intervention, IOI or overoptimization papers 3x/year if I could get 30% of the Shapley value each time (meaning some of those are first authors).
On a slightly smaller scale I spent 3 weeks * 0.75 FTE on Catastrophic Regressional Goodhart (both the proofs and writeup) and Drake spent about the same time. We also spent 3 weeks on experimental work which didn’t go anywhere. I was disappointed at the time, but considering the good reception to the post and the lack of progress at my concurrent MIRI work it meets my bar in retrospect.
On an even smaller scale, this week I spent one day seeing if I could apply subnetwork probing to Leela Chess Zero and gave up before actually writing experiment code because I hit Python-C++ interoperability issues. This was fine but I would have been sad to spend 3 days the same way, because I should be able to run the whole experiment in 3 days. I think below this scale it’s hard to communicate my standards because it would be less than 1% of a meaningful research output, and it’s hard to tell 0.2% of a project from 0.5% of a project from the outside.
Edit: the last paragraph was planning fallacy, I got stuck a couple more times and the project should now take ~14 days.
My current goal is one safety-targeted NeurIPS level paper every 4-6 months. It might take me a few months to gain experience to get there, but I’d be happy to be an author on something the quality of the tuned lens, Inference-Time Intervention, IOI or overoptimization papers 3x/year if I could get 30% of the Shapley value each time (meaning some of those are first authors).
On a slightly smaller scale I spent 3 weeks * 0.75 FTE on Catastrophic Regressional Goodhart (both the proofs and writeup) and Drake spent about the same time. We also spent 3 weeks on experimental work which didn’t go anywhere. I was disappointed at the time, but considering the good reception to the post and the lack of progress at my concurrent MIRI work it meets my bar in retrospect.
On an even smaller scale, this week I spent one day seeing if I could apply subnetwork probing to Leela Chess Zero and gave up before actually writing experiment code because I hit Python-C++ interoperability issues. This was fine but I would have been sad to spend 3 days the same way, because I should be able to run the whole experiment in 3 days. I think below this scale it’s hard to communicate my standards because it would be less than 1% of a meaningful research output, and it’s hard to tell 0.2% of a project from 0.5% of a project from the outside.
Edit: the last paragraph was planning fallacy, I got stuck a couple more times and the project should now take ~14 days.