Hi!
I think I’m probably in a position pretty similar to where you were a few months to a year ago: I’m a CS grad (though sadly with no ML specialisation) working in industry who recently started reading a lot of mechanistic interpretability research and is starting to seriously consider pursuing a PhD in that area. I’m also looking at how I could get some initial research done in the meantime.
Could I DM you to maybe get some advice?
julius vidal
Karma: 1
>As originally conceived, this is sort of like a “dangerous capability” eval for steg.
I am actually just about to start building something very similar to this for the AISI’s evals bounty program.