Yeah, I’m particularly interested in scalable oversight over long-horizon tasks and chain-of-thought faithfulness. I’d probably be pretty open to a wide range of safety-relevant topics though.
In general, what gets me most excited about AI research is trying to come up with the perfect training scheme to incentivize the AI to learn what you want it to learn; things like HCH, Debate, and the ELK contest were really cool to me. So I'm a bit less interested in areas like mechanistic interpretability or very theoretical math.
I'd be somewhat surprised to hear about something like this happening, but not that surprised. In this case, though, it seems pretty clear that o1 is correcting its own mistakes in a way that past GPTs essentially never did, if you look at the CoT examples in the announcement (e.g. the "Cipher" example).