Language is a natural interface for humans, and it seems feasible to specify a robust constitution in natural language?
Constitutional AI seems plausibly feasible, and like it might basically just work?
That said, I still want more ambitious mechanistic interpretability of LLMs, and to solve ELK for tighter safety guarantees, but I think we’re in a much better position now than I thought in 2017.
Contrary to many LWers, I think GPT-3 was an amazing development for AI existential safety.
The foundation models paradigm is not only inherently safer than bespoke RL against physical environments; it also means the classic “complexity and fragility of value” problem is basically solved for free.