Specifically, I wanted the edit to be a clarification that you only have a <0.1% probability on spontaneous scheming ending the world.
Well, I have <0.1% on spontaneous scheming, period. I suspect Nora is similar and just misspoke in that comment.
If it’s spontaneous then yeah, I don’t expect it to happen ~ever really. I was mainly thinking about cases where people intentionally train models to scheme.
Specifically, I wanted the edit to be a clarification that you only have a <0.1% probability on spontaneous scheming ending the world.
Well, I have <0.1% on spontaneous scheming, period. I suspect Nora is similar and just misspoke in that comment.
If it’s spontaneous then yeah, I don’t expect it to happen ~ever really. I was mainly thinking about cases where people intentionally train models to scheme.