As I said in my post, I’m not suggesting I have solved alignment; I’m trying to solve specific problems in the alignment space. Here, I’m targeting two things:
1) Transparency. That’s not to say you can ever know what an NN is really optimizing for (due to internal optimizers), but you can get it to produce a verifiable output. How you verify that output is a problem in itself, but the first step must be getting something you can verify at all.
2) Preventing training pressure from creating a system that pushes its failure modes toward the most extreme outcomes. There are open questions about whether this can be done without just producing an expensive brick, and that is what I’m currently investigating. I believe it is possible and scalable, but I have no formal proof of that, and I agree it is a valid concern with this approach.