One that checks if individual nodes in the graph are aligned and prunes any that are not
Has “draw the rest of the owl” vibes to me.
If your plan to align AI includes using an AI that can reliably check whether actions are aligned, almost the entirety of the problem passes down to specifying that component.
As I said in my post, I’m not suggesting I have solved alignment. I’m simply trying to solve specific problems in the alignment space. Specifically, I’m trying to solve two things here:
1) Transparency. That’s not to say that you can ever know what a NN really is optimizing for (due to internal optimizers), but you can get them to produce a verifiable output. How you verify the output is a problem in itself, but the first step must be getting something you can verify.
2) Preventing training pressure from creating a system that trends its failure modes to the most extreme outcomes. There are questions on whether this can be done without just creating an expensive brick, and this is what I’m currently investigating. I believe it is possible and scalable, but I have no formal proof of such, and agree it is a valid concern with this approach.