it doesn’t seem like an accident to me that trying to understand neural networks pushes towards capability improvement. I really believe that all safety techniques, with no exceptions even in principle, are also capability techniques. everyone talks about an “alignment tax”, but shouldn’t we instead be talking about removing spurious anti-capability? a deceptively aligned submodule spends capacity modeling its training process instead of the task, so it is not capable, it is anti-capable!