[Question] What’s the theory of impact for activation vectors?

Activation vectors are really, really cool, but what is the theory of impact for this work?

  • Is the hope that activation vectors will give us perfect control over a model, so that it does exactly what we want?

  • Is the hope that a new technique that builds upon activation vectors lets us do that instead?

  • Is the hope that this technique lets us marginally decrease the risks of powerful models as a Hail Mary attempt, or perhaps buys us more time to solve the problem?

  • Is the hope just that learning more about how neural networks work will allow us to theorize better about how to control them?