So I struggle to think of an example of a tool that is good for finding insidious misalignment bugs but not other kinds of bugs.
One example: a tool designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (see e.g. https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without), doesn’t seem like it would be useful for improving the reliability of self-driving cars. Self-driving cars likely aren’t misaligned in the sense that they could drive perfectly safely but choose not to; rather, they are unable to drive perfectly safely because some of their internal (learned) systems aren’t sufficiently robust.
Thanks for the comment. I’m inclined to disagree, though. The application was detecting deceptiveness, but the method was a form of contrastive probing, and one could use a similar approach for significantly different applications.
This feels kind of like a semantic disagreement to me. To ground it, it’s probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I’m uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.
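To make that distinction concrete, here is a minimal sketch of a CCS-style contrastive probe in PyTorch, assuming we already have hidden states for contrast pairs; the `LinearProbe` class, the `h_pos`/`h_neg` placeholder tensors, and the training loop are illustrative assumptions, not the exact setup from the linked post. The consistency and confidence terms are the generic part that any contrastive probe could reuse regardless of what is being probed for; probing specifically for truthfulness would amount to adding further constraints on top of this loss.

```python
import torch
import torch.nn as nn

# Sketch of a CCS-style contrastive probe (after the linked "Discovering
# Latent Knowledge" approach). Assumes we already have hidden states for
# contrast pairs, e.g. the same statement framed as true vs. false:
#   h_pos: activations for the "true"-framed inputs, shape (batch, hidden_dim)
#   h_neg: activations for the "false"-framed inputs, shape (batch, hidden_dim)

class LinearProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Probability that the probed property holds for each example.
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Generic contrastive constraints, independent of the probed property:
    # the two members of a contrast pair should get consistent answers
    # (p_pos ≈ 1 - p_neg) and the probe should not collapse to 0.5 everywhere.
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    return consistency + confidence
    # Truthfulness-specific refinements would add extra constraint terms here.

# Hypothetical usage with random stand-in activations.
hidden_dim = 768
probe = LinearProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

h_pos = torch.randn(32, hidden_dim)
h_neg = torch.randn(32, hidden_dim)

for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(h_pos), probe(h_neg))
    loss.backward()
    opt.step()
```

On this framing, work that makes the unsupervised loss above more robust would plausibly transfer to probing other properties in other systems, whereas work that adds truthfulness-specific constraint terms would be more narrowly useful.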
I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I’m all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed.