davidad comments on Eight Strategies for Tackling the Hard Part of the Alignment Problem

davidad 22 Jul 2023 18:02 UTC
LW: 2 AF: 1
0
I think formal verification belongs in the “requires knowing what failure looks like” category.

For example, in last year’s VNN competition (VNN-COMP 2022), some adversarial robustness properties were formally proven about VGG16. This requires white-box access to the weights, to be sure, but I don’t think it requires understanding “how failure happens”.
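
A toy sketch of that distinction, under loud assumptions: this is plain interval bound propagation on a hypothetical two-layer ReLU network with made-up weights, not the tooling used in the competition and nowhere near VGG16 scale. The verifier only needs the weights (white-box access) and a specification of what failure looks like (the predicted class flipping somewhere inside an L-infinity ball). At no point does it need an account of how a flip would come about. It also ignores floating-point rounding, which a real verifier has to handle soundly.

```python
def affine_bounds(W, b, lo, hi):
    """Propagate the box [lo, hi] through y = W x + b, returning a box on y."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, a, c in zip(row, lo, hi):
            # A non-negative weight maps lower bound to lower bound;
            # a negative weight swaps the roles of the two bounds.
            l += w * a if w >= 0 else w * c
            h += w * c if w >= 0 else w * a
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi


def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


def certify(W1, b1, W2, b2, x0, eps):
    """True if logit 0 provably exceeds logit 1 for every x with |x - x0|_inf <= eps.

    False only means "not proven": interval bounds are sound but incomplete.
    """
    lo = [v - eps for v in x0]
    hi = [v + eps for v in x0]
    lo, hi = relu_bounds(*affine_bounds(W1, b1, lo, hi))
    # Encode the failure spec as one extra affine row: margin = logit0 - logit1.
    margin_W = [[a - c for a, c in zip(W2[0], W2[1])]]
    margin_b = [b2[0] - b2[1]]
    margin_lo, _ = affine_bounds(margin_W, margin_b, lo, hi)
    return margin_lo[0] > 0


# Hypothetical weights, chosen only for illustration.
W1, b1 = [[1.0, -1.0], [0.5, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
x0 = [2.0, 0.0]

print(certify(W1, b1, W2, b2, x0, eps=0.1))  # True: robustness proven on this ball
print(certify(W1, b1, W2, b2, x0, eps=0.5))  # False: the bound is too loose to prove it
```

Nothing in `certify` inspects what the hidden units mean. The only problem-specific input beyond the weights is the margin row, which is the formalised statement of what failure looks like.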
- scasper 22 Jul 2023 19:11 UTC
  LW: 1 AF: 1
  2
  Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human’s comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.
  - davidad 22 Jul 2023 19:15 UTC
    LW: 2 AF: 1
    1
    Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.