One major capabilities hurdle that’s related to interpretability: The difference between manually “opening up” the model to analyze its weights, etc., and being able to literally ask the model questions about why it did certain things.
And it seems like one path to solving that is to have the AI be able to analyze its own workings, which also seems like a potential path to recursive self-improvement.