I want to register a weak but nonzero prediction that Anthropic's interpretability publication, A Mathematical Framework for Transformer Circuits, will turn out to lead to large capabilities gains, and that in hindsight its publication will be regarded as a rather bad move.
Something like: we'll see capabilities-advancing papers citing it and using its framework to justify architecture improvements.
Agreed, though I don't think this is bad, nor that they did anything other than become the people who implemented what the zeitgeist demanded. It was the obvious next step; if they hadn't done it, someone who cared less about using it to make systems actually do what humans want would have. So the question is: are they going to release their work for others to use, or hoard it until someone less scrupulous releases their models? It looks like they're trying to keep it "in the family" so only corporations can use it, which is concerning.
If the push for human understandability hadn't happened, the next step might have been entirely automated sparsification, which doesn't necessarily produce anything humans can use to understand the model. Distillation into understandable models is an extremely powerful trajectory.
Not the same paper, but related: https://twitter.com/jamespayor/status/1634447672303304705