Redwood Research and Constellation
Nate Thomas
Thanks, Neel! It should be fixed now.
Note that it’s unsurprising that a different model categorizes this correctly, because the failure was generated by attacking the particular model we were working with. The relevant question is “given a model, how easy is it to find a failure by attacking that model using our rewriting tools?”
To anyone reading this who wants to work on or discuss FHI-flavored work: Consider applying to Constellation’s programs (the deadline for some of them is today!), which include salaried positions for researchers.