Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:
> Looking at behaviour is conceptually straightforward, and valuable, and being done
I agree with Apollo Research that evals aren’t really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but so could things like building up experience, or auditing models under different schemes and comparing the results (a toy version of that comparison is sketched below).
Similarly, a lot of work with Model Organisms of Misalignment requires careful thought to get right.
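To make the comparison idea concrete, here is a minimal sketch, assuming two hypothetical auditing schemes that each return a pass/fail verdict per model (the schemes and the model pool are stand-ins, not any real API): run both over the same models and measure chance-corrected agreement. Low agreement is evidence that at least one scheme is tracking vibes rather than the underlying property.

```python
# Minimal sketch: compare two hypothetical auditing schemes on the same models.
# `scheme_a` / `scheme_b` and the model pool are illustrative stand-ins.
from typing import Callable, List

def cohens_kappa(a: List[bool], b: List[bool]) -> float:
    """Chance-corrected agreement between two lists of binary verdicts."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def audit_agreement(models: list,
                    scheme_a: Callable[[object], bool],
                    scheme_b: Callable[[object], bool]) -> float:
    """Run both schemes over the same model pool and report their kappa."""
    return cohens_kappa([scheme_a(m) for m in models],
                        [scheme_b(m) for m in models])

# Dummy usage: two "schemes" that flag models by unrelated heuristics.
models = list(range(10))
print(audit_agreement(models, lambda m: m % 2 == 0, lambda m: m < 5))  # 0.2
```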
> I agree with Apollo Research that evals aren’t really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but so could things like building up experience, or auditing models under different schemes and comparing the results.
>
> Similarly, a lot of work with Model Organisms of Misalignment requires careful thought to get right.
When I wrote that, I wasn’t thinking so much about evals / model organisms as stuff like:
- putting a bunch of agents in a simulated world and seeing how they interact (a toy version is sketched after this list)
- weak-to-strong / easy-to-hard generalization (also sketched below)
- basically stuff along the lines of “when you put agents in X situation, they tend to do Y thing”, rather than trying to understand latent causes / capabilities
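To illustrate the first bullet, here is a toy sketch of purely behavioural observation, with made-up agents and payoffs (nothing here corresponds to a real benchmark): drop simple harvesting policies into a shared resource pool and tally how often the resource collapses, without modelling why.

```python
# Toy behavioural observation: put simple agents in a shared-resource
# situation (X) and measure how often collapse (Y) emerges. All agents and
# payoffs are made up for illustration.
import random
from collections import Counter

def greedy_agent(stock: float) -> float:
    """Takes a large, noisy share of whatever is left."""
    return stock * random.uniform(0.4, 0.6)

def cautious_agent(stock: float) -> float:
    """Takes a small, capped share."""
    return min(1.0, stock * random.uniform(0.05, 0.15))

def run_episode(agents, stock=10.0, steps=20) -> str:
    """One rollout: agents harvest in turn, then the stock regrows 5%."""
    for _ in range(steps):
        for agent in agents:
            stock = max(0.0, stock - agent(stock))
        stock *= 1.05
    return "collapsed" if stock < 1.0 else "sustained"

random.seed(0)
population = [greedy_agent, greedy_agent, cautious_agent]
tally = Counter(run_episode(population) for _ in range(100))
print(tally)  # raw "in situation X, outcome Y" frequencies, no latent-cause claims
```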
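And for the second bullet, a minimal sketch of the usual weak-to-strong setup, using toy scikit-learn models as stand-ins for a weak supervisor and a strong student: fit the weak model on ground truth, train the strong model only on the weak model’s labels, and check how much of the strong ceiling it recovers.

```python
# Toy weak-to-strong generalization: does a strong student trained only on a
# weak supervisor's labels outperform that supervisor? Models and data are
# illustrative stand-ins, not the original paper's setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=15,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak supervisor: sees only 2 features, so its labels are noisy.
weak = LogisticRegression().fit(X_train[:, :2], y_train)
weak_labels = weak.predict(X_train[:, :2])

# Strong student: sees all features, but trains on weak labels, not ground truth.
student = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)

# Ceiling: the same strong model trained directly on ground truth.
ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

for name, model, feats in [("weak supervisor", weak, X_test[:, :2]),
                           ("weak-to-strong student", student, X_test),
                           ("strong ceiling", ceiling, X_test)]:
    print(f"{name}: {model.score(feats, y_test):.3f}")
```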