I think this is great overall.
One area I’d ideally prefer a clearer presentation/framing is “Safety/performance trade-offs”.
I agree that it’s better than “alignment tax”, but I think it shares one of the core downsides:
If we say “alignment tax” many people will conclude [“we can pay the tax and achieve alignment” and “the alignment tax isn’t infinite”].
If we say “Safety/performance trade-offs” many people will conclude [“we know how to make systems safe, so long as we’re willing to sacrifice performance” and “the performance sacrifice won’t imply any hard limit on capability”].
I’m not claiming that this is logically implied by “Safety/performance trade-offs”.
I am claiming it’s what most people will imagine by default.
I don’t think this is a problem for near-term LLM safety.
I do think it’s a problem if this way of thinking gets ingrained in those thinking about governance (most of whom won’t be reading the papers that contain all the caveats, details and clarifications).
I don’t have a pithy description that captures the same idea without being misleading.
What I’d want to convey is something like “[lower bound on risk] / performance trade-offs”.
Really interesting point!
I introduced this term in my slides, which included “paperweight” as an example of an “AI system” that maximizes safety.
I still think it’s an OK term, but I’ll keep thinking about this going forward and hope we can arrive at an even better one.