I think this is great overall.
One area I’d ideally prefer a clearer presentation/framing is “Safety/performance trade-offs”.
I agree that it’s better than “alignment tax”, but I think it shares one of the core downsides:
If we say “alignment tax” many people will conclude [“we can pay the tax and achieve alignment” and “the alignment tax isn’t infinite”].
If we say “Safety/performance trade-offs” many people will conclude [“we know how to make systems safe, so long as we’re willing to sacrifice performance” and “the performance sacrifice won’t imply any hard limit on capability”].
I’m not claiming that this is logically implied by “Safety/performance trade-offs”.
I am claiming it’s what most people will imagine by default.
I don’t think this is a problem for near-term LLM safety.
I do think it’s a problem if this way of thinking gets ingrained in those thinking about governance (most of whom won’t be reading the papers that contain all the caveats, details and clarifications).
I don’t have a pithy description that captures the same idea without being misleading.
What I’d want to convey is something like “[lower bound on risk] / performance trade-offs”.
Really interesting point!
I introduced this term in my slides, which included “paperweight” as an example of an “AI system” that maximizes safety.
I still think it’s an OK term, but I’ll keep thinking about this going forward and hope we can arrive at an even better one.