First: why do software engineers use worst-case reasoning?
A joking answer would be “the users are adversaries”. For most software this isn’t literally true; the users don’t want to break the software. But users are optimizing for things, and optimization in general tends to find corner cases. (In linear programming, for instance, almost all objectives will be maximized at a literal corner of the set allowed by the constraints.) This is sort of like “being optimized against”, but it emphasizes that the optimizer need not be “adversarial” in the intuitive sense of the word in order to have that effect.
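The linear-programming claim can be checked directly. Here's a minimal sketch (my illustration, not from the original text): take a linear objective over a simple convex feasible region and confirm that no interior point beats the best corner.

```python
import random

# Feasible region: the unit square, 0 <= x <= 1, 0 <= y <= 1.
# Its corners (vertices) are:
vertices = [(0, 0), (1, 0), (0, 1), (1, 1)]

def objective(x, y):
    # An arbitrary linear objective; any linear function works the same way.
    return 2 * x + 3 * y

# Best value attained at a corner of the feasible region.
vertex_max = max(objective(x, y) for x, y in vertices)

# Spot-check: no randomly sampled feasible point exceeds the best corner.
for _ in range(10_000):
    x, y = random.random(), random.random()
    assert objective(x, y) <= vertex_max

print(vertex_max)
```

The general fact is that a linear function on a convex polytope is maximized at a vertex, which is why users "optimizing for things" so often end up in corner cases.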
Users do a lot of different things, and “corner cases” tend to come up a lot more often than a naive analysis might think. If a user is weird in one way, they’re more likely to be weird in another way too. This is sort of like “the space contains a high proportion of bad things”, but with more emphasis on the points in the space being weighted in ways which upweight weirdness more than a naive analysis would suggest.
Software engineers often want to provide simple, predictable APIs. Error cases (especially unexpected error cases) make APIs more complex.
In software, we tend to have a whole tech stack. Even if each component of the stack fails only rarely, overall failure can still be extremely common if there are enough pieces, any one of which can break the whole thing. (I worked at a mortgage startup where this was a big problem—we used a dozen external APIs which were each fine 95+% of the time, but that still meant our app was down very frequently overall.) So, we need each individual component to be very highly reliable.
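The arithmetic behind the mortgage-startup example is worth making explicit. A sketch, using the numbers from the anecdote (a dozen APIs at ~95% uptime, assumed independent, any one failure taking down the app):

```python
# If any one of n independent components failing breaks the whole app,
# overall uptime is the product of the individual component uptimes.
n_components = 12
component_uptime = 0.95

overall_uptime = component_uptime ** n_components
print(f"{overall_uptime:.1%}")  # roughly 54% uptime
```

Twelve individually-fine components compose into an app that's down almost half the time, which is why each piece has to be pushed to very high reliability.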
And one more, generated by thinking about some of my own use-cases:
Unknown unknowns. Worst-case reasoning forces people to handle every possible failure mode, including the ones they didn’t think to enumerate; a worst-case guarantee rules out unknown unknowns by construction.
These all carry over to alignment pretty straightforwardly.