You forgot to include a sixth counterargument: you might successfully accomplish everything you set out to do, producing dozens of examples of misalignment, but as soon as you present them, everyone working on capabilities excuses them away as being “not real misalignment” for some reason or another.
I have seen more “toy examples” of misalignment than I can count (e.g. goal misgeneralization in the coin run example, deception here, and the not-so-toy example of GPT-4 failing badly as soon as it was deployed out of distribution, where the only things needed to break it were a less-than-perfect prompt and giving it the name Sydney). We’ve successfully shown that AIs can be misaligned in several ways we predicted ahead of time from theory. Nobody cares, and nobody has used this information to advance alignment research. At this point I’ve concluded that AI companies, even the ones claiming otherwise, will not care until somebody dies.