Ah, I see. I still think it’s worth thinking about how developers might try to pass this test (and others) adversarially, in order to deploy a model that they themselves are confident is harmful in at least some ways.
I think such intentional deployments are much less likely to be totally catastrophic than accidents, but they’re still risky, and not totally implausible (they don’t necessarily require extreme malice or misanthropy on the part of developers, just a misguided or negligent view of safety concerns, perhaps driven by motivated reasoning about career or profit potential).
That’s right. This test isn’t meant to cover “and it’s safe”; it’s meant to cover “and it’s predictable.” I meant to cover this in: