We already can’t make even MNIST digit recognizers secure against adversarial attacks; convnets in general remain vulnerable to adversarial examples. We don’t know how to prevent prompt injection. RL agents that play Go at superhuman levels are vulnerable to simple strategies that exploit gaps in their cognition.
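For concreteness, here is a minimal sketch of the classic fast gradient sign method (FGSM) attack against a toy MNIST convnet; the model, weights file, and epsilon value are illustrative assumptions, not part of the original point, but the mechanism is the standard one:

```python
# Hypothetical sketch: FGSM (fast gradient sign method) against a toy MNIST convnet.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallConvNet(nn.Module):
    """Toy MNIST classifier; any undefended trained convnet behaves similarly here."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return self.fc(x.flatten(1))

def fgsm_attack(model, image, label, epsilon=0.25):
    """Perturb `image` by epsilon in the direction of the sign of the loss gradient.
    On undefended MNIST models, small epsilon values often flip the prediction."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Usage (assumes a trained model and a correctly classified digit x with label y):
# model = SmallConvNet(); model.load_state_dict(torch.load("mnist.pt")); model.eval()
# adv = fgsm_attack(model, x, y)
# model(adv).argmax(1)  # frequently no longer equals y
```

The point of the sketch is just how little machinery is required: one gradient step against the loss is routinely enough to fool a classifier that is near-perfect on clean inputs.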
None of those things are examples of misalignment except arguably prompt injection, which seems like it’s being solved by OpenAI with ordinary engineering.