Also, I’d love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries, with each query perhaps accompanied by a KYC ID. The company’s first line of defense would be robust refusal of harmful requests, and the second line would be automated detection of adversarial attacks. The company could also have a budget to spend on “human monitoring,” which would buy access to the ground-truth label of whether a given query is malicious, at a fixed price per query. I’d have to think about which elements would be most tractable for making research progress, but the fact that AI companies need to solve this problem anyway suggests that external researchers should perhaps work on it too.
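To make this concrete, here’s a minimal toy sketch of what such a simulation environment might look like. Everything here is hypothetical: the parameters, the stand-in detector, and the triage policy are all assumptions, and the query count is scaled down from 1 trillion so the toy actually runs.

```python
import random
from dataclasses import dataclass

# Hypothetical scaled-down parameters (the real setting would be ~1e12 queries
# with ~100 malicious ones; both are shrunk here for a runnable toy).
N_QUERIES = 1_000_000
N_MALICIOUS = 100
HUMAN_REVIEW_COST = 1.0     # fixed price per ground-truth label
MONITORING_BUDGET = 500.0   # total budget for human monitoring

@dataclass
class Query:
    kyc_id: int          # each query is tied to a known customer
    is_malicious: bool   # ground truth, hidden from the defender
    attack_score: float  # output of an (imperfect) automated detector

def make_dataset(rng: random.Random) -> list[Query]:
    """Mostly-benign traffic with a tiny malicious tail."""
    queries = []
    for i in range(N_QUERIES):
        malicious = i < N_MALICIOUS
        # Stand-in detector: malicious queries score higher on average,
        # but the distributions overlap (attacks are adversarial).
        score = rng.betavariate(5, 2) if malicious else rng.betavariate(2, 5)
        queries.append(Query(kyc_id=rng.randrange(10_000),
                             is_malicious=malicious, attack_score=score))
    rng.shuffle(queries)
    return queries

def run_defense(queries: list[Query], refusal_rate: float, budget: float,
                rng: random.Random) -> dict:
    """First line: robust refusal; second line: detector-triaged human review."""
    caught, missed, spent, reviews = 0, 0, 0.0, 0
    # Simple triage policy: review the highest-scoring queries first
    # until the human-monitoring budget runs out.
    for q in sorted(queries, key=lambda q: q.attack_score, reverse=True):
        if q.is_malicious and rng.random() < refusal_rate:
            caught += 1          # robust refusal blocked it outright
            continue
        if spent + HUMAN_REVIEW_COST <= budget:
            spent += HUMAN_REVIEW_COST
            reviews += 1
            if q.is_malicious:   # human review reveals the true label
                caught += 1
                continue
        if q.is_malicious:
            missed += 1
    return {"caught": caught, "missed": missed,
            "human_reviews": reviews, "budget_spent": spent}

rng = random.Random(0)
print(run_defense(make_dataset(rng), refusal_rate=0.5,
                  budget=MONITORING_BUDGET, rng=rng))
```

The interesting research questions would then live in the triage policy: how to spend the human-monitoring budget given detector scores, and how to aggregate evidence across queries sharing a KYC ID (which this sketch records but doesn’t yet exploit).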