First post, hello! I can only hope that frontier companies are thinking about this as deeply and not focusing their efforts on safety-washing. I am no expert in biology, so excuse me if I make some basic mistakes, but I am curious about this topic, and I have some concerns and ideas about the practical implementation of this sort of system. Imagine I am a biology student, and let's say I'm studying something incredibly suspicious, like engineering virus DNA to create vaccines. It would be frustrating to pay for a month of this service, accidentally trip the bioterror-risk filter, and get temporarily banned. I would instead choose a competing frontier bio-model with lower safety guardrails, and I might even pay more for that service if I knew I had less of a chance of being banned.
Thus, I argue that companies developing biology-focused models have a perverse incentive to ignore risks in order to capture a wider market. Sure, they could limit their product to industry experts, but that leaves room for another company to market to graduates and undergraduates in the field, which is exactly where you find the least trustworthy people who could plausibly build bioweapons. Why would users ever choose a safer system that occasionally bans them?
Actually, I thought of a case in which they might. Again, I'm thinking of that hypothetical biology student. I've finished my thesis thanks in part to the biology LLM that has partnered with my university and works closely with the faculty. My fellow students often trip up the system, and when that happens we sometimes get frustrated, but most of the time we beam like we've bought a lottery ticket. That's because, as the message on the "you are temporarily banned from the API" screen states, the company is using the university to help outsource its red-teaming and testing, so if you cause a major bug or jailbreak to be discovered, you can win real money (perhaps by provoking the model to exceed some risk threshold of behavior, or by self-reporting a jailbreak the system doesn't flag automatically).
More commonly, subscriptions are simply refunded for the month if you helped discover a novel bug, and because the temporary bans are so short, some students have taken to trying out creative jailbreaks when they don't need the model anyway. Trolling and reused methods are not rewarded, only banned. Now that I've finished my thesis, some friends and I are going to spend the weekend trying to jailbreak the system and earn our refunds, and the prize for causing the model to exceed a 50% effective bioweapon risk is pretty tempting; nobody has won it yet. There's money and bragging rights at stake!
This is just an example scenario, but I'll put it out there that biology students seem like good candidates for testing these kinds of guardrails; I would assume many of them are creative and would be willing to team up on a hard problem in order to avoid paying for a subscription. They would agree to strict scrutiny. And getting banned would only slow down thesis papers, not crucial research in medicine. This could be a fairly cheap way for a budding biology-focused AI company to test its models against the riskiest group of people, while keeping all the freedom to disable the system entirely for safety, as you've proposed.