Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post. Two reasons I’m worried evaluators might fail:
[...]
The world might change in ways that enable new threat models after camelidAI is open-sourced. For example, suppose that camelidAI + GPT-SoTA isn’t dangerous, but camelidAI + GPT-(SoTA+1) (the GPT-SoTA successor system) is dangerous. If GPT-(SoTA+1) comes out a few months after camelidAI is open-sourced, this seems like bad news.
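To make the time-dependence concrete, here is a toy sketch; the `is_dangerous` predicate and the specific combination it flags are hypothetical stand-ins for a real dangerous-capability evaluation, not claims about actual systems.

```python
# Toy illustration: whether open-sourcing camelidAI was "safe" depends on what
# it can later be combined with, which changes after the weights are out.

def is_dangerous(open_model: str, strongest_helper: str) -> bool:
    """Stand-in for a dangerous-capability eval of the combined system.

    The single combination flagged below is an assumption of the scenario,
    not a claim about any real model.
    """
    assumed_dangerous_combos = {("camelidAI", "GPT-(SoTA+1)")}
    return (open_model, strongest_helper) in assumed_dangerous_combos

# At release time, the evaluation passes:
assert not is_dangerous("camelidAI", "GPT-SoTA")

# A few months later the same, now-irrevocable weights fail the same check,
# because a stronger successor exists to pair them with:
assert is_dangerous("camelidAI", "GPT-(SoTA+1)")
```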
My main concern here is that there will be technical advances out in the world, in things like finetuning or scaffolding, that make camelidAI sufficiently capable to be a concern. This seems quite unlikely for current open-source models (they are far from sufficiently capable), but it becomes more likely as open-source models get more powerful. E.g., it doesn’t seem that unlikely to me that advances in finetuning, dataset construction, and scaffolding would be sufficient for GPT-4 to make lots of money doing cybercrime online (this threat model isn’t very existentially concerning, but the stretch from here to existential concerns isn’t that huge).
It’s hard for me to be very confident (>99%) that there won’t be substantial, jumpy improvements along these lines. Since there are probably larger threats than open source, maybe we should just eat the small fraction of worlds (maybe 1-5%) where a sudden jump like this happens (it probably wouldn’t be existential even conditional on a large jump). I’m sympathetic to not worrying much about 1/1000 or 1/100 doom from open sourcing when we probably have bigger problems...
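To spell out the arithmetic behind that range (the numbers below are placeholders chosen only to match the figures in the text, not estimates I’m attributing to anyone):

```python
# Back-of-the-envelope: how "1-5% of worlds with a sudden jump" combined with
# "probably not existential even then" lands around 1/1000 to 1/100 doom from
# open sourcing. Both inputs are hypothetical placeholders.

p_sudden_jump = 0.03        # chance of a substantial jumpy improvement (text: 1-5%)
p_doom_given_jump = 0.15    # chance it turns out existential, conditional on the jump

p_doom_from_open_sourcing = p_sudden_jump * p_doom_given_jump
print(f"{p_doom_from_open_sourcing:.4f}")  # 0.0045, i.e. between 1/1000 and 1/100
```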
Let’s also call the most capable proprietary AI system GPT-SoTA, which we can assume is well-behaved. I’m imagining that GPT-SoTA is significantly more capable than camelidAI (and, in particular, is superhuman in most domains). In principle, the protocol below will still make sense if GPT-SoTA is worse than camelidAI (because open source systems have surpassed proprietary ones), but it will degenerate to something like “ban open source AI systems once they are capable of causing significant novel harms which they can’t also reliably mitigate.”
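Here is a rough sketch of that rule and how it degenerates; the evaluation functions are hypothetical stubs standing in for real capability and mitigation evals, and this is my gloss on the quoted condition rather than a spec from the post.

```python
# Hypothetical sketch of the release rule described above; the stubs below are
# placeholders for real dangerous-capability and mitigation evaluations.

def capability(model: str) -> float:
    """Stub: overall capability score from some benchmark suite (placeholder values)."""
    return {"GPT-SoTA": 0.9, "camelidAI": 0.6}.get(model, 0.0)

def can_cause_novel_harms(model: str) -> bool:
    """Stub: dangerous-capability eval of the open-weights model."""
    return False  # placeholder

def can_reliably_mitigate(defender: str, attacker: str) -> bool:
    """Stub: can `defender` reliably counter harms enabled by `attacker`?"""
    return True  # placeholder

def open_sourcing_ok(open_model: str, gpt_sota: str) -> bool:
    if capability(gpt_sota) > capability(open_model):
        # Normal regime: a stronger, well-behaved proprietary model exists
        # that defenders can use against open_model-enabled harms.
        return can_reliably_mitigate(defender=gpt_sota, attacker=open_model)
    # Degenerate regime: open source has caught up, so the rule collapses to
    # "don't release once the model can cause significant novel harms that
    # it can't also reliably mitigate."
    return (not can_cause_novel_harms(open_model)) or can_reliably_mitigate(
        defender=open_model, attacker=open_model
    )

print(open_sourcing_ok("camelidAI", "GPT-SoTA"))  # True under the placeholder stubs
```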
I think a reasonable amount of the concern is going to come from GPT-SoTA stalling out or pausing due to alignment concerns. Then, if open-source models continue to advance (either via improvements on top of base models, as I discussed earlier, or via further releases which can’t be stopped), we might be in trouble. To be clear, I don’t think you were assuming anywhere that GPT-SoTA will necessarily keep advancing, but it seems relevant to note this concern.
We’re starting to have enough experience with the size of improvements produced by fine-tuning, scaffolding, prompting techniques, RAG, etc. to be able to guesstimate the plausible size of further improvements (and the amount of effort involved), so we can try to leave an appropriate safety margin for them. That doesn’t rule out the possibility of something out-of-distribution coming along, but it does at least reduce it.
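As a picture of what that guesstimate-plus-margin could look like (every number below is an invented placeholder, not a measurement):

```python
# Illustrative sketch of "leave a safety margin for post-release enhancements":
# add up rough, pessimistic estimates of how much each known technique tends to
# improve a released model, then require that the model clear dangerous-capability
# evals even after that much uplift. All numbers are hypothetical placeholders.

assumed_uplift = {               # guessed score uplift on some capability benchmark
    "fine-tuning": 0.05,
    "scaffolding/agents": 0.08,
    "prompting techniques": 0.03,
    "RAG / tool use": 0.04,
}
out_of_distribution_buffer = 0.05   # extra slack for techniques nobody has found yet

safety_margin = sum(assumed_uplift.values()) + out_of_distribution_buffer

danger_threshold = 0.80             # eval score above which release is off the table
measured_score = 0.50               # the model's score at release time

release_ok = measured_score + safety_margin < danger_threshold
print(safety_margin, release_ok)    # ~0.25, True under these placeholder numbers
```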