After some thought, I think making the dataset public is probably a slight net negative, tactically speaking. Benchmarks have sometimes driven progress on the measured ability. Even though monomaniacally trying to get a high score on SAD is safe right now, I don’t really want there to be standard SAD-improving finetuning procedures that can be dropped into any future model. My intuition is that this outweighs benefits from people being able to use your dataset for its original purpose without needing to talk to you, but I’m pretty uncertain.
We thought about this quite a lot, and decided to make the dataset almost entirely public.
It’s not clear to us who would monomaniacally try to maximise SAD score. It’s a dangerous capabilities eval. What we were more worried about was people training for a low SAD score in order to make their model seem safer, with such training potentially overfitting to the benchmark and not reducing actual situational awareness by as much as claimed.
It’s also unclear what sharing policy we could enforce that would mitigate these concerns while preserving the benefits. For example, we would want top labs to use SAD to measure SA in their models (a lot of the theory of change runs through this). But then we’re already giving the benchmark to the top labs, and they’re the ones doing most of the capabilities work.
More generally, if we don’t have good evals, we are flying blind and don’t know what LLMs can do. If the cost of having a good understanding of dangerous model capabilities and their prerequisites is that, in theory, someone might be slightly helped in giving models a specific capability (especially when that capability is already emerging by default, and there are very limited reasons for anyone to specifically want to boost it), then I’m happy to pay that cost. This is especially the case since SAD lets you measure a cluster of dangerous capability prerequisites, and therefore, for example, test things like out-of-context reasoning, unlearning techniques, or activation steering techniques on something directly relevant to safety.
Another concern we’ve had is the dataset leaking onto the public internet and being accidentally included in training data. We’ve taken many steps to mitigate this risk. We’ve also kept 20% of the SAD-influence task private, which will hopefully let us detect at least obvious forms of memorisation of SAD (whether through dataset leakage or deliberate fine-tuning).
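To illustrate the kind of contamination check a private hold-out enables, here is a minimal hypothetical sketch (not the actual SAD tooling; the function names, split handling, and the 10-point gap threshold are all assumptions for illustration):

```python
# Hypothetical sketch of a memorisation check using a private held-out split.
# Assumes per-example correctness (0/1) for a model on both the public
# portion of the SAD-influence task and the private 20% hold-out.
# The gap threshold below is an illustrative choice, not anything from SAD.

def accuracy(results: list[int]) -> float:
    """Fraction of examples answered correctly."""
    return sum(results) / len(results) if results else 0.0

def flag_possible_memorisation(
    public_results: list[int],
    private_results: list[int],
    gap_threshold: float = 0.10,
) -> bool:
    """Flag a model if it does much better on the public split than on the
    private split, which is consistent with (but not proof of) the public
    questions having leaked into training data or been fine-tuned on."""
    gap = accuracy(public_results) - accuracy(private_results)
    return gap > gap_threshold

# Example: a model scoring 92% on public questions but 70% on private ones
# would be flagged for closer inspection.
print(flag_possible_memorisation([1] * 92 + [0] * 8, [1] * 70 + [0] * 30))  # True
```

A large public-versus-private gap would not by itself prove memorisation, but it would be a cheap signal that warrants a closer look.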
This seems reasonable; I’m glad you’ve put some thought into this. I think there are situations where training for situational awareness will seem like a good idea to people. It’s only a dangerous capability because it’s so instrumentally useful for navigating the real world, after all. But maybe this was going to be concentrated in top labs anyway.
That github link yields a 404. Is it just an issue with the link itself, or did something change about the dataset being public?
Sorry about that, fixed now