I think incentives. Based on my recent reading of ‘The Chaos Machine’ by Max Fisher, I think it’s closely linked to continually increasing engagement driving profit. Addictiveness unfortunately leads to more engagement, which in turn leads to profit. Emotive content (clickbait-style, extreme things) also increases engagement.
Tools and moderation processes might be expensive on their own, but I think it’s when they start to challenge the underlying business model of ‘More engagement = more profit’ that the companies are in a more uncomfortable position.
Thanks for putting this out there. I’m on the Catalyze program at LISA in London and have been thinking a lot about AI control (for framing, my background is risk management, so I’m coming at it from that angle).
‘The high-level change I propose is to define a control evaluation as a method of bounding risk from worst-case deployment behavior generally – rather than scheming models in particular.’
I’m on board with this. If our goal is to bound the risk from worst-case deployment behaviour, I think the broader the better: enumerating different ‘untrusted model categories’ will better inform our understanding of harmful behaviours. If we have post-deployment monitoring and identify new patterns of harmful behaviour, we may even identify new untrusted model categories this way.
For post-deployment monitoring (let’s say the control eval highlighted some behaviour we wanted ongoing control over), I think the cause (i.e. the untrusted model category driving the behaviour) is less important than the observed behaviour itself. There could even be multiple untrusted model categories applying at once, e.g. scheming and jailbroken, so you could have a single monitor catch all of them. What are your thoughts on this?
‘control is a broad approach to mitigating and evaluating safety given ’pessimistic assumptions’ about model deployment behavior.’
Rather than ‘pessimistic assumptions’, I would describe these as ‘unknowns’, the probability of which would range somewhere between 0 and 1 based on our confidence in the representativeness of the evaluations.
I think to make control evaluations practically useful, we need to add more nuance to how we define trusted and untrusted. Coming from risk management (in Fintech and FS), I’m approaching it from the view that a model can only ever be P(Trusted)=1 (i.e. fully trusted) when bounded by certain deployment constraints. So a model is trusted to do something rather than generally trusted, e.g. it may be trusted not to write backdoored code, but may not be trusted to keep sensitive data safe if it’s in a setting where users can jailbreak it.
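To make that framing concrete, here’s a rough sketch of how trust could be recorded as scoped claims rather than a single label (all names and numbers are mine and purely illustrative, not something from the post):

```python
from dataclasses import dataclass

# Illustrative structure: trust is a claim about a specific behaviour
# under specific deployment constraints, not a global property of the model.
@dataclass(frozen=True)
class TrustClaim:
    model: str
    behaviour: str                   # e.g. "does not write backdoored code"
    deployment_constraints: tuple    # e.g. ("internal users only", "no tool access")
    p_trusted: float                 # confidence this claim holds, in [0, 1]

claims = [
    TrustClaim("model-A", "does not write backdoored code",
               ("code review in place",), p_trusted=1.0),
    TrustClaim("model-A", "keeps sensitive data safe",
               ("users can submit arbitrary prompts",), p_trusted=0.9),
]
```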
This could also help justify claims around a ‘rarely misbehaving monitor being safe because its failures will likely be uncorrelated with attacks executed by the policy.’ Let’s say you assign a probability of 0.001 to such a failure, which might be fine for a use case of coding a website but not OK for coding a banking platform. You could layer more monitors to reduce this probability, effectively getting the risk within appetite for that deployment setting, e.g. down to 0.000001 with a second independent monitor.
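For the layering point, a minimal sketch of the arithmetic I have in mind, assuming monitor failures are independent (a strong assumption that would itself need evidence):

```python
# Rough sketch: if each monitor misses an attack with probability p,
# and failures are independent, stacking n monitors gives a residual risk of p**n.
def residual_risk(p_single_miss: float, n_monitors: int) -> float:
    return p_single_miss ** n_monitors

print(residual_risk(0.001, 1))  # 0.001  - maybe within appetite for a website
print(residual_risk(0.001, 2))  # 1e-06  - closer to appetite for a banking platform
```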