Further, suppose that the AI only succeeds on 1 out of N escape attempts, and that whenever an attempt fails, our defenses catch the AI. Then we have an (N-1)/N chance of catching the AI before it escapes. More generally, the situation is high-stakes for both us and the AI.
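Spelling out the arithmetic under that model (the 1-in-N success rate is the assumption doing the work): the AI escapes only if its very first attempt is the one that succeeds, so

```latex
P(\text{catch before escape}) = 1 - P(\text{first attempt succeeds}) = 1 - \frac{1}{N} = \frac{N-1}{N}.
```

For example, N = 10 gives a 90% chance of catching the AI.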
What do you think of honeypot strategies: security vulnerabilities that have silicon-level monitoring? For example, if certain IOMMU bits get edited in a way that lets the AI's container reach a wider address range (to read its own weights, etc.), a "silent alarm" fires. The alarm is handled by a separate computer hosted on the same die, plugged into separate Ethernet ports and running off private on-die RAM. (Called integrated lights-out, this has existed for years.)
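A minimal sketch of what that out-of-band watchdog could look like, assuming a hypothetical sideband interface for reading IOMMU mapping state from the management controller (every name below is made up for illustration, not a real platform API):

```python
# Sketch of a watchdog running on the separate management controller
# (iLO/BMC-style), not on the host OS the AI's container can see.
# Every interface below is hypothetical -- the real thing is platform-specific.

import time

# Assumed physical address range holding the model weights.
WEIGHTS_REGION = range(0x4000_0000, 0x8000_0000)

def read_iommu_mappings(container_id: str) -> set[int]:
    """Assumed helper: physical pages currently mapped into the container's
    IOMMU domain, read through the controller's sideband access to the chip."""
    raise NotImplementedError

def raise_silent_alarm(message: str) -> None:
    """Assumed helper: report over the controller's private Ethernet port,
    leaving no trace in anything the host (and hence the AI) can read."""
    raise NotImplementedError

def watchdog_loop(container_id: str, poll_seconds: float = 0.5) -> None:
    while True:
        leaked = [page for page in read_iommu_mappings(container_id)
                  if page in WEIGHTS_REGION]
        if leaked:
            # Don't block the access; just report it out of band so the trap
            # stays invisible from inside the container.
            raise_silent_alarm(
                f"container {container_id} mapped weight pages {leaked[:4]}")
        time.sleep(poll_seconds)
```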
The idea is to leave lots of trap vulnerabilities at key bottlenecks in your threat model. (Depending on the computer architecture, it may be impossible to read the weights without either getting that address range mapped or injecting executable code into messages to the hypervisor or root OS that owns that memory.)
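And a sketch of the "trap at the bottleneck" idea on the hypervisor side: a decoy region that looks exploitable but that no legitimate guest request ever touches, where the request is allowed to appear to succeed while quietly tripping the alarm. Again, every name is hypothetical; this reuses raise_silent_alarm from the sketch above, and do_real_mapping stands in for whatever the real mapping path is:

```python
# Sketch of a tripwire check in front of the hypervisor's map-memory handler.
# DECOY_REGIONS are pure bait; nothing legitimate ever asks for them.

DECOY_REGIONS = [range(0x9000_0000, 0x9100_0000)]  # assumed bait range

def do_real_mapping(guest_id: str, phys_addr: int, length: int):
    """Assumed stand-in for the hypervisor's existing mapping path."""
    raise NotImplementedError

def handle_map_request(guest_id: str, phys_addr: int, length: int):
    for decoy in DECOY_REGIONS:
        if phys_addr in decoy or (phys_addr + length - 1) in decoy:
            raise_silent_alarm(
                f"guest {guest_id} asked to map decoy range at {hex(phys_addr)}")
            # Deliberately fall through: the request can be wired to dummy
            # pages so the guest can't tell it hit a trap.
            break
    return do_real_mapping(guest_id, phys_addr, length)
```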
For a prison, this would be detectors that trip if a prisoner gets between the fences. The assumption is that no matter how the prisoner got past the other layers, they must either pass between the fences, escape by helicopter, or stow away in a vehicle. So you have three modalities to secure, and you don't have to evaluate all the ways the inmate might get past the other defenses.
Similarly, this applies even to Lex Luthor: intelligence doesn't change the physical bottlenecks for escape. Model evaluation is important, but securing bottlenecks that work regardless of capability level seems more productive.
And even the possibility of trap vulnerabilities (if the model knows it might be watched via a mechanism it can't detect) makes it much more dangerous for the model to misbehave.
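One way to see why the mere possibility matters: even if the model thinks an exploit path is probably real, a modest chance that it is a trap can flip the expected value of using it. An illustrative calculation with made-up numbers:

```python
# Illustrative only: every number here is a made-up assumption, not an estimate.
p_trap = 0.2          # model's credence that the "vulnerability" is a honeypot
payoff_escape = 1.0   # value to the model of escaping undetected
cost_caught = 10.0    # cost to the model of being caught red-handed

ev_attempt = (1 - p_trap) * payoff_escape - p_trap * cost_caught
print(ev_attempt)     # -1.2: net-negative even at 80% confidence the hole is real
```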
Yeah, naively, putting a bunch of honeypot vulnerabilities in our OS or in other software/hardware seems pretty good, because we get to monitor AIs so heavily. (See discussion here starting with “Writing a novel operating system …”.)