What you’re describing above is how BitLocker works on every modern Windows PC. The startup process involves a chain of trust, with various bootloaders verifying the next thing to start and handing off keys until Windows starts. Crucially, the keys are different if you start something that’s not Windows (i.e. not signed by Microsoft). You can’t just boot Linux and decrypt the drive, since different keys would be generated for Linux during boot and they won’t decrypt the drive.
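A toy sketch of the mechanism (my illustration of the measured-boot / key-sealing idea, not BitLocker’s actual key derivation; the stage names and device secret are made up): each stage hashes the next into a running measurement, and the disk key is derived from that measurement, so a different boot chain yields a different, useless key.

```python
import hashlib
import hmac

DEVICE_SECRET = b"fused-into-the-tpm"  # hypothetical per-device root secret

def extend(measurement: bytes, component: bytes) -> bytes:
    # PCR-style extend: new = H(old || H(component))
    return hashlib.sha256(measurement + hashlib.sha256(component).digest()).digest()

def measure_boot_chain(components: list[bytes]) -> bytes:
    m = b"\x00" * 32
    for c in components:
        m = extend(m, c)
    return m

def disk_key(measurement: bytes) -> bytes:
    # The key depends on (device secret, what actually booted); boot anything
    # else and you derive a different key that won't decrypt the drive.
    return hmac.new(DEVICE_SECRET, measurement, hashlib.sha256).digest()

windows_chain = [b"firmware", b"ms-signed bootloader", b"winload"]
linux_chain = [b"firmware", b"grub", b"vmlinuz"]

assert disk_key(measure_boot_chain(windows_chain)) != disk_key(measure_boot_chain(linux_chain))
```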
Mobile devices and game consoles are even more locked down. If there’s no bootloader unlock from your carrier or device manufacturer, and no vulnerability (hardware or software) to be found, you’re stuck with the stock OS. You can’t boot something else, because the chip will refuse to boot anything not signed by the OEM/carrier. You can’t downgrade, because fuses have been blown to prevent it and enforce a minimum revision number. Nothing booted outside that approved chain will have the keys locked up inside the chip that are needed to decrypt things and do remote attestation.
Root isn’t full control
Having root on a device isn’t the silver bullet it once was. Security is still kind of terrible and people don’t do enough to lock everything down properly, but the modern approach seems to be: 1) isolate security-critical properties/code, 2) put it in a secure, armored, protected box somewhere inside the chip, and 3) make sure you don’t stuff so much into the box that an attacker can compromise it via a bug anyway.
They tend to fail in a couple of ways:
Pointless isolation (e.g. lock encryption keys inside a box, but let the user ask the box to encrypt/decrypt anything)
the box isn’t hack-proof (examples below)
they forget about a chip debug feature that lets you read/write security-critical device memory
you can mess with chip operating voltage/frequency to make security-critical operations fail
They stuff too much stuff inside the box (e.g. a full Java virtual machine and a web server)
I expect ML accelerator security to be full of holes. They won’t get it right, but it is possible in principle.
ML accelerator security wish-list
As for what we might want for an ML accelerator:
Separating out enforcement of security critical properties:
restrict communications
supervisory code configures/approves communication channels such that all data in/out must be encrypted
There’s still steganography: with complete control over code that can read the weights or activations, an attacker can hide data in message timing, for example.
Still, protecting weights/activations in transit is a good start.
can enforce “who can talk to who”
example: ensure inference outputs must go through a supervision model that OKs them as safe (see the sketch after this list)
The chip running the supervision model can then send data to the API server and on to the customer
The inference chip literally can’t send data anywhere but to the supervision model
can enforce network isolation for dangerous experimental systems (though you really should be using airgaps for that)
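Here’s the sketch referenced above: a toy model (my own illustration, not any vendor’s interface; the chip names and helper are hypothetical) of how a supervisory channel table plus per-channel keys enforces who can talk to whom. The inference chip simply never receives key material for a customer-facing link.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Channel:
    src: str
    dst: str

# Supervisory policy: inference output may only flow to the supervision model.
APPROVED = {
    Channel("inference_chip", "supervision_chip"),
    Channel("supervision_chip", "api_server"),
}

# One symmetric key per approved channel, generated by the key-management HSM.
channel_keys = {ch: os.urandom(32) for ch in APPROVED}

def keys_for(chip: str) -> dict:
    """A chip is only provisioned with keys for channels it participates in."""
    return {ch: k for ch, k in channel_keys.items() if chip in (ch.src, ch.dst)}

# The inference chip gets exactly one key: the link to the supervision chip.
# It holds nothing that would let it talk to the API server directly.
assert set(keys_for("inference_chip")) == {Channel("inference_chip", "supervision_chip")}
```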
Enforcing code signing like what Apple does. Current-gen GPUs support virtualisation and user/kernel modes. Control what code has read/write access to what data; code should not have write access to itself. Attackers would then have to find return-oriented-programming attacks or similar that work on GPU shader code. This makes life harder for the attacker.
Apple does this on newer SoCs to prevent execution of unsigned code.
Doing that would help a lot. I’m not sure how well it plays with InfiniBand/NVLink networking, but that can be encrypted too in principle. If a virtual memory system is implemented, it’s not that hard to add a field to the page table for an encryption key index.
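For the page-table idea, a minimal sketch of what the extra field might look like (bit positions and widths are made up for illustration; real GPU page-table formats differ). The point is that software only ever writes a small key index, and the memory controller maps that index to a key slot it alone can read.

```python
# Toy page-table entry layout with a key-slot index (illustrative widths).
#   bits  0..11 : flags (present, writable, executable, ...)
#   bits 12..51 : physical frame number
#   bits 52..57 : key index -> selects a slot in the memory controller's
#                 key table; software never sees the key itself.
KEY_INDEX_SHIFT = 52
KEY_INDEX_MASK = 0x3F

def make_pte(frame: int, flags: int, key_index: int) -> int:
    assert 0 <= key_index <= KEY_INDEX_MASK
    return (key_index << KEY_INDEX_SHIFT) | (frame << 12) | flags

def key_index_of(pte: int) -> int:
    return (pte >> KEY_INDEX_SHIFT) & KEY_INDEX_MASK

pte = make_pte(frame=0x1A2B3, flags=0b011, key_index=5)
assert key_index_of(pte) == 5  # memory controller encrypts this page with slot 5
```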
Boundaries of the trusted system
You’ll need to manage keys and securely communicate with all the accelerator chips. This likely involves hardware security modules that are extra super duper secure. Decryption keys, for the data a chip must work on and for inter-chip communication, are sent securely to individual accelerator chips, similar to how keys are sent to cable TV boxes to allow them to decrypt the programs they have paid for.
This is how you actually enforce access control.
If a message is not for you, or you aren’t supposed to read/write that part of the distributed virtual memory space, you don’t get the keys to decrypt it. Simple and effective.
“You”, the running code, never touch the keys. The supervisory code doesn’t touch the keys either. Specialised crypto hardware unwraps the key and then uses it for encryption/decryption without any software on the chip ever having access to it.
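A toy model of that key-delivery flow (using the `cryptography` package’s AES key wrap; purely conceptual, since the whole point is that on real hardware the unwrap and the use of the key happen inside a crypto engine, not in software, and the chip names here are made up):

```python
import os
from cryptography.hazmat.primitives.keywrap import aes_key_wrap, aes_key_unwrap

# Each accelerator has a device key fused in at manufacturing; the
# key-management HSM knows the wrapping key for each chip it trusts.
chip_device_keys = {"accel-00": os.urandom(32), "accel-01": os.urandom(32)}

# HSM side: generate a data key for one region of the distributed address
# space and wrap it only for the chips allowed to touch that region.
data_key = os.urandom(32)
wrapped_for_accel00 = aes_key_wrap(chip_device_keys["accel-00"], data_key)

# Chip side: the crypto engine unwraps with the fused device key and uses the
# result directly for (en/de)cryption. accel-01 never received a wrapped copy,
# so it simply has nothing to unwrap.
unwrapped = aes_key_unwrap(chip_device_keys["accel-00"], wrapped_for_accel00)
assert unwrapped == data_key
```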
I appreciate your replies. I had some more time to think and now I have more takes. This isn’t my area, but I’m having fun thinking about it.
See https://en.wikipedia.org/wiki/File:ComputerMemoryHierarchy.svg
Disk encryption is table stakes. I’ll assume any virtual memory is also encrypted. I don’t know much about that.
I’m assuming no use of flash memory.
Absent homomorphic encryption, we have to decrypt in the registers, or whatever they’re called for a GPU.
So basically the question is how valuable is it to encrypt the weights in RAM and possibly in the processor cache. For the sake of this discussion, I’m going to assume reading from the processor cache is just as hard as reading from the registers, so there’s no point in encrypting the processor cache if we’re going to decrypt in registers anyway. (Also, encrypting the processor cache could really hurt performance!)
So that leaves RAM: how much added security do we get if we encrypt RAM in addition to encrypting disk?
One problem I notice: An attacker who has physical read access to RAM may very well also have physical write access to RAM. That allows them to subvert any sort of boot-time security, by rewriting the running OS in RAM.
If the processor can only execute signed code, that could help. But an attacker could still control which signed code the processor runs (by strategically changing the contents at the instruction pointer?). I suspect this is enough in practice.
A somewhat insane idea would be for the OS to run encrypted in RAM to make it harder for an attacker to tamper with it. I doubt this would help—an attacker could probably infer from the pattern of memory accesses which OS code does what. (Assuming they’re able to observe memory accesses.)
So overall it seems like with physical write access to RAM, an attacker can probably get de facto root access, and make the processor their puppet. At that point, I think exfiltrating the weights should be pretty straightforward. I’m assuming intermediate activations must be available for interpretability, so it seems possible to infer intermediate weights by systematically probing intermediate activations and solving for the weights.
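To spell out the “solve for the weights” step for the simplest case, a single linear layer: probe with known inputs, observe the activations, and the weights fall out by linear algebra. A toy numpy sketch (real networks add nonlinearities and scale, but the probing principle is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # the weights an attacker wants to recover

def probe(x):
    """Attacker-visible intermediate activation for a chosen input."""
    return W @ x                      # pre-activation of a single linear layer

# Probing with a basis of known inputs reads the columns of W straight off.
X = np.eye(3)
A = np.stack([probe(x) for x in X], axis=1)
assert np.allclose(A, W)

# With arbitrary probe inputs, solve the least-squares system W @ X = A instead.
X2 = rng.normal(size=(3, 10))
A2 = W @ X2
W_hat, *_ = np.linalg.lstsq(X2.T, A2.T, rcond=None)
assert np.allclose(W_hat.T, W)
```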
If you could run the OS from ROM, so it can’t be tampered with, maybe that could help. I’m assuming no way to rewrite the ROM or swap in a new ROM while the system is running. Of course, that makes OS updates annoying since you have to physically open things up and swap them out. Maybe that introduces new vulnerabilities.
In any case, overall I suspect the benefit-to-effort ratio is higher elsewhere. I would focus on making sure the AI isn’t capable of reading its own RAM in the first place, and isn’t trying to.
TLDR: Memory encryption alone is indeed not enough. Modifications and rollback must be prevented too.
Memory encryption and authentication have come a long way.
Unless there’s a massive shift in ML architectures to doing lots of tiny reads/writes, overheads will be tiny. I’d guesstimate the following:
negligible performance drop / chip area increase
~1% of DRAM and cache space[1]
It’s hard to build hardware or datacenters that resist sabotage if you don’t do this. You end up having to trust that the maintenance people aren’t messing with the equipment and that the factories haven’t added any surprises to the PCBs. With the right security hardware, you trust TSMC and their immediate suppliers and no one else.
Not sure if we have the technical competence to pull it off. Apple is likely one of the few that’s even close to secure, and it took them more than a decade of expensive lessons to get there. Still, we should put in the effort.
Agreed that alignment is going to be the harder problem. Considering the amount of fail when it comes to building correct security hardware that operates using known principles … things aren’t looking great.
</TLDR> rest of comment is just details
Morphable Counters: Enabling Compact Integrity Trees For Low-Overhead Secure Memories

Memory contents protected with MACs are still vulnerable to tampering through replay attacks. For example, an adversary can replace a tuple of { Data, MAC, Counter } in memory with older values without detection. Integrity-trees [7], [13], [20] prevent replay attacks using multiple levels of MACs in memory, with each level ensuring the integrity of the level below. Each level is smaller than the level below, with the root small enough to be securely stored on-chip.

[improvement TLDR for this paper: they find a data compression scheme for counters to increase the tree branching factor to 128 per level from 64 without increasing re-encryption when counters overflow]
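Some back-of-envelope arithmetic on what the bigger branching factor buys (my own numbers, assuming 64-byte blocks, a 64 GiB protected region, and one 64B counter block per `arity` children; the cutoff for “fits on-chip” is a toy choice):

```python
import math

def counter_tree(protected_bytes: int, block_bytes: int = 64, arity: int = 64):
    """Levels and counter-storage overhead for a split-counter integrity tree."""
    blocks = math.ceil(protected_bytes / block_bytes)
    level_blocks, levels, total = blocks, 0, 0
    while level_blocks > 1:
        level_blocks = math.ceil(level_blocks / arity)
        total += level_blocks
        levels += 1
    return levels, total * block_bytes / protected_bytes

for arity in (64, 128):
    levels, overhead = counter_tree(64 * 2**30, arity=arity)
    print(f"arity {arity:3d}: {levels} levels, counter overhead {overhead:.2%}")
# arity  64: 5 levels, counter overhead 1.59%
# arity 128: 5 levels, counter overhead 0.79%
```

Halving the counter overhead like this is consistent with the “<1% extra DRAM” figure above.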
Performance cost
Overheads are usually quite low for CPU workloads:
<1% extra DRAM required[1]
<<10% execution time increase
Executable code can be protected with negligible overhead by increasing the size of the rewritable authenticated blocks for a given counter to 4KB or more. Overhead is then comparable to the page table.
For typical ML workloads, the smallest data block is already 2x larger (GPU cache lines are 128 bytes vs 64 bytes on CPUs, giving a 2x reduction in metadata overhead; rough numbers in the sketch after this list). Access patterns should be nice too: large contiguous reads/writes.
Only some unusual workloads see significant slowdown (e.g. large graph traversal/modification), but this can be on the order of 3x.[2]
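The rough numbers behind the “2x larger block” point (my arithmetic, assuming an 8-byte MAC per cache line and 128-ary counter blocks as in the paper above):

```python
# Per-cache-line metadata overhead, CPU vs GPU line sizes (my arithmetic,
# assuming an 8-byte MAC per line and one 64B counter block per 128 lines).
for line_bytes in (64, 128):
    mac_overhead = 8 / line_bytes
    counter_overhead = 64 / (128 * line_bytes)
    print(line_bytes, f"{mac_overhead + counter_overhead:.2%}")
# 64  -> 13.28%  (12.50% MAC + 0.78% counters)
# 128 ->  6.64%  ( 6.25% MAC + 0.39% counters)
```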
A real example (Intel SGX)
Use case: launch an application in a “secure enclave” so that the host operating system can’t read it or tamper with it.
It used an older memory protection scheme:
hash tree
Each 64 byte chunk of memory is protected by an 8 byte MAC
MACs are 8x smaller than the data they protect so each tree level is 8x smaller
split counter modes in the linked paper can do 128x per level
The memory encryption works.
If Intel SGX enclave memory is modified, this is detected and the whole CPU locks up.
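To make the replay attack from the quoted paragraph concrete, a toy model (my sketch, not SGX’s actual construction): the MAC binds {address, counter, data}, and the current counter is itself trusted (here just held on-chip, standing in for the integrity tree), so rolling memory back to a stale {data, MAC} pair is detected and the chip “locks up”.

```python
import hashlib
import hmac
import os

KEY = os.urandom(32)                      # never leaves the memory controller

def mac(addr: int, counter: int, data: bytes) -> bytes:
    msg = addr.to_bytes(8, "little") + counter.to_bytes(8, "little") + data
    return hmac.new(KEY, msg, hashlib.sha256).digest()[:8]   # truncated 8-byte MAC

trusted_counters = {}                     # stands in for the on-chip tree root
dram = {}                                 # addr -> (data, mac) as seen by an attacker

def write(addr: int, data: bytes) -> None:
    ctr = trusted_counters.get(addr, 0) + 1
    trusted_counters[addr] = ctr
    dram[addr] = (data, mac(addr, ctr, data))

def read(addr: int) -> bytes:
    data, tag = dram[addr]
    if not hmac.compare_digest(tag, mac(addr, trusted_counters[addr], data)):
        raise SystemExit("integrity failure: lock up the chip")
    return data

write(0x1000, b"old state")
stale = dram[0x1000]
write(0x1000, b"new state")
dram[0x1000] = stale                      # attacker replays the stale tuple
read(0x1000)                              # -> integrity failure, chip locks up
```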
SGX was not secure. The memory encryption/authentication is solid. The rest … not so much. Wikipedia lists 8 separate vulnerabilities, including ones that allow leaking of the remote attestation keys. That’s before you get to attacks on other parts of the chip and its security software that allow dumping all the keys stored on-chip, allowing complete emulation.
AMD didn’t do any better, of course: One Glitch to Rule Them All: Fault Injection Attacks Against AMD’s Secure Encrypted Virtualization.
How low overheads are achieved
ECC (error correcting code) memory includes extra chips to store error-correction data. Repurposing this to store MACs + a 10-bit ECC gets rid of the extra data accesses. As long as we have the counters cached for that section of memory, there’s no extra overhead. The tradeoff is going from the ability to correct 2 flipped bits to correcting only 1 (bit-budget sketch at the end of this list).
DRAM internally reads/writes lots of data at once as part of a row; standard DDR memory reads 8KB rows rather than just a 64B cache line. We can store the tree parts for a row inside that row to reduce read/write latency/cost, since row switches are expensive.
Memory that is rarely changed (EG:executable code) can be protected as a single large block. If we don’t need to read/write individual 64 byte chunks, then a single 4KiB page can be a rewritable unit. Overhead is then negligible if you can store MACs in place of an ECC code.
Counters don’t need to be encrypted, just authenticated. Verification can be parallelized if you have the memory bandwidth to do so.
We can also delay verification and assume the data is fine; if verification fails, the entire chip shuts down to prevent bad results from getting out.
Technically we need +12.5% to store MAC tags. If we assume ECC (error correcting code) memory is in use, which already adds +12.5% for ECC, we can store MAC tags + a smaller ECC at the cost of 1 bit of error correction.
Random reads/writes bloat memory traffic by >3x, since we need to deal with 2+ uncached tree levels (the 64B of data plus two or more 64B counter blocks). We can hide latency by delaying verification of higher tree levels and panicking if it fails before results can leave the chip (Intel SGX does exactly this). But if most traffic bloats, we bottleneck on memory bandwidth and performance drops a lot.
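The bit-budget sketch promised above (my arithmetic, assuming a standard ECC DIMM’s 8 extra bytes per 64-byte line and a single Hamming-style single-error-correcting code over the whole line):

```python
# Bit budget for stuffing a MAC into the ECC chips of a 64-byte cache line.
data_bits = 64 * 8            # 512 data bits per line
ecc_bits = 8 * 8              # ECC DIMMs add 8 bytes per 64-byte line (+12.5%)

# A single-error-correcting Hamming code over the whole 512-bit line needs the
# smallest k with 2**k >= data_bits + k + 1.
k = next(k for k in range(1, 64) if 2**k >= data_bits + k + 1)
print(k)                      # 10 check bits
print(ecc_bits - k)           # 54 bits left over for a truncated MAC tag
```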