TLDR: Memory encryption alone is indeed not enough. Modifications and rollback must be prevented too.
Memory encryption and authentication have come a long way.
Unless there’s a massive shift in ML architectures to doing lots of tiny reads/writes, overheads will be tiny. I’d guesstimate the following:
negligible performance drop / chip area increase
~1% of DRAM and cache space[1]
It's hard to build hardware or datacenters that resist sabotage if you don't do this. You end up having to trust that the maintenance people aren't messing with the equipment and that the factories haven't added any surprises to the PCBs. With the right security hardware, you trust TSMC and their immediate suppliers and no one else.
Not sure if we have the technical competence to pull it off. Apple is likely one of the few that's even close to secure, and it took them more than a decade of expensive lessons to get there. Still, we should put in the effort.
In any case, I suspect the overall benefit-to-effort ratio is higher elsewhere. I would focus on making sure the AI isn't capable of reading its own RAM in the first place, and isn't trying to.
Agreed that alignment is going to be the harder problem. Considering the amount of failure when it comes to building correct security hardware that operates on known principles … things aren't looking great.
</TLDR> rest of comment is just details
Morphable Counters: Enabling Compact Integrity Trees For Low-Overhead Secure Memories

From the paper: "Memory contents protected with MACs are still vulnerable to tampering through replay attacks. For example, an adversary can replace a tuple of { Data, MAC, Counter } in memory with older values without detection. Integrity-trees [7], [13], [20] prevent replay attacks using multiple levels of MACs in memory, with each level ensuring the integrity of the level below. Each level is smaller than the level below, with the root small enough to be securely stored on-chip."

[improvement TLDR for this paper: they find a data compression scheme for counters to increase the tree branching factor to 128 per level from 64 without increasing re-encryption when counters overflow]
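To make the replay-attack / integrity-tree idea concrete, here is a minimal toy sketch (my own illustration, not how real hardware lays anything out; block sizes, key handling, and the single-level "tree" are all simplified assumptions). Each block gets a MAC over (address, counter, data), and the counters themselves are covered by a root MAC that stays on-chip, so replaying an old { Data, MAC, Counter } tuple is caught at the root:

```python
import hmac, hashlib

KEY = b"on-chip secret key"  # held inside the chip; never written to DRAM

def mac(*parts: bytes) -> bytes:
    """8-byte MAC over the concatenated parts (truncated HMAC-SHA256)."""
    return hmac.new(KEY, b"|".join(parts), hashlib.sha256).digest()[:8]

def ctr_bytes(counters):
    return [c.to_bytes(8, "little") for c in counters]

class ToyProtectedMemory:
    """Per-block {data, counter, MAC} in 'DRAM' plus one root MAC over the counters kept 'on-chip'."""
    def __init__(self, num_blocks: int):
        self.data = [b""] * num_blocks      # off-chip
        self.macs = [b""] * num_blocks      # off-chip
        self.counters = [0] * num_blocks    # off-chip
        self.root = mac(*ctr_bytes(self.counters))  # on-chip root

    def write(self, i: int, value: bytes):
        self.counters[i] += 1               # fresh counter per write prevents replay
        self.data[i] = value
        self.macs[i] = mac(i.to_bytes(8, "little"),
                           self.counters[i].to_bytes(8, "little"), value)
        self.root = mac(*ctr_bytes(self.counters))

    def read(self, i: int) -> bytes:
        # 1) counters must still match the on-chip root (catches counter rollback)
        assert self.root == mac(*ctr_bytes(self.counters)), "tree mismatch"
        # 2) data must match its per-block MAC under the current counter
        assert self.macs[i] == mac(i.to_bytes(8, "little"),
                                   self.counters[i].to_bytes(8, "little"),
                                   self.data[i]), "MAC mismatch"
        return self.data[i]

mem = ToyProtectedMemory(4)
mem.write(0, b"v1")
old = (mem.data[0], mem.macs[0], mem.counters[0])   # attacker snapshots the old tuple
mem.write(0, b"v2")
mem.data[0], mem.macs[0], mem.counters[0] = old     # ... and replays it later
try:
    mem.read(0)
except AssertionError as e:
    print("replay detected:", e)                    # root over the counters no longer matches
```

The replayed tuple is internally consistent (its own MAC still verifies), which is exactly why the extra tree level over the counters is needed.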
Performance cost
Overheads are usually quite low for CPU workloads:
<1% extra DRAM required[1]
<<10% execution time increase
Executable code can be protected with negligible overhead by increasing the size of the rewritable authenticated block for a given counter to 4KB or more. Overhead is then comparable to that of the page tables.
For typical ML workloads, the smallest data block is already 2x larger (GPU cache lines are 128 bytes vs 64 bytes on CPUs), which halves the metadata overhead. Access patterns should be friendly too: mostly large contiguous reads/writes (rough numbers sketched after this list).
Only some unusual workloads (EG: large graph traversal/modification) see a significant slowdown, but that can be on the order of 3x.[2]
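Rough back-of-the-envelope numbers behind those overhead claims (my own arithmetic, assuming 8-byte MACs and 64-byte counter/metadata lines; exact formats vary between designs):

```python
# Per-block MAC overhead shrinks as the protected block grows.
for block in (64, 128, 4096):   # CPU cache line, GPU cache line, whole page
    print(f"{block:5d} B blocks: 8 B MAC = {8 / block:6.2%} overhead")
#    64 B blocks: 8 B MAC = 12.50% overhead   (hidden in the ECC side-band, see [1])
#   128 B blocks: 8 B MAC =  6.25% overhead
#  4096 B blocks: 8 B MAC =  0.20% overhead

# Counter tree: each 64 B metadata line covers `arity` lines below it, so the
# whole tree is a geometric series, roughly 1/(arity - 1) of protected memory.
for arity in (64, 128):         # older split counters vs the paper's morphable counters
    print(f"arity {arity:3d}: counter tree = {1 / (arity - 1):5.2%} of DRAM")
# arity  64: counter tree = 1.59% of DRAM
# arity 128: counter tree = 0.79% of DRAM   -> the "<1% extra DRAM" figure
```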
A real example (Intel SGX)
Use case: launch an application in a “secure enclave” so that the host operating system can’t read it or tamper with it.
It used an older memory protection scheme:
hash tree
Each 64 byte chunk of memory is protected by an 8 byte MAC
MACs are 8x smaller than the data they protect, so each tree level is 8x smaller than the one below
split counter modes in the linked paper can do 128x per level (level counts sketched below)
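A quick sketch of what the branching factor buys (my own arithmetic, assuming 64-byte tree nodes and counting levels until a single root node can live on-chip):

```python
def tree_levels(protected_bytes: int, fanout: int, node_bytes: int = 64) -> int:
    """Metadata levels needed before a single 64 B root can live on-chip."""
    levels, nodes = 0, protected_bytes // node_bytes   # start from the data blocks
    while nodes > 1:
        nodes = -(-nodes // fanout)    # ceil division: one parent node per `fanout` children
        levels += 1
    return levels

GiB = 1 << 30
for size in (64 * GiB, 1024 * GiB):
    print(f"{size // GiB:5d} GiB protected: 8-ary MAC tree -> {tree_levels(size, 8):2d} levels, "
          f"128-ary counter tree -> {tree_levels(size, 128)} levels")
#    64 GiB protected: 8-ary MAC tree -> 10 levels, 128-ary counter tree ->  5 levels
#  1024 GiB protected: 8-ary MAC tree -> 12 levels, 128-ary counter tree ->  5 levels
# Fewer levels means fewer extra memory accesses when a random access misses the metadata caches.
```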
The memory encryption works.
If Intel SGX enclave memory is modified, this is detected and the whole CPU locks up.
SGX was not secure. The memory encryption/authentication is solid; the rest … not so much. Wikipedia lists 8 separate vulnerabilities, including ones that allow leaking the remote attestation keys. And that’s before you get to attacks on other parts of the chip and its security software that allow dumping all the keys stored on-chip, enabling complete emulation.
AMD didn’t do any better, of course: “One Glitch to Rule Them All: Fault Injection Attacks Against AMD’s Secure Encrypted Virtualization”.
How low overheads are achieved
ECC (error correcting code) memory includes extra chips to store error correction data. Repurposing this to store MACs plus a 10-bit ECC gets rid of the extra data accesses. As long as the counters for that section of memory are cached, there’s no extra overhead. The tradeoff is going from correcting 2 flipped bits to correcting only 1.
DRAM internally reads/writes lots of data at once as part of a row; standard DDR memory activates an 8KB row to serve a single 64B cache line. We can store a row’s tree nodes inside that row to reduce read/write latency/cost, since row switches are expensive.
Memory that is rarely changed (EG: executable code) can be protected as a single large block. If we don’t need to read/write individual 64 byte chunks, then a single 4KiB page can be the rewritable unit. Overhead is then negligible if you can store MACs in place of an ECC code.
Counters don’t need to be encrypted, just authenticated. Verification can be parallelized if you have the memory bandwidth to do so.
We can also delay verification and assume the data is fine; if verification later fails, the entire chip shuts down to prevent bad results from getting out (see the sketch below).
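A conceptual sketch of that delayed-verification trick (my own pseudo-structure; real hardware tracks outstanding verifications per access, but the invariant is the same: nothing externally visible is released while a check is still pending):

```python
from dataclasses import dataclass, field

@dataclass
class DeferredVerifier:
    """Use data speculatively, but hold externally visible results until all checks clear."""
    pending_checks: list = field(default_factory=list)
    held_outputs: list = field(default_factory=list)

    def load(self, data, check):
        # Hand the data to the pipeline immediately; queue the (slow) walk up the
        # integrity tree instead of stalling the core on it.
        self.pending_checks.append(check)
        return data

    def emit(self, result):
        # Nothing is allowed off-chip while any verification is still outstanding.
        self.held_outputs.append(result)

    def drain(self):
        if not all(check() for check in self.pending_checks):
            raise SystemExit("integrity check failed: halt before results escape the chip")
        released, self.held_outputs = self.held_outputs, []
        self.pending_checks.clear()
        return released   # only now do results leave the "chip"

v = DeferredVerifier()
x = v.load(42, check=lambda: True)   # speculate on not-yet-verified data
v.emit(x * 2)
print(v.drain())                     # [84] -- released only after the checks pass
```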
[1]: Technically we need +12.5% to store MAC tags. If we assume ECC (error correcting code) memory is in use, which already adds +12.5% for ECC, we can store MAC tags plus a smaller ECC at the cost of 1 bit of error correction.
[2]: Random reads/writes bloat memory traffic by >3x since we have to deal with 2+ uncached tree levels. We can hide the latency by delaying verification of the higher tree levels and panicking if it fails before results can leave the chip (Intel SGX does exactly this). But if most traffic bloats like this, we bottleneck on memory bandwidth and performance drops a lot.
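Toy model behind that >3x figure (my own numbers; real amplification depends heavily on how much of the counter tree stays in the on-chip metadata cache):

```python
# Count 64 B memory transfers for one random access (toy model).
data_line      = 1   # the access itself
mac_line       = 0   # free if the MAC rides in the ECC side-band (see [1]), else 1
counter_levels = 2   # uncached integrity-tree levels that must be fetched

sequential_cost = 1  # metadata amortized across a long contiguous streak
random_cost = data_line + mac_line + counter_levels
print(f"random read traffic ~ {random_cost / sequential_cost:.0f}x sequential")  # ~3x

# Writes are worse: every touched tree level gets read, updated, and written back.
random_write_cost = data_line + mac_line + 2 * counter_levels
print(f"random write traffic ~ {random_write_cost:.0f}x")
```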