ryan_greenblatt comments on Using an LLM perplexity filter to detect weight exfiltration

ryan_greenblatt 22 Jul 2024 22:31 UTC
4 points
0
Sadly, I don’t think this proposal works. More precisely, testing if data is low perplexity is strictly worse than compressing the data against a small language model and then actually sending out a small payload.

Encoding this much data in a way that appears as natural language would be extremely challenging.

Actually, it is quite easy. You can encode whatever bits you’d like in the remaining entropy of the natural language.

If an attacker has full control over your datacenter and you are sending tokens out, then they can encode arbitrary information via controlling the sampling procedure.

At the end of the day, bits are bits.

If you can’t compress the data further, then the remaining information could be anything.

And, if you can compress the data, then you can just ensure that the total output from the datacenter is smaller.

Testing for low perplexity is just a way to demonstrate that data is compressible. But if it is compressible, why not compress it?

Ok, so compression would also work, but why do I claim compression is better?

Compression seems strictly better for a two reasons:
- It allows for ensuring hard physical limits on the number of bits (which is easier to securely enforce).
- It allows for compression at the level of the sampling procedure such that the bits sent out per token is just the KL between the weal model and the strong model (rather than also having a term for the entropy of the strong model). If you just do perplexity of the tokens, then you have to eat the entropy term which could be big.
What links here?
- ryan_greenblatt's comment on Preventing model exfiltration with upload limits by ryan_greenblatt (22 Jul 2024 22:36 UTC; 2 points)
- Adam Karvonen's comment on Preventing model exfiltration with upload limits by ryan_greenblatt (22 Jul 2024 23:18 UTC; 1 point)
- Adam Karvonen 22 Jul 2024 23:16 UTC
  2 points
  −2
  Parent
  The perplexity filter works best if the attacker doesn’t have complete control of the entire data center, but instead limited access to some machines. An underlying assumption is that it’s easier to secure the internet connection choke point, rather than everything going on inside the data center.
  
  This can be stacked with compression by applying the perplexity filter before the compression step.
  
  Assuming that the attacker has access to the perplexity filter model, it would still reduce the possible information density if they have to perform some sort of arithmetic encoding scheme that has low perplexity. I had coworkers working on exactly this problem of encoding information into natural language using arithmetic encoding and GPT-2 small, and there’s a major trade-off between the perplexity of the language generated and the density of the encoded information, although I don’t have numbers available to quantify this trade-off.