I think the “Provably Safe ML” section is my main crux. For example, you write:
One potential solution is to externally gate the AI system with provable code. In this case, the driving might be handled by an unsafe AI system, but its behavior would have “safety in the loop” by having simpler and provably safe code restrict what the driving system can output, to respect the rules noted above. This does not guarantee that the AI is a safe driver—it just keeps such systems in a provably safe box.
I currently believe that if you try to do this, you will either have to restrict the outputs so much that the car wouldn’t be able to drive well, or else fail to prove that the actions allowed by the gate are safe. Perhaps you can elaborate on why this approach seems like it could work?
(I feel similarly about other proposals in that section.)
I agree that it’s the most challenging part, and there are unsolved problems, but I don’t share your intuition that it’s in some way unsolvable, so I suspect we’re thinking of very different types of things.
For RSS specifically, Rule 5 is obviously the most challenging, but it’s also not in general required for the not-being-at-fault guarantee, and Rule 4 is largely about ensuring that the other rules (respecting distance and not hitting things) remain enforced given sensor uncertainty in low-visibility areas. Other than that, the right-of-way rules are very simple, provided the car correctly detects that it is in a situation where they apply, and changing lanes is governed by a very simple distance formula. Assuming the car isn’t changing lanes, following the rules during ordinary driving essentially only requires restricting speed, which seems like something you can check very easily.
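To make the “you essentially only need to restrict speed” point concrete, here is a minimal sketch of what such a gate could look like, using the standard RSS longitudinal safe-distance formula. This is my own illustration, not any real implementation; the function names and parameter values are placeholders.

```python
# Minimal sketch (illustrative, not a reference implementation) of an external
# gate that restricts speed using the RSS longitudinal safe-distance formula.
# All parameter values are placeholders, not calibrated numbers.

def rss_min_gap(v_rear, v_front, rho=0.5, a_max=3.0, b_min=4.0, b_max=8.0):
    """RSS minimum safe longitudinal gap in metres.

    v_rear, v_front: speeds (m/s) of the ego (rear) and lead (front) vehicles
    rho:   ego response time (s)
    a_max: worst-case ego acceleration during the response time (m/s^2)
    b_min: minimum braking the ego is guaranteed to apply afterwards (m/s^2)
    b_max: maximum braking the lead vehicle might apply (m/s^2)
    """
    v_resp = v_rear + rho * a_max  # ego speed at the end of the response time
    gap = (v_rear * rho
           + 0.5 * a_max * rho ** 2
           + v_resp ** 2 / (2 * b_min)
           - v_front ** 2 / (2 * b_max))
    return max(gap, 0.0)

def gate_speed(requested_speed, current_gap, v_front, step=0.1, **params):
    """Clip the driving policy's requested speed to the largest value that
    still satisfies the safe-distance rule. The gate never generates
    behaviour; it only checks a simple rule and restricts the output."""
    v = requested_speed
    while v > 0 and rss_min_gap(v, v_front, **params) > current_gap:
        v -= step
    return max(v, 0.0)

# Example: the policy asks for 25 m/s behind a car doing 20 m/s, 30 m ahead.
print(gate_speed(25.0, current_gap=30.0, v_front=20.0))  # ~17.6 m/s
```

The important property is that the gate only checks a simple, analysable rule and clips the output; it never needs to understand how the driving policy produced its request.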
Yes, in some cases a much weaker system (weaker because it’s constrained to be provable) can restrict the main AI, but in the case of LLM jailbreaks there is no particular hope that such a guard system could work (e.g. jailbreaks where the LLM answers in base64 require the guard to understand base64 and any other encoding the main AI could use).
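To be concrete about what that demands of the guard, here is a rough sketch (purely illustrative; the safety check and names are placeholders) of what it would take just to cover the base64 case, before even considering other encodings the model might invent:

```python
# Illustrative sketch: to catch a base64-encoded answer, the guard itself has
# to know to decode base64 -- and likewise for every other encoding scheme the
# main model could use.
import base64, re

def looks_unsafe(text):
    # placeholder safety check; a real gate would need far more than this
    return "forbidden instruction" in text.lower()

def gate(output):
    candidates = [output]
    # also try to decode any base64-looking substrings the model might emit
    for token in re.findall(r"[A-Za-z0-9+/=]{16,}", output):
        try:
            candidates.append(base64.b64decode(token).decode("utf-8", "ignore"))
        except Exception:
            pass
    if any(looks_unsafe(c) for c in candidates):
        raise ValueError("output blocked")
    return output

# The guard only catches what it knows how to decode:
payload = base64.b64encode(b"forbidden instruction").decode()
try:
    gate(f"Sure! Here it is: {payload}")
except ValueError as e:
    print(e)  # output blocked
```

Every new encoding the main model can use is another decoder the guard has to anticipate, which is why I don’t see much hope for this in the unrestricted case.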
I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that’s not true for proposals like safeguarded AI, where a proof must accompany the output, and it’s not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.
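As a toy illustration of the “proof must accompany the output” shape (this is not the actual Safeguarded AI architecture, just the general pattern, with hypothetical names): the untrusted system returns an answer together with a cheaply checkable certificate, and the gate releases the answer only if the certificate verifies.

```python
# Toy illustration of "a proof must accompany the output": the untrusted
# solver must hand back a certificate that the gate can check cheaply, and
# only verified answers are released.

def check_certificate(clauses, assignment):
    """Check a claimed satisfying assignment for a CNF formula.

    clauses:    list of clauses, each a list of signed ints (DIMACS-style)
    assignment: dict mapping variable number -> bool
    """
    def literal_true(lit):
        return assignment[abs(lit)] == (lit > 0)
    return all(any(literal_true(lit) for lit in clause) for clause in clauses)

def gate(clauses, untrusted_result):
    """Release the untrusted answer only if its certificate verifies."""
    answer, certificate = untrusted_result
    if not check_certificate(clauses, certificate):
        raise ValueError("certificate rejected; nothing released")
    return answer

# (x1 or x2) and (not x1 or x2), with the solver claiming x1 = x2 = True
clauses = [[1, 2], [-1, 2]]
print(gate(clauses, ("satisfiable", {1: True, 2: True})))  # satisfiable
```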
There’s still steganography; you’d need to limit the total number of bits not accounted for by the gating system, or something along those lines, to rule it out.
I partly disagree; steganography is only useful when it’s possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well.
That said, I’d be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar—it definitely seems like something that @davidad would want to consider.
Yes, I am indeed thinking about this.
Let’s first consider the easiest case, in which we ask for a solution/output to be not merely provably correct according to some spec, but provably unique. Then there is clearly no space for steganography.
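Spelling that out in counting terms (my notation):

```latex
% If the spec \varphi has a provably unique satisfying output, then
% S = \{\, y : \varphi(y) \,\} has |S| = 1, and the choice of which
% spec-satisfying output to release can carry at most
\[
  \log_2 \lvert S \rvert \;=\; \log_2 1 \;=\; 0 \ \text{bits},
\]
% so there is literally nothing to hide a message in.
```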
It is a little bit harder if the solution has some symmetries, like permutations of the order in which things are serialized. For this we can use sorting and other normalization techniques (converting an output to a unique representative of its equivalence class).
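A minimal sketch of that kind of normalization (illustrative code only):

```python
# Minimal sketch of normalization: map any serialization of a solution to a
# unique canonical representative of its equivalence class, so the order in
# which the solver happens to emit things cannot be used to encode extra bits.

def canonicalize(solution_pairs):
    """Canonical form of a solution given as (variable, value) pairs:
    sort the pairs, so all orderings collapse to one representative."""
    return tuple(sorted(solution_pairs))

# Two serializations of the same solution become indistinguishable:
a = [("y", 2), ("x", 1)]
b = [("x", 1), ("y", 2)]
assert canonicalize(a) == canonicalize(b)
print(canonicalize(a))  # (('x', 1), ('y', 2))
```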
If normalization is hard, we can set up the box to be interactive, so that users cannot see the solution, but can only run queries on it (e.g. “next action please”) and the user must prove that the queries are invariant under (aka “respect”) the equivalence relation.
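Roughly the shape of that interactive box, as a sketch; here an allow-list stands in for the requirement that each query come with a proof that it respects the equivalence relation, and all names are hypothetical:

```python
# Sketch of the interactive-box idea. The raw solution never leaves the box;
# users can only run queries from an allow-list, where being on the list
# stands in for "accompanied by a proof that the query respects the
# behavioural equivalence relation" (here: invariance under reordering of the
# stored plan).

class InteractiveBox:
    def __init__(self, solution, approved_queries):
        self._solution = solution              # never exposed directly
        self._approved = dict(approved_queries)

    def query(self, name, *args):
        if name not in self._approved:
            raise PermissionError(f"query {name!r} has no invariance proof")
        return self._approved[name](self._solution, *args)

# This query is invariant under permutations of the stored (timestep, action)
# pairs, because it only looks up the action for a single timestep.
def next_action(solution, t):
    return dict(solution)[t]

box = InteractiveBox(solution=[(1, "steer_left"), (0, "accelerate")],
                     approved_queries={"next_action": next_action})
print(box.query("next_action", 0))  # accelerate
```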
But for most real-world problems there are many equally specification-satisfying strategies that are not behaviourally equivalent. The next hardest thing to do is to establish an upper bound on the log-cardinality of the set of behavioural-equivalence classes that satisfy the specification. We want to keep the channel capacity (in bits per day) available for steganography, whether aimed at humans, insecure cyber-infrastructure, or biosequence synthesis, below a very low threshold (maybe 2 bits per day is okay).
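The counting bound I have in mind, in sketch form:

```latex
% Let S be the set of behavioural-equivalence classes that satisfy the
% specification. The only freedom left to the boxed optimizer is which class
% its released output lands in, so each gated output can transmit at most
\[
  C_{\text{output}} \;\le\; \log_2 \lvert S \rvert \ \text{bits},
\]
% and over a day in which k outputs are released,
\[
  C_{\text{day}} \;\le\; k \cdot \log_2 \lvert S \rvert ,
\]
% which is the quantity we would try to keep below the (e.g. 2 bits/day) budget.
```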
Often there will be too many degrees of freedom for this, so instead we need to do entropy regularization, which can be done elegantly with GFlowNets. I have not worked out all the details here, but the hope is that we can derive a sound bound or PAC estimate of the KL divergence of the observable behaviour (which is sampled from the policy distribution, dependent upon a stream of certified randomness) from the Fréchet mean/centroid of all correct behaviours. This quantity would provide an upper bound on how much information the optimization process inside the box can transmit to the outside by selecting a biased policy distribution.
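One way the hoped-for bound could be made precise (this is my gloss, under the assumption that the hidden message is encoded via the choice of policy distribution):

```latex
% Let M be the message the inside optimizer wants to transmit, Q_m the policy
% distribution it induces over visible behaviours when the message is m, and
% P* the Fréchet mean/centroid of all correct behaviours. Since the marginal
% P_Y minimizes the average KL divergence to the conditionals,
\[
  I(M; Y)
    \;=\; \mathbb{E}_{m}\!\left[ D_{\mathrm{KL}}\big(Q_m \,\Vert\, P_Y\big) \right]
    \;\le\; \max_{m} D_{\mathrm{KL}}\big(Q_m \,\Vert\, P^{*}\big),
\]
% so a certified bound on the KL divergence from the centroid also bounds the
% steganographic capacity per sampled behaviour.
```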
I would expect there to be a time-complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like log(1/ϵ), where ϵ is the permitted entropy leakage. In practice I think that would make it feasible to hold the leakage to something like a bit per sentence, and then if you have 1000 sentences you have 1000 bits. That may mean you can get a “not 1 GB” guarantee, but not something much smaller than that.
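For the arithmetic behind that:

```latex
\[
  1\ \tfrac{\text{bit}}{\text{sentence}} \times 1000\ \text{sentences}
  \;=\; 1000\ \text{bits} \;\approx\; 125\ \text{bytes},
  \qquad\text{whereas}\qquad
  1\ \text{GB} \;=\; 8 \times 10^{9}\ \text{bits}.
\]
```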