I think the “Provably Safe ML” section is my main crux. For example, you write:
One potential solution is to externally gate the AI system with provable code. In this case, the driving might be handled by an unsafe AI system, but its behavior would have “safety in the loop” by having simpler and provably safe code restrict what the driving system can output, to respect the rules noted above. This does not guarantee that the AI is a safe driver—it just keeps such systems in a provably safe box.
I currently believe that if you try to do this, you will either have to restrict the outputs so much that the car wouldn’t be able to drive well, or else fail to prove that the actions allowed by the gate are safe. Perhaps you can elaborate on why this approach seems like it could work?
(I feel similarly about other proposals in that section.)
I agree that it’s the most challenging part, and there are unsolved problems, but I don’t share your intuition that it’s in some way unsolvable, so I suspect we’re thinking of very different types of things.
For RSS specifically, Rule 5 is obviously the most challenging, but it’s also not in general required for the not-being-at-fault guarantee, and Rule 4 is largely about ensuring the relationship between sensor uncertainty in low visibility areas and the other rules—respecting distance and not hitting things—are enforced. Other than that, right of way rules are very simple, if the car correctly detects that the situation is one where they apply, and changing lanes is based on a very simple formula for distance, and assuming the car isn’t changing lanes, during driving, in order to follow the rules, you essentially only need to restrict speed, which seems like something you can check very easily.
yes, in some cases a much weaker (because it’s constrained to be provable) system can restrict the main ai, but in the case of llm jailbreaks there is no particular hope that such a guard system could work (eg jailbreaks where the llm answers in base64 require the guard to understand base64 and any other code the main ai could use)
I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that’s not true for proposals like safeguarded AI, where a proof must accompany the output, and it’s not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.
I partly disagree; steganography is only useful when it’s possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well.
That said, I’d be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar—it definitely seems like something that @davidad would want to consider.
I think the “Provably Safe ML” section is my main crux. For example, you write:
I currently believe that if you try to do this, you will either have to restrict the outputs so much that the car wouldn’t be able to drive well, or else fail to prove that the actions allowed by the gate are safe. Perhaps you can elaborate on why this approach seems like it could work?
(I feel similarly about other proposals in that section.)
I agree that it’s the most challenging part, and there are unsolved problems, but I don’t share your intuition that it’s in some way unsolvable, so I suspect we’re thinking of very different types of things.
For RSS specifically, Rule 5 is obviously the most challenging, but it’s also not in general required for the not-being-at-fault guarantee, and Rule 4 is largely about ensuring the relationship between sensor uncertainty in low visibility areas and the other rules—respecting distance and not hitting things—are enforced. Other than that, right of way rules are very simple, if the car correctly detects that the situation is one where they apply, and changing lanes is based on a very simple formula for distance, and assuming the car isn’t changing lanes, during driving, in order to follow the rules, you essentially only need to restrict speed, which seems like something you can check very easily.
yes, in some cases a much weaker (because it’s constrained to be provable) system can restrict the main ai, but in the case of llm jailbreaks there is no particular hope that such a guard system could work (eg jailbreaks where the llm answers in base64 require the guard to understand base64 and any other code the main ai could use)
I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that’s not true for proposals like safeguarded AI, where a proof must accompany the output, and it’s not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.
there’s steganography, you’d need to limit total bits not accounted for by the gating system or something to remove them
I partly disagree; steganography is only useful when it’s possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well.
That said, I’d be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar—it definitely seems like something that @davidad would want to consider.