Tao Lin comments on Proveably Safe Self Driving Cars [Modulo Assumptions]

Tao Lin 18 Sep 2024 4:04 UTC
1 point
0
yes, in some cases a much weaker (because it’s constrained to be provable) system can restrict the main ai, but in the case of llm jailbreaks there is no particular hope that such a guard system could work (eg jailbreaks where the llm answers in base64 require the guard to understand base64 and any other code the main ai could use)
- Davidmanheim 18 Sep 2024 10:46 UTC
  2 points
  0
  Parent
  I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that’s not true for proposals like safeguarded AI, where a proof must accompany the output, and it’s not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.
  - Tao Lin 18 Sep 2024 17:02 UTC
    1 point
    0
    Parent
    there’s steganography, you’d need to limit total bits not accounted for by the gating system or something to remove them
    - Davidmanheim 22 Sep 2024 13:52 UTC
      4 points
      0
      Parent
      I partly disagree; steganography is only useful when it’s possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well.
      
      That said, I’d be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar—it definitely seems like something that @davidad would want to consider.
      - davidad 4 Oct 2024 14:18 UTC
        7 points
        5
        Parent
        Yes, I am indeed thinking about this.
        
        Let’s first consider the easiest case, in which we ask for a solution/output to be not merely provably correct according to some spec, but provably unique. Then there is clearly no space for steganography.
        It is a little bit harder if the solution has some symmetries, like permutations of the order in which things are serialized. For this we can use sorting and other normalization techniques (converting an output to a unique representative of its equivalence class).
        If normalization is hard, we can set up the box to be interactive, so that users cannot see the solution, but can only run queries on it (e.g. “next action please”) and the user must prove that the queries are invariant under (aka “respect”) the equivalence relation.
        But for most real-world problems there are many equally specification-satisfying strategies that are not behaviourally equivalent. The next hardest thing to do is to establish an upper bound on the log-cardinality of the set of behavioural-equivalence-classes that satisfy the specification. We want to keep the channel capacity (in bits per day) available for steganography, to humans or insecure cyber-infrastructure or biosequence synthesis, to a very low threshold (maybe 2 bits per day is okay).
        Often there will be too many degrees of freedom for this, so instead we need to do entropy regularization, which can be done elegantly with GFlowNets. I have not worked out all the details here but the hope is that we can derive a sound bound or PAC estimate of the KL divergence of the observable behaviour (which is sampled from the policy distribution dependent upon a stream of certified randomness) from the Frechet mean/centroid of all correct behaviours. This quantity would provide an upper bound on how much information the optimization process inside the box can transmit to the outside by having selecting a biased policy distribution.
        Geoffrey Irving 15 Oct 2024 13:58 UTC
        1 point
        0
        Parent
        I would expect there to be a time complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like $log (1 / ϵ)$ where $ϵ$ is the desired entropy leakage. In practice I think that would make it feasible to not leak something like a bit per sentence, and then if you have 1000 sentence you have 1000 bits. That may mean you can get a “not 1GB” guarantee, but not something smaller than that.