attempting prevention without good detection might lead to sneaky scheming which is potentially worse than doing nothing at all
My mental model of this is that there are attractor basins towards “scheming”, and that weakening or eliminating the obvious basins could end up leaving only the weirder basins.
You talked about detection and prevention. I’m wondering what you think about the prospect of intentionally eliciting visible scheming capabilities, forming an attractor to visible scheming in a way that (ideally strictly) dominates sneaky scheming. Do you know of any published examples of “sneakier” scheming in the absence of a scratchpad becoming “less sneaky” when we give it one, in the sense that it takes more “cognitive effort” in the attractor space for it to scheme sneakily than visibly? If this were the case, a “sneaky model” might be transformed into a non-sneaky one by introducing the legible scratchpad. And then if it schemes in the presence of the scratchpad, that could be evidence towards hard-to-detect scheming in its absence, even if we can’t directly observe where or how that’s happening. That could give us a compass for finding when internal scheming dynamics might be happening.
Inversely/perversely—do you know of any work on convincing the AI that we already do have the ability to detect all scheming and will penalise even sneaky scheming “thoughtcrime”? (Which is, I think, the end goal if we truly could detect it.) I agree that it would be better to actually be able to do it, but how much would we expect such a strategy to reduce invisible risks? Feels like this could also be empirically estimated to see if it induces constraints on a legible scratchpad. (can we get the Vatican to fund training an AI model to feel Catholic Guilt???)
I think a model that is a sneaky schemer with 20% and a non-schemer with 80% is probably worse than a model that is a 100% known visible schemer
Definitely share this gut feeling—by coincidence I even used the same ratio in my post here for an analogous situation:
if I were to say my self-driving car has a coalescence of 0.8, it follows a consistent policy 80% of the time in its domain, where the other 20% it’s… doing something else. If that 80% is driving correctly, great! But that car is nonetheless less predictable than a truck with a coalescence of 1 with the consistent policy of “crash rapidly”. And that car definitely isn’t acting as much like a utilitarian agent, whereas the Deathla Diebertruck knows exactly what it wants.
I think even a model known to be a 100% sneaky schemer is a lower risk than a 0.1% sneaky one, because the “stochastic schemer” is unreliably unreliable, rather than reliably unreliable. And “99.9% non-sneaky” could be justified as “safe enough to ship”, which given scale and time turns into “definitely sneaky eventually” in the ways we really don’t want to happen.
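To put rough numbers on the “given scale and time” part: one way to read it, assuming each episode is an independent draw with a 0.1% chance of sneaky scheming (independence is my simplifying assumption here, not something established above), is that the chance of at least one sneaky episode is 1 - (1 - p)^n. The function name is just illustrative.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Chance of at least one event in n independent trials, each with probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Assumed per-episode rate of 0.1%, at a few deployment scales.
    for n in (100, 1_000, 10_000, 100_000):
        print(f"{n:>7} episodes: {prob_at_least_one(0.001, n):.4f}")
```

Under that reading the probability is already past one half at around a thousand episodes, which is tiny compared to any real deployment.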
(epistemic status: if the following describes an already known and well-studied object in the LLM literature please point me in the right direction to learn more. but it is new to me and maybe new to you!)
I’ve spent most of this week constructing and plotting what I’m terming “holographemes” because what’s the point of doing science if you can’t coin dumb jargon, and we’re in the golden age where “mad linguistics” is finally becoming a real branch of mad science. They’re next-token prediction trees over a known percentage of the full probability distribution from an LLM, up to the point of an end-of-string token. In this sense they’re like a “holograph” of the LLM output space’s grapheme sequences; you can observe different possible sentence constructions, where in standard Monte Carlo sampling you’re only ever seeing chains of one at a time. Unlike Monte Carlo sampling, a holographeme gives you a formal guarantee around how likely specific outputs are on the generator level. The downside of this is that they take a while to generate, since the space of next token sequences grows exponentially; you can get around this somewhat using a sane search algorithm prioritising the expansion of heavy regions of probability mass in the existing tree, or otherwise constraining Top-K.
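Here is a minimal sketch of the construction, not my actual implementation: a best-first search that always expands the heaviest unfinished prefix until the completed strings cover a target share of the probability mass. `toy_next_token_probs` is a hypothetical stand-in for querying the LLM, and `build_holographeme`, `target_mass` and `max_expansions` are names made up for illustration.

```python
import heapq
from typing import Callable, Dict, Tuple

EOS = "<eos>"
Prefix = Tuple[str, ...]


def toy_next_token_probs(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM's next-token distribution."""
    if len(prefix) == 0:
        return {"Brazil": 0.5, "Japan": 0.3, "New": 0.2}
    if prefix[-1] == "New":
        return {"Zealand": 0.9, "Caledonia": 0.1}
    return {EOS: 1.0}


def build_holographeme(
    next_probs: Callable[[Prefix], Dict[str, float]],
    target_mass: float = 0.95,
    max_expansions: int = 1000,
) -> Tuple[Dict[Prefix, float], float]:
    """Expand the heaviest prefix first; return finished strings and residual mass."""
    complete: Dict[Prefix, float] = {}
    covered = 0.0
    frontier = [(-1.0, ())]  # max-heap via negated joint probability
    expansions = 0
    while frontier and covered < target_mass and expansions < max_expansions:
        neg_p, prefix = heapq.heappop(frontier)
        expansions += 1
        for token, p in next_probs(prefix).items():
            joint = -neg_p * p
            if token == EOS:
                # A finished string: its joint probability is known exactly.
                complete[prefix] = joint
                covered += joint
            else:
                heapq.heappush(frontier, (-joint, prefix + (token,)))
    # Whatever mass is not in a finished string is the residual uncertainty.
    return complete, 1.0 - covered


if __name__ == "__main__":
    tree, residual = build_holographeme(toy_next_token_probs)
    for prefix, prob in tree.items():
        print(" ".join(prefix), round(prob, 4))
    print("residual mass:", round(residual, 4))
```

Each finished string carries its exact probability under the generator, and the residual bounds everything not yet enumerated, which is where the guarantee comes from.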
An important feature of holographemes is that you can reweight probability masses on the graph when values such as temperature change, or for a reduction of Top-K, without interacting any further with the LLM! And they can also output exactly what the residual uncertainty is after these changes, which constitutes strictly positive error bars on any of the categories. Reweighting the edges of a holographeme is computationally much cheaper than re-running Monte Carlo tests on the LLM with different parameters; in fact, you can get closed-form representations of output as an expression of temperature. This lets us efficiently plot equivalence classes’ likelihoods as temperature changes, to see how the parameter changes the output distributions in a smooth (differentiable) way. This is a preliminary continuation of my post here, where I’m wanting to find cases of “semantic attraction” towards particular choices (and especially ethical ones). Having this as a tool lets me talk about nonsense like “joint holographeme consistency” between prompts, where we want to determine when responses to related queries can be expected to give coherent answers. It also means I can treat an LLM as a function of its prompt: rather than non-deterministically choosing one string output, it can deterministically output a holographeme up to some probability mass fidelity. This is another way of turning an LLM into a “deterministic thing” aside from just setting T=0: it outputs objects that we can then non-deterministically sample over as a separate step.
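A minimal sketch of the temperature reweighting, assuming each node stores the T=1 probabilities of its full Top-K child set. Since softmax(logits / T) is proportional to p^(1/T), and Top-K sampling renormalises over the same K children at any positive temperature, the reweighted edge is p_i^(1/T) / Σ_j p_j^(1/T), and a string's probability is the product of those along its path. `reweight` and `path_probability` are hypothetical names, and this sketch ignores the residual mass from unexpanded branches.

```python
from typing import Dict, Sequence


def reweight(children: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Rescale one node's stored T=1 child probabilities to a new temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = {tok: p ** (1.0 / temperature) for tok, p in children.items()}
    total = sum(powered.values())
    return {tok: value / total for tok, value in powered.items()}


def path_probability(
    nodes: Sequence[Dict[str, float]],
    tokens: Sequence[str],
    temperature: float,
) -> float:
    """Probability of one path: the product of its reweighted edges."""
    prob = 1.0
    for children, token in zip(nodes, tokens):
        prob *= reweight(children, temperature)[token]
    return prob


if __name__ == "__main__":
    # Made-up distributions for illustration, not real model output.
    root = {"Brazil": 0.5, "Japan": 0.3, "New": 0.2}
    after_new = {"Zealand": 0.9, "Caledonia": 0.1}
    for t in (0.5, 1.0, 2.0):
        p = path_probability([root, after_new], ["New", "Zealand"], t)
        print(f"T={t}: P(New Zealand) = {p:.4f}")
```

No model calls happen here, so sweeping temperature is just arithmetic over the stored tree.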
The one below is a holographeme asking the LLM (Phi 4 Mini Instruct) to pick a random country; I’ve used a Top-K of 8 just to avoid exploding probabilities. If I perform an early closure over semantic classes—i.e. ending tree branches once a unique country name has been output—then this can be an even more compact representation of the decision space. It seems to be a big fan of Brazil and Japan!