johnswentworth comments on Why Don’t We Just… Shoggoth+Face+Paraphraser?

johnswentworth 21 Nov 2024 4:38 UTC
12 points
10
I haven’t decided yet whether to write up a proper “Why Not Just...” for the post’s proposal, but here’s an overcompressed summary. (Note that I’m intentionally playing devil’s advocate here, not giving an all-things-considered reflectively-endorsed take, but the object-level part of my reflectively-endorsed take would be pretty close to this.)
Charlie’s concern isn’t the only thing it doesn’t handle. The only thing this proposal does handle is an AI extremely similar to today’s, thinking very explicitly about intentional deception, and even then the proposal only detects it (as opposed to e.g. providing a way to solve the problem, or even a way to safely iterate without selecting against detectability). And that’s an extremely narrow chunk of the X-risk probability mass—any significant variation in the AI breaks it, any significant variation in the threat model breaks it. The proposal does not generalize to anything.
Charlie’s concern is just one specific example of a way in which the proposal does not generalize. A proper “Why Not Just...” post would list a bunch more such examples.
And as with Charlie’s concern, the meta-level problem is that the proposal also probably wouldn’t get us any closer to handling those more-general situations. Sure, we could make some very toy setups (like the chess thing), and see what the shoggoth+face AI does on those very toy setups, but we get very few bits, and the connection is very tenuous to both other threat models and AIs with any significant differences from the shoggoth+face. Accounting for the inevitable failure to measure what we think we’re measuring (with probability close to 1), such experiments would not actually get us any closer to solving any of the problems which constitute the bulk of the X-risk probability mass. It’s not “a start”, because “a start” would imply that the experiment gets us closer, i.e. that the problem gets easier after doing the experiment. If you try to think about the You Are Not Measuring What You Think You Are Measuring problem as “well, we got at least some tiny epsilon of evidence, right?”, then you will shoot yourself in the foot; such reasoning is technically correct, but the correct value of epsilon is small enough that the correct update from it is not distinguishable from zero in practice.
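To illustrate the "tiny epsilon of evidence" point numerically, here is a minimal sketch of a Bayesian update in odds form. The prior of 0.30 and the likelihood ratios are made-up numbers chosen only for illustration; nothing in the thread specifies them. A likelihood ratio near 1 stands in for an experiment that is only weakly connected to the thing we think it measures.

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


if __name__ == "__main__":
    prior = 0.30  # hypothetical prior credence, for illustration only
    # Hypothetical likelihood ratios: two near 1 (weak evidence), one strong.
    for lr in (1.001, 1.01, 2.0):
        p = posterior(prior, lr)
        print(f"LR={lr}: {prior:.4f} -> {p:.4f} (shift {p - prior:+.4f})")
```

Under these assumed numbers, the weak-evidence cases move the credence by far less than any realistic uncertainty in the prior itself, which is one way to read "not distinguishable from zero in practice".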
- johnswentworth 21 Nov 2024 16:55 UTC
  25 points
  5
  One example, to add a little concreteness: suppose that the path to AGI is to scale up o1-style inference-time compute, but it requires multiple OOMs of scaling. So it no longer has a relatively-short stream of “internal” thought; it’s more like the natural-language record of an entire simulated society.
  Then:
  - There is no hope of a human reviewing the whole thing, or any significant fraction of the whole thing. Even spot checks don’t help much, because it’s all so context-dependent.
  - Accurate summarization would itself be a big difficult research problem.
  - There’s likely some part of the simulated society explicitly thinking about intentional deception, even if the system as a whole is well aligned.
  - … but that’s largely irrelevant, because in the context of a big complex system like a whole society, the effects of words are very decoupled from their content. Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans’ words).
  … and that’s how the proposal breaks down, for this example.
  - Daniel Kokotajlo 21 Nov 2024 18:48 UTC
    19 points
    6
    I think some background context here is that I am not proposing shoggoth+face+paraphraser as a complete alignment solution, as something which we should just build a sovereign with and then let rip. I agree that would go horribly wrong in a bunch of ways.
    
    Instead, I’m saying that this is something we should do, now, that will significantly advance alignment science & governance etc. Also, I do think we’ll be able to get some useful work out of S+F+P systems that we otherwise wouldn’t be able to get.
    
    To get on the object level and engage with your example:
    
    --It’s like the difference between a company whose internal slack messages, comms, etc. are all kept secret from the public, and a company whose messages are all visible to the public. Or replace ‘public’ with ‘regulator.’ It’s not a panacea but it helps a ton.
    --My impression from being at OpenAI is that you can tell a lot about an organization’s priorities by looking at their internal messaging and such dumb metrics as ‘what do they spend their time thinking about.’ For example, ‘have they bothered to make a writeup of the costs and benefits, from an altruistic perspective, of major decision X at all?’ and ‘Now there is a writeup: was it made before, or after, the decision to do X?’
    I think a similar thing would be true for this hypothetical giant bureaucracy/civilization of S+F+P CoT.
    
    Again, not a final solution. But it buys us time and teaches us things.
  - Steven Byrnes 28 Dec 2024 14:54 UTC
    11 points
    5
    > Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans’ words).
    I think you’re implying (deliberately or not) something overly pessimistic (in this narrow point).
    Your example is of an intention for something complex and a-priori-implausible to happen (intervention to reduce poverty), but the intention doesn’t actualize. But then your second sentence suggests the reverse: something complex and a-priori-implausible but superficially non-random does happen without a related intention.
    If something complex and a-priori-implausible but superficially non-random happens, then I think there must have been some kind of search or optimization process leading to it. It might be at learning time or it might be at inference time. It might be searching for that exact thing, or it might be searching for something downstream of it or correlated with it. But something. And thus there’s some hope to notice whatever that process is. If it’s at learning time, then we can try to avoid the bad incentives. If it’s at inference time, then there would be an in-principle-recognizable “intention” somewhere in the system, contrary to what you wrote.
    (It’s also true that dangerous things can happen that are not a-priori-implausible and thus don’t require any search or optimization process, like killing everyone by producing pollution. That still seems like a more tractable problem than if we’re up against adversarial planning, e.g. treacherous turns.)