I think 3 is close to an analogy where the LLM is using ‘sacrifice’ in a different way than we’d endorse on reflection, but why should it waste time rationalizing to itself in the CoT? All LLMs involved should just be fine-tuned to not worry about it (they’re collaborators, not adversaries), and so collapse to the continuum between cases 1 and 2, where noncentral sorts of sacrifices get ignored first, then gradually more, until at the limit of RL all sacrifices are ignored.
Another thing to think about analogizing is how an AI doing difficult things in the real world is going to need to operate in domains that are automatically noncentral to human concepts. It’s as if what we really cared about was whether the AI would sacrifice in 20-move-long sequences that we can’t agree on a criterion to evaluate. One could try a sandwiching experiment where you sandwich both the pretraining corpus and the RL environment.
I think we don’t disagree in terms of what our models predict here. I am saying we should do the experiment and see what happens; we might learn something.
The problem with that sort of attitude is that, when the “experiment” yields so few bits and has such a tenuous connection to the thing we actually care about (as in Charlie’s concern), that’s exactly when You Are Not Measuring What You Think You Are Measuring bites real hard. Like, sure, you’ll see this system do something in the toy chess experiment, but that’s just not going to be particularly relevant to the things an actual smarter-than-human AI does in the situations Charlie’s concerned about. If anything, the experimenter is far more likely to fool themselves into thinking their results are relevant to Charlie’s concern than they are to correctly learn anything relevant to Charlie’s concern.
That’s a reasonable point and a good cautionary note. Nevertheless, I think someone should do the experiment I described. It feels like a good start to me, even though it doesn’t solve Charlie’s concern.
I haven’t decided yet whether to write up a proper “Why Not Just...” for the post’s proposal, but here’s an overcompressed summary. (Note that I’m intentionally playing devil’s advocate here, not giving an all-things-considered reflectively-endorsed take, but the object-level part of my reflectively-endorsed take would be pretty close to this.)
Charlie’s concern isn’t the only thing it doesn’t handle. The only thing this proposal does handle is an AI extremely similar to today’s, thinking very explicitly about intentional deception, and even then the proposal only detects it (as opposed to e.g. providing a way to solve the problem, or even a way to safely iterate without selecting against detectability). And that’s an extremely narrow chunk of the X-risk probability mass—any significant variation in the AI breaks it, any significant variation in the threat model breaks it. The proposal does not generalize to anything.
Charlie’s concern is just one specific example of a way in which the proposal does not generalize. A proper “Why Not Just...” post would list a bunch more such examples.
And as with Charlie’s concern, the meta-level problem is that the proposal also probably wouldn’t get us any closer to handling those more-general situations. Sure, we could make some very toy setups (like the chess thing), and see what the shoggoth+face AI does on those very toy setups, but we get very few bits, and the connection is very tenuous to both other threat models and AIs with any significant differences from the shoggoth+face. Accounting for the inevitable failure to measure what we think we’re measuring (with probability close to 1), such experiments would not actually get us any closer to solving any of the problems which constitute the bulk of the X-risk probability mass. It’s not “a start”, because “a start” would imply that the experiment gets us closer, i.e. that the problem gets easier after doing the experiment. If you try to think about the You Are Not Measuring What You Think You Are Measuring problem as “well, we got at least some tiny epsilon of evidence, right?”, then you will shoot yourself in the foot; such reasoning is technically correct, but the correct value of epsilon is small enough that the correct update from it is not distinguishable from zero in practice.
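To put the “tiny epsilon” point in rough quantitative terms, here is a minimal sketch of a Bayesian update in odds form. The prior and the likelihood ratios are made-up illustrative numbers, not estimates from anyone in this thread; the only point is that when an experiment’s outcome is almost equally likely under both hypotheses (likelihood ratio near 1), the posterior barely moves.

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical prior that some safety-relevant hypothesis is true.
PRIOR = 0.30

# Hypothetical likelihood ratios: a strongly informative experiment,
# a weakly informative one, and one whose connection to the real
# question is so tenuous that it is almost uninformative.
for lr in (10.0, 2.0, 1.01):
    post = posterior(PRIOR, lr)
    print(f"likelihood ratio {lr:>5}: {PRIOR:.3f} -> {post:.3f} "
          f"(shift {post - PRIOR:+.4f})")
```

Under these assumed numbers, the near-1 likelihood ratio is the “technically nonzero, practically zero” case. Whether a given toy experiment really sits that close to 1 is exactly what the two sides here disagree about, and the sketch does not settle it.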
One example, to add a little concreteness: suppose that the path to AGI is to scale up o1-style inference-time compute, but it requires multiple OOMs of scaling. Then it no longer has a relatively short stream of “internal” thought; it’s more like the natural-language record of an entire simulated society.
Then:
There is no hope of a human reviewing the whole thing, or any significant fraction of the whole thing. Even spot checks don’t help much, because it’s all so context-dependent.
Accurate summarization would itself be a big difficult research problem.
There’s likely some part of the simulated society explicitly thinking about intentional deception, even if the system as a whole is well aligned.
… but that’s largely irrelevant, because in the context of a big complex system like a whole society, the effects of words are very decoupled from their content. Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans’ words).
… and that’s how the proposal breaks down, for this example.
I think some background context here is that I am not proposing shoggoth+face+paraphraser as a complete alignment solution, as something which we should just build a sovereign with and then let rip. I agree that would go horribly wrong in a bunch of ways.
Instead, I’m saying that this is something we should do, now, that will significantly advance alignment science & governance etc. Also, I do think we’ll be able to get some useful work out of S+F+P systems that we otherwise wouldn’t be able to get.
To get on the object level and engage with your example:
--It’s like the difference between a company whose internal Slack messages, comms, etc. are all kept secret from the public, and a company whose messages are all visible to the public. Or replace ‘public’ with ‘regulator.’ It’s not a panacea but it helps a ton.
--My impression from being at OpenAI is that you can tell a lot about an organization’s priorities by looking at their internal messaging and such dumb metrics as ‘what do they spend their time thinking about.’ For example, ‘have they bothered to make a writeup of the costs and benefits, from an altruistic perspective, of major decision X at all?’ and ‘Now there is a writeup; was it made before, or after, the decision to do X?’
I think a similar thing would be true for this hypothetical giant bureaucracy/civilization of S+F+P CoT.
Again, not a final solution. But it buys us time and teaches us things.
“Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans’ words).”
I think you’re implying (deliberately or not) something overly pessimistic (in this narrow point).
Your example is of an intention for something complex and a-priori-implausible to happen (intervention to reduce poverty), but the intention doesn’t actualize. But then your second sentence suggests the reverse: something complex and a-priori-implausible but superficially non-random does happen without a related intention.
If something complex and a-priori-implausible but superficially non-random happens, then I think there must have been some kind of search or optimization process leading to it. It might be at learning time or it might be at inference time. It might be searching for that exact thing, or it might be searching for something downstream of it or correlated with it. But something. And thus there’s some hope to notice whatever that process is. If it’s at learning time, then we can try to avoid the bad incentives. If it’s at inference time, then there would be an in-principle-recognizable “intention” somewhere in the system, contrary to what you wrote.
(It’s also true that dangerous things can happen that are not a-priori-implausible and thus don’t require any search or optimization process, like killing everyone by producing pollution. That still seems like a more tractable problem than if we’re up against adversarial planning, e.g. treacherous turns.)