abramdemski

Karma: 19,485

abramdemski Nov 20, 2024, 7:14 PM
6 points
4
in reply to: Noosphere89’s comment on: Why Don’t We Just… Shoggoth+Face+Paraphraser?
I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don’t find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between it’s capability to retrieve links and it’s ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximately linking.)
What do you mean? I don’t get what you are saying is convincing.
Perhaps I should clarify my belief.
The o1 report says the following (emphasis mine):
0.8% of o1-preview’s responses got flagged as being ‘deceptive’ [...] roughly two thirds of which appear to be intentional (0.38%), meaning that there was some evidence in the chain of thought that o1-preview was aware that the answer was incorrect [...] Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead. [...] It is encouraging that, in the analysis presented below, while our monitor did find a few forms of the model knowingly presenting incorrect information to the user or omitting important information, it did not find any instances of o1-preview purposely trying to deceive the user for reasons other than satisfying the user request.
- Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
  - No, I don’t think so.
- Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
  - This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don’t think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
- Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
  - OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
  - However, I don’t buy the distinction they draw in the o1 report about not finding instances of “purposefully trying to deceive the user for reasons other than satisfying the user request”. Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was “trying” to do, and whether it “understands” that fake URLs don’t satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, “provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong” (probably due to the RL incentive).
  - More importantly, OpenAI’s overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.

abramdemski Nov 20, 2024, 6:39 PM
4 points
0
in reply to: Nathan Helm-Burger’s comment on: Why Don’t We Just… Shoggoth+Face+Paraphraser?
I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth’s contribution vs Face’s contribution.
So, how do you extract the desired credit-assignment info out of this?
(Also, how do you then use that info?)
If I have 10 students but I can only test them in pairs, I can get some information on how good individual students are by testing multiple different pairings.
If I have just Alice and Bob, I pairing them up multiple times doesn’t help so much.

abramdemski Nov 20, 2024, 6:12 PM
6 points
4
in reply to: Daniel Kokotajlo’s comment on: Why Don’t We Just… Shoggoth+Face+Paraphraser?
The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you’ll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.
So, I imagine that prompting a vanilla LLM would not be a great approach.
I think first you want to build a fairly good “semantic similarity score” (this shouldn’t be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph’s idea here.
I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.

abramdemski Nov 20, 2024, 5:56 PM
8 points
3
in reply to: anaguma’s comment on: Why Don’t We Just… Shoggoth+Face+Paraphraser?
In my view, the misaligned behavior is split across the shoggoth and face in most cases. In the case which Daniel K anticipates, the Shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing.
So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other divisions such as those you mention.
One reason is that we already see this empirically in the o1 report.
However, to me, the distinction between your case 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected.
There could be other cases where this doesn’t happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.
If you give a kid a diary and try to avoid giving them any reason to think you’re reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn’t previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, it’s better than nothing.

abramdemski Nov 19, 2024, 8:56 PM
LW: 6 AF: 4
0
AF
in reply to: abramdemski’s comment on: 5 ways to improve CoT faithfulness
See here for an updated version of my thinking based on discussion with Daniel:
https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser

Why Don’t We Just… Shoggoth+Face+Paraphraser?

Daniel Kokotajlo and abramdemski

Nov 19, 2024, 8:53 PM

145 points

58 comments14 min readLW link

abramdemski Nov 19, 2024, 4:27 PM
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: 5 ways to improve CoT faithfulness
Yeah, I agree that’s at least somewhat true. So, given that, is it a good move or not? I don’t see much of an upside, since there’s a heavy financial incentive for the big labs to do little about concerning results observed in such models. IE, when it comes to the question of whether frontier labs training o1-like models should follow OpenAI’s example of avoiding safety training for the CoT, I think it’s right to discourage them from doing this rather than encourage them… although, I say this with very low confidence.

abramdemski Nov 18, 2024, 5:33 PM
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: 5 ways to improve CoT faithfulness
I don’t think it is worth getting into all the stuff you’re mentioning here, but I think a key crux is that I’m expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).
I’m reacting to this whole thing as a possible default configuration going forward, rather than as it exists today. All of my concerns are about what happens when you scale it up. For example, I don’t find o1′s current level of deception hugely concerning in its own right; rather, I see it as a bad sign for the technology as it continues to develop.
I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn’t make very much sense to me as a safety method intended for deployment.

abramdemski Nov 16, 2024, 4:33 PM
LW: 3 AF: 3
0
AF
in reply to: Caleb Biddulph’s comment on: 5 ways to improve CoT faithfulness
Ah, yep, this makes sense to me.

abramdemski Nov 16, 2024, 4:32 PM
LW: 2 AF: 2
0
AF
in reply to: Caleb Biddulph’s comment on: 5 ways to improve CoT faithfulness
Yeah, this is a good point, which doesn’t seem addressed by any idea so far.

abramdemski Nov 16, 2024, 4:31 PM
LW: 3 AF: 2
0
AF
in reply to: Noosphere89’s comment on: o1 is a bad idea
Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates).
LLMs touch on the real world far more than that, such that MCTS-like skill at navigating “the LLM world” in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.

abramdemski Nov 16, 2024, 4:25 PM
LW: 3 AF: 3
0
AF
in reply to: Caleb Biddulph’s comment on: 5 ways to improve CoT faithfulness
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
Well, I’m not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven’t looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction.
This is especially likely to hold under the following conditions:
We can also combine this with other proposals, such as paraphrasing.

abramdemski Nov 16, 2024, 4:19 PM
LW: 2 AF: 2
0
AF
in reply to: Caleb Biddulph’s comment on: 5 ways to improve CoT faithfulness
Yeah, you’re right, I no longer think it’s an interesting proposal.

abramdemski Nov 16, 2024, 4:05 PM
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: 5 ways to improve CoT faithfulness
I’m talking about training only the Face, not training the policy (shoggoth) at all with the proposal I’m imagining.
And, these should clearly be separate models such that the training of one doesn’t generalize to the other.
So, making the Face more deceptive doesn’t kill the canary?
Ah, yes, I had neglected this important distinction.
So what do you do in the end, throw away the face?
It seems worth pointing out that although the CoT isn’t trained to be deceptive, it is trained to think thoughts which help out a deceptive face. So it seems plausible that a multi-step training process like:
GPT → basic chat training → shoggoth+face training → discard face & align shoggoth
Would result in deceptive capabilities: the shoggoth CoT learns to explicitly reason about what helps deceive the alignment training in the shoggoth+face step, and then in the final step, some of this is destroyed but some is driven underground. IE, learning to explicitly reason about deception creates latent structure which is useful for that task, and this latent structure will somewhat transfer to the case where the deception has to be hidden from the CoT.
The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).
So it sounds like we have some disagreement about what degree of deception is considered problematic. Again, I’m worried about basin-of-attraction type dynamics. We can think of a whole frontier lab as collectively being a sort of self-improving AGI; we want that recursive self-improvement to be going down a good path rather than a bad one. The LLMs are already helping to improve themselves in little ways (EG as coding assistants) and big ways (EG constitutional AI). So it’s an environment where seeds of deceptiveness can grow and take some control of the overall trajectory.
Maybe your concern is that it is plausible that the policy won’t be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn’t smart enough to cleverly scheme (unless this schemeing happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?
I’m not sure what you’re getting at here. It seems like the idea is that Face + CoT is, together, a clever schemer—in the o1 report, we already see cases of explicit scheming in the CoT. So then, using the Face+CoT together (EG trying to train it further, using it in supervision steps for later iterations of the system, etc) is dangerous and difficult for the same reason that a single deceptive LLM would be. If we use Face+CoT together, we have to reason about them together in that way. Or, if we are using them as separate pieces, then we have to account for the part of the overall deceptive-schemer strategy they’ve learned.
I guess you could use Face+CoT as a supervisor (EG in constitutional AI) by looking at the CoT (both by hand and using some automation) and throwing out any deceptive examples, so as to not use them for supervision.
But the overall goal should be to produce a highly non-deceptive system in the end, I think.
Another idea would be to do the whole Face+CoT thing, and then filter for honesty on a bunch of examples, and then re-train a whole new system (well, you can start from the same GPT) only on those honest examples.

abramdemski Nov 16, 2024, 3:07 AM
LW: 2 AF: 2
0
AF
in reply to: Noosphere89’s comment on: o1 is a bad idea
Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won’t work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is.
Ah, I wasn’t thinking “sparse” here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).
I still think o1 is moving towards chess on this spectrum.

abramdemski Nov 16, 2024, 3:01 AM
LW: 2 AF: 2
0
AF
in reply to: Daniel Kokotajlo’s comment on: 5 ways to improve CoT faithfulness
Async is fine with me. Please feel encouraged to start it.

abramdemski Nov 16, 2024, 2:55 AM
LW: 2 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: 5 ways to improve CoT faithfulness
My current epistemic state is pretty “I’m not sure at all, it’s a complicated subject” fwiw. I’ll try to generate some responses anyway.
make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)
I thought the point was that we wouldn’t train against the CoT basically at all with the baseline proposal? (Similar to o1.)
Sure. Therefore, you can’t check for deceptive behavior using the CoT when you’re trying to train against deceptive behavior. Therefore, Hubinger et al’s conclusions about how deceptive behavior makes the system harder to train still hold even if you sorta get a transparency tool out of this.
You can still inspect the underlying CoT to know if you have big issues.
If other companies go in this direction like OpenAI has, then they’ll probably hide the CoT from the end-users for similar reasons, so you can’t say that the upside is that you, the end user, get to see whether the model is deliberately deceiving you with respect to specific questions.
So you probably mean the upside is that the companies can see this info in the CoT, to check whether models are generally behaving in a deceptive way (similar to what we saw in the o1 report).
Sure, but you’re making a commitment not to use this in training, so what are you doing with this information? If the model is deceptive, you just throw it away (or, more realistically, make excuses about how the deception you observed isn’t so bad). But your version of the proposal deliberately trains the model to be deceptive. So yeah, you can look, but what’s the point? You’re checking whether a coin you rigged to land on heads in fact landed on heads.
(My guess is that you have this concern with the baseline protocol and thus don’t like moving further in this direction.)
Yeah, you’re right about that. But there is a difference. I can think of the original proposal as a “canary in a coal mine”: you hope that the model doesn’t learn deceptive behavior (you plan to chuck it out if you detect any), but you still follow Daniel’s division of labor & refuse to do any alignment training on the CoT so that you have a chance of detecting deception if it occurs.
On the other hand, your version of Daniel’s protocol loses this property. Metaphorically, it’s like killing the canary before you go into the coal mine. If the canary was supposed to be an early warning to get out of there, you’ve completely defeated the point, and for what? (This is the part I’m unclear about, and why I asked my original question to you, rather than Daniel.)
Deceptive behavior is hard to remove once it creeps in. This is nasty stuff. A technique for getting deceptive behavior to reveal itself in the CoT would be wonderfully useful if you could then use that in the training signal to remove the deceptive behavior, but the proposal here relies on not doing that.
Basically, if there is a basin of attraction for corrigibility, it seems to me that deceptive behavior is the opposite of that. It puts you outside the basin of attraction. It means you’re on-coarse for Evan Hubinger’s story about how malign inner optimizers develop. It means you’re on-coarse for the Bad Ending rather than the Good Ending. And for what?
My take (very much siding with Evan Hubinger here) is that the target should be keeping well away from deceptive behavior throughout training, so that there’s never that kernel of skill that could grow into a malign inner optimizer.

abramdemski Nov 15, 2024, 11:30 PM
LW: 2 AF: 2
0
AF
in reply to: Daniel Kokotajlo’s comment on: 5 ways to improve CoT faithfulness
Would you be interested in having a (perhaps brief) LW Dialogue about it where you start with a basic intro to your shoggoth/mask division-of-labor, and I then present my critique?

abramdemski Nov 15, 2024, 11:28 PM
LW: 2 AF: 2
0
AF
in reply to: namidaka’s comment on: o1 is a bad idea
My point here is that at the capability level of GPT4, this distinction isn’t very important. There’s no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT4 isn’t cleverly scheming. It is merely human-level at deception, and doesn’t pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-human convincingness. For most queries (it seems plausible to me) it isn’t even adequately coherent to make a crisp distinction between whether it’s honestly trying to answer the question vs deceptively trying to make an answer look good; at its level of capability, it’s mostly the same thing one way or the other. The exceptions to this “mostly” aren’t strategic enough that we expect them to route around obstacles cleverly.
It isn’t much, but it is more than I naively expected.

abramdemski Nov 15, 2024, 11:07 PM
LW: 6 AF: 5
2
AF
in reply to: Noosphere89’s comment on: o1 is a bad idea
I think the crux is I think that the important parts of of LLMs re safety isn’t their safety properties specifically, but rather the evidence they give to what alignment-relevant properties future AIs have
[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]
But what lesson do you think you can generalize, and why do you think you can generalize that?
I think this is a crux, in that I don’t buy o1 as progressing to a regime where we lose so much dense feedback that it’s alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.
So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.
Similarly, playing text-based video games, with the sparse feedback given for winning.
Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.
Etc.
You think these sorts of things just won’t work well enough to be relevant?

abramdemski

Why Don’t We Just… Shog­goth+Face+Para­phraser?

Why Don’t We Just… Shoggoth+Face+Paraphraser?