It does, however, get at one directionally correct insight: how tight and comprehensive the coupling between the thing you’re optimizing against and the thing you care about has to be. You have to leave no gaps or channels for optimization pressure to lead you to obscure dark regions of cognition. That’s why optimizing against deceptive thoughts doesn’t work: you aren’t applying as strong a binding between what you want and what you’re optimizing for as you should be; what you care about is actually not being deceived, not just producing a model that isn’t trying to deceive you.
[ . . . ]
Deep deception doesn’t occur because the model is incapable of realizing that humans are being deceived. It just doesn’t think about it. It’s still well within the model’s capabilities to understand the consequences of its actions—the requisite cognitive work had to have been within the capabilities of the composite system, after all.
[emphases mine]
This is something that, to this day, I don’t think ~anyone monitoring for “deceptive behavior” in LLMs or large agents comprehends.
I’m confused by this part:
Intuitively, this involves two components: the ability to robustly steer high-level structures like objectives, and something good to target at.
[ . . . ]
Naively, one could just train on the loss function “How much do we like the objectives of this system?”; something like RLHF but with oversight on the true internal representations of important properties. Put another way, it bridges a large part of the gap (and, I think, the important parts) in cases where models understand something we don’t.
[ . . . ]
So, high-level interpretability mostly focuses on the part of the problem that looks like “there are these pretty messy high-level concepts we have in our head that seem very relevant to deciding whether we like this system or not, and those are the things we want to understand and control”. To solve it, figure out the general[6] structure of objectives (for example) in the type of systems we care about, gaining the ability to just search for those structures directly within those systems, understand what a particular system’s objective corresponds to in our ontology from that general structure, and then plug that into a loss function or other things in the vein
[emphases mine]
Imagine if Ofria, in order to better ‘align’ his simulated organisms with his intended goal of “don’t reproduce too fast”, had run a whole bunch of simulations, noted a bunch of high-level features that correlated with slow reproduction rate, and then run another round with the ‘debugger’ or test environment: this time, regularly arresting the simulation and checking not for an overly fast reproduction rate itself, but for the presence of the desirable high-level features. How do you think that would have gone?
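To make that hypothetical concrete, here is roughly the shape such a screening round would take. This is a toy sketch of my own, in Python, with every name and number invented (real Avida is nothing like this under the hood); the structural point is just that the check runs only at inspection time and looks only at correlates of the thing we care about, never at the reproduction rate itself.

```python
import random

# Toy rendition of the hypothetical "screen on proxy features" round.
# Nothing below comes from Avida; every name and number is invented.

def make_organism():
    # genome_len and copy_loops stand in for the "high-level features";
    # replication_rate is the quantity Ofria actually cared about.
    return {
        "genome_len": random.randint(40, 120),
        "copy_loops": random.randint(1, 6),
        "replication_rate": random.uniform(0.1, 3.0),
    }

def passes_feature_screen(org):
    # Check only features that correlated with slow reproduction in earlier
    # runs, never the reproduction rate itself.
    return org["genome_len"] > 80 and org["copy_loops"] < 3

def debugger_round(population, steps=500, screen_every=50, capacity=500):
    """Periodically arrest the simulation and cull anything failing the proxy screen."""
    for step in range(steps):
        # one toy tick: organisms replicate in proportion to their rate,
        # with a little mutation on the child's rate
        offspring = []
        for org in population:
            if random.random() < 0.02 * org["replication_rate"]:
                child = dict(org)
                child["replication_rate"] *= random.uniform(0.9, 1.2)
                offspring.append(child)
        population.extend(offspring)
        if len(population) > capacity:
            population[:] = random.sample(population, capacity)
        if step % screen_every == 0:
            population[:] = [o for o in population if passes_feature_screen(o)]
    return population

population = [make_organism() for _ in range(200)]
survivors = debugger_round(population)
print(len(survivors), "survivors; fastest replication rate:",
      max((o["replication_rate"] for o in survivors), default=None))
```

Under this setup, selection inside the environment still rewards whatever replicates fastest; the screen only ever prunes organisms that fail the proxies, so nothing pushes back on reproduction rate itself.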
There’s a way Ofria could have solved, rather than unsatisfyingly meliorating, his original problem.
--Actually, in the middle of writing this, I just went and re-read the Muehlhauser post, and it turns out that Ofria did, in the end, adjust his simulation to implement the solution I’d been typing up.
I think you, Jose, misunderstood what Ofria did in the end as being something hackier and less complete than it was.
I’ll put the solution between spoiler tags so readers can have a try at thinking of the answer before they read it.
“In the end, Ofria eventually found a successful fix, by tracking organisms’ replication rates along their lineage, and eliminating any organism (in real time) that would have otherwise out-replicated its ancestors.”
[ Link to Muehlhauser post ]
i.e., Ofria set up the normal [non-test] environment itself to penalize improvements in reproduction rate. Without a more permissive “real-world” environment for the organisms to run back to [where they could do whatever they willed, no matter what they’d done in the test environment], there was nothing left for them to fake.
I think you [Jose] misunderstood this as Ofria personally monitoring the simulation and interrupting it manually; in reality, he simply altered the virtual environment to kill off faster-replicating organisms automatically.
[emphases mine]
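For contrast, here is the shape of the fix as the quote describes it, again as a toy Python sketch of my own with invented names rather than anything from Avida’s actual implementation: each organism carries the best replication rate recorded along its lineage, and the normal environment itself removes, in real time, anything that out-replicates its ancestors.

```python
import random

# Illustrative-only sketch of the lineage rule described in the quote; names
# and numbers are mine, not Avida's. Each organism carries the best replication
# rate recorded along its ancestry, and the normal environment itself removes
# anything that out-replicates its ancestors. There is no separate test harness.

def make_organism(rate, ancestral_best=None):
    return {
        "replication_rate": rate,
        # best replication rate recorded anywhere along this lineage
        "ancestral_best": rate if ancestral_best is None else ancestral_best,
    }

def reproduce(parent):
    # mutation can push the child's rate up or down
    child_rate = parent["replication_rate"] * random.uniform(0.9, 1.2)
    lineage_best = max(parent["ancestral_best"], parent["replication_rate"])
    return make_organism(child_rate, ancestral_best=lineage_best)

def run(population, steps=500, capacity=500):
    for _ in range(steps):
        offspring = [reproduce(org) for org in population
                     if random.random() < 0.02 * org["replication_rate"]]
        population.extend(offspring)
        # The rule lives in the live environment and is applied every tick,
        # to every organism: out-replicate your ancestors and you are removed.
        population[:] = [o for o in population
                         if o["replication_rate"] <= o["ancestral_best"]]
        if len(population) > capacity:
            population[:] = random.sample(population, capacity)
    return population

population = [make_organism(random.uniform(0.5, 1.5)) for _ in range(200)]
survivors = run(population)
print(len(survivors), "survivors; fastest replication rate:",
      max((o["replication_rate"] for o in survivors), default=None))
```

Unlike the screening sketch above, nothing here depends on when or whether anyone inspects the system: the rule is stated in terms of the quantity Ofria actually cared about, and it is enforced inside the one environment the organisms live in.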