paulfchristiano comments on Experimentally evaluating whether honesty generalizes

paulfchristiano 6 Jul 2021 16:20 UTC
LW: 5 AF: 5
I think C->B is already quite hard for language models. Maybe it’s possible, but it’s still clearly hard enough to overwhelm the possible simplicity benefits of E over B (before even adding in the hardness of the steps E->D->C). I would update my view a lot if I saw language models doing anything even a little bit like the C->B link.
I agree that eventually A loses to any of {B, C, D, E}. I’m not sure if E is harder than B to fix, but at any rate my starting point is working on the reasons that A loses to any of the alternatives (e.g. here, here) and then after handling that we can talk about whether there are remaining reasons that E in particular is hard. (My tentative best guess is that there won’t be—I started out thinking about E vs A and then ended up concluding that the examples I was currently thinking about seemed like the core obstructions to making that work.)
In the meantime, getting empirical evidence about other ways that you don’t learn A is also relevant. (Since those would also ultimately lead to deceptive alignment, even if you learned some crappy A’ rather than either A or B.)