I like your breakdown of A-E, let’s use that going forward.
It sounds like your view is: for “dumb” AIs that aren’t good at reasoning, it’s more likely that they’ll just do B “directly” rather than do E-->D-->C-->B, because the latter involves a lot of tricky reasoning which they are unable to do. But as we scale up our AIs and make them smarter, eventually the E-->D-->C-->B route becomes more likely than doing B “directly”, because it works for approximately any long-term consequence (e.g. paperclips) and thus probably works for some extremely simple/easy-to-have goals, whereas doing B directly is an arbitrary/complex/specific goal and thus unlikely.
(1) What I was getting at with the “Steps for Dummies” example is that maybe the kind of reasoning required is actually pretty basic/simple/easy and we are already in the regime where E-->D-->C-->B dominates doing B directly. One way it could be easy is if the training data spells it out nicely for the AI. I’d be interested to hear more about why you are confident that we aren’t in this regime yet. Relatedly, what sorts of things would you expect to see AIs doing that would convince you that maybe we are in this regime?
(2) What about A? Doesn’t the same argument for why E-->D-->C-->B eventually dominates B also show that it eventually dominates A?
I think C->B is already quite hard for language models; maybe it’s possible, but it’s still very clearly hard enough that it overwhelms the possible simplicity benefits of E over B (before even adding in the hardness of the E->D->C steps). I would update my view a lot if I saw language models doing anything even a little bit like the C->B link.
I agree that eventually A loses to any of {B, C, D, E}. I’m not sure if E is harder than B to fix, but at any rate my starting point is working on the reasons that A loses to any of the alternatives (e.g. here, here), and then after handling that we can talk about whether there are remaining reasons that E in particular is hard. (My tentative best guess is that there won’t be: I started out thinking about E vs A and ended up concluding that the examples I was already thinking about seemed like the core obstructions to making that work.)
In the meantime, getting empirical evidence about other ways that you don’t learn A is also relevant. (Since those would also ultimately lead to deceptive alignment, even if you learned some crappy A’ rather than either A or B.)
Thanks!