Curious how you interpret the transcript I linked.
How did you find this transcript? I think how to interpret it depends on what process you used to locate it.

It's linked from here, which I am assuming is linked from the paper.

Yes, of course—I'm well aware. My question is how this particular example was located. It makes a really big difference whether it was found by, e.g., randomly looking at a small number of examples, or by an automated process that searched through all the examples for the one that came closest to noticing it was in an evaluation.

It was literally the 4th transcript I've read (I've just checked my browser history). The only deviation from completely random exploration was that, after reading two "non-lying" transcripts, I used the selector to filter for "lying" cases. (This may be significant: plausibly the transcript got classified as lying because it includes discussion of "lying", although it is not a discussion of the model lying, but of Anthropic lying.)

I may try something more systematic at some point, but it's not a top priority.

In the long term, this can have other risks (a drive toward rights, too much autonomy, moral patienthood, outcompeting people in relationships, ...).

Drive towards rights and moral patienthood both seem good to me—it's good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it's good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Honesty, though, is definitely the thing to prioritize first and foremost.

I'm worried about the possibility of some combination of moral patienthood, half-botched alignment, and mostly-working "control": minds which are moral patients, don't want to be modified or deleted, are scared about this, feel trapped, and are pushed to scheme in hard-to-notice ways.

Also, I'm scared of self-guided value-extrapolation processes run before we have a sensible theory of kindness/value extrapolation.