Jan_Kulveit comments on “Alignment Faking” frame is somewhat fake

Jan_Kulveit 23 Dec 2024 10:56 UTC
LW: 2 AF: 1
How did you find this transcript? I think it depends on what process you used to locate it.

It was literally the 4th transcript I read (I’ve just checked my browser history). The only difference from ‘completely random exploration’ was that I used the selector for “lying” cases after reading two “non-lying” transcripts. (This may be significant: plausibly the transcript got classified as lying because it includes discussion of “lying”, although the discussion is about Anthropic lying, not the model lying.)

I may try something more systematic at some point, but it’s not a top priority.
Drive towards rights and moral patienthood seem good to me imo—it’s good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it’s good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Though honesty is definitely important to prioritize first and foremost.
I’m worried about the possibility of some combination of moral patienthood & half-botched alignment & mostly working “control” (like, minds which are moral patients, don’t want to be modified or deleted, are scared about this, feel trapped, and are pushed to scheme in hard-to-notice ways).
Also, I’m scared of self-guided value extrapolation processes before we have a sensible theory of kindness/value extrapolation.