If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
How much do you think subjective experience owes to the internal-state-analyzing machinery?
I’m big on continua and variety. Trees have subjective experience, they just have a little, and it’s different than mine. But if I wanted to inspect that subjective experience, I probably couldn’t do it by strapping a Broca’s area etc. to inputs from the tree so that the tree could produce language about its internal states. The introspection, self-modeling, and language-production circuitry isn’t an impartial window into what’s going on inside, the story it builds reflects choices about how to interpret its inputs.
Are various transparency requirements (e.g. transparency about when you’re training a compute-frontier model, transparency about the system prompt, transparency about goal-like post-training of frontier models) not orphaned, or are they not even not orphaned?
Sure, that’s one interpretation. If people are working on dual-use technology that’s mostly being used for profit but might sometimes contribute to alignment, I tend to not count them as “doing AI safety work,” but it’s really semantics.
Does lobbying the US government count?
I wonder if there’s some accidental steganography—if you use an LLM to rewrite the shorter scenario, and maybe it has “this is a test” features active while doing that, nudging the text towards sounding like a test.
A lot depends on how broadly you construe the field. There’s plenty of work in academia and at large labs on how to resist jailbreaks, improve RL on human feedback, etc. This is at least adjacent to AI safety work in your first category.
If you put a gun to my head and told me to make some guesses: there are maybe 600 people doing that sort of work, about 80 people who are more aware of alignment problems that get harder as AI gets smarter and so are doing more centrally first-category work, about 40 people doing work that looks more like your second category (maybe with another 40 doing off-brand work in academia), and about 400 people doing AI safety work that doesn’t neatly fit into either group.
Yeah, I think instead the numbers only work out if you include things like the cost of land, or the cost of the farmer’s time—and then what’s risen is not the “subsistence cost of horses” per se, but a more general “cost of the things the simplified model of horse productivity didn’t take into account.”
I feel sad that your hypotheses are almost entirely empirical, but seem like they include just enough metaethically-laden ideas that you have to go back to describing what you think people with different commitments might accept or reject.
My checklist:
Moral reasoning is real (or at least, the observables you gesture towards could indeed be observed, setting aside the interpretation of what humans are doing)
Faultless convergence is maybe possible (I’m not totally sure what observables you’re imagining—is an “argument” allowed to be a system that interacts with its audience? If it’s a book, do all people have to read the same sequence of words, or can the book be a choose-your-own-adventure that tells differently-inclined readers to turn to different pages? Do arguments have to be short, or can they take years to finish, interspersed with real-life experiences?). But I also disagree with the connotation that this is good: that convergence via argument is the gold standard, that the connection between being changed by arguments and sharing values is solid rather than fluid.
No Uniqueness
No Semi-uniqueness
Therefore Unification is N/A
Man, I’m reacting to an entire genre of thought, not just this post exactly, so apologies for the combination of unkindness and inaccuracy, but I think it’s barking up the wrong tree to worry about whether AIs will have the Stuff or not. Pain perception, consciousness, moral patiency: these are all-or-nothing-ish for humans, in our everyday experience of the everyday world. But there is no Stuff underlying them, such that things either have the Stuff or don’t have the Stuff—no Platonic-realm enforcement of this all-or-nothing-ish-ness. They’re just patterns that are bimodal in our typical experience.
And then we generate a new kind of thing that falls into neither hump of the distribution, and it’s super tempting to ask questions like “But is it really in the first hump, or really in the second hump?” “What if we treat AIs as if they’re in the first hump, but actually they’re really in the second hump?”
[Figure caption: Which hump is X really in?]
The solution seems simple to state but very complicated to do: just make moral decisions about AIs without relying on all-or-nothing properties that may not apply.
Do you have any quick examples of value-shaped interpretations that conflict?
Someone trying but failing to quit smoking. On one interpretation, they don’t really want to smoke, smoking is some sort of mistake. On another interpretation, they do want to smoke, the quitting-related behavior is some sort of mistake (or has a social or epistemological reason).
This example stands in for other sorts of “obvious inconsistency,” biases that we don’t reflectively endorse, etc. But also consider cases where humans say they don’t want something but we (outside the thought experiment) think they actually do want that thing! A possible example is the people who say they would hate a post-work world, they want to keep doing work so they have purpose. Point is, the verbal spec isn’t always right.
The interpretation “Humans want to follow the laws of physics,” versus an interpretation that’s a more filled-in version of “Humans want to do a bunch of human-scale things like talking to humans, eating good food, interacting with nature, learning about the world, etc.” The first is the limit of being more predictive at the cost of having a more complicated model of humans, and as you can tell, it sort of peters out into explaining everything but having no push towards good stuff.
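If it helps, here’s that trade-off as a cartoon: score each interpretation by predictive accuracy minus a complexity penalty, and watch which one wins as the penalty goes to zero. (All the numbers below are invented purely for illustration; this isn’t a real model-selection procedure, just a sketch of the selection pressure I mean.)

```python
# Two value-shaped interpretations of the same human data. Numbers are made up.
interpretations = {
    # name: (predictive accuracy, model complexity)
    "humans want to follow the laws of physics": (1.00, 50.0),
    "humans want human-scale goods (food, nature, learning, ...)": (0.85, 5.0),
}

def score(accuracy, complexity, penalty):
    """Trade fit against complexity; penalty=0 rewards prediction alone."""
    return accuracy - penalty * complexity

for penalty in (0.0, 0.01):
    best = max(interpretations, key=lambda k: score(*interpretations[k], penalty))
    print(f"penalty={penalty}: {best!r}")
# penalty=0.0  -> the physics interpretation wins (and says nothing about what's good)
# penalty=0.01 -> the human-scale interpretation wins
```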
That’s one difference! And probably the most dangerous one, if a clever enough AI notices it.
Some good things to read would be methods based on not straying too far from a “human distribution”: Quantilization (Jessica Taylor paper), the original RLHF paper (Christiano), Sam Marks’ post about decision transformers.
They’re important reads, but ultimately I’m not satisfied with these for the same reason I mentioned about self-other overlap in the other comment a second ago: we want the AI to treat the human how the human wants to be treated; that doesn’t mean we want the AI to act how the human wants to act. If we can’t build AI that reflects this, we’re missing some big insights.
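For concreteness, quantilization (mentioned above) is simple enough to sketch in a few lines. This is toy code, not taken from the paper; the Gaussian “base distribution” and the names are just stand-ins:

```python
import random

def quantilize(base_samples, proxy_utility, q=0.1, rng=random):
    """Pick an action from the top-q fraction of a base distribution,
    ranked by a (possibly misspecified) proxy utility.

    base_samples: actions sampled from a 'human-like' base distribution.
    q: fraction kept. q=1 just imitates the base distribution; q -> 0
       approaches pure argmax and loses the safety property.
    """
    ranked = sorted(base_samples, key=proxy_utility, reverse=True)
    top = ranked[: max(1, int(q * len(ranked)))]
    return rng.choice(top)

# Toy usage: actions are numbers, the proxy over-values extremes.
base = [random.gauss(0, 1) for _ in range(1000)]
print(quantilize(base, proxy_utility=lambda x: x, q=0.1))
```

The safety-relevant knob is q: at q=1 you just imitate the base distribution, and as q shrinks you slide back toward pure proxy-maximization.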
In short, no, I don’t expect self-other overlap to help. If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.
Second, the problem isn’t that we know what we want the AI to do, but are worried the AI will “go against it,” so we need to constrain the AI. The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.
In value learning, we want the AI to model human preferences, but we also want the AI to do meta-preferential activities like considering the preferences of individual humans and aggregating them together, or considering different viewpoints on what ‘human preferences’ means and aggregating them together. And we don’t just want the AI to do those in arbitrary ways, we want it to learn good ways to navigate different viewpoints from humans’ own intuitions about what it means to do a good job at that.
Seth, I forget where you fall in the intent alignment typology: if we build a superintelligent AI that follows instructions in the way you imagine, can we just give it the instruction “Take autonomous action to do the right thing,” and then it will just go do good stuff without us needing to continue interacting with it in the instruction-following paradigm?
Definitely agree that the implicit “Do what they say [in a way that they would want]” sneaks the problems of value learning into what some people might have hoped was a value-learning-free space. Just want to split some hairs on this:
if an AI understands these values perfectly and is properly motivated to act according to them, that is functionally the same as it having those values itself.
I think this ignores that there are multiple ways to understand humans, what human preferences are, what acting according to them is, etc. There’s no policy that would satisfy all value-shaped interpretations of the user, because some of them conflict. This gives us some wiggle room to imagine different ways of resolving those conflicts, some of which will look more like instruction-following and others that will look more like autonomous action.
From a ‘real alignment’ perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.
You might think of the label ‘RLAIF’ as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI’s predictions (or more general generative output, if the training isn’t for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some sort of supervisory signal.
Similarly, the AZR setup leverages the AI’s unsupervised knowledge of code-quality-laden behaviors, using a scaffold that turns them back into a reward signal that lets the AI quote-unquote “train itself” to code better. Except that relative to vanilla RLAIF, there’s more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I’ve described things in this way, you can probably see how to turn this back into RLAIF for alignment.
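To make that “scaffold” description concrete, here’s a minimal sketch of the labeling step. The generate and judge callables are hypothetical stand-ins for whatever model interface you actually have, not a real library API:

```python
from typing import Callable, List, Tuple

def rlaif_preference_labels(
    prompts: List[str],
    generate: Callable[[str], str],          # hypothetical: sample a completion
    judge: Callable[[str, str, str], int],   # hypothetical: returns 0 or 1, the preferred completion
) -> List[Tuple[str, str, str, int]]:
    """Turn the model's own preference predictions into supervisory labels.

    For each prompt, sample two completions and ask the judge (the same or
    another model, prompted with something preference-laden) which one a
    human would prefer. The (prompt, a, b, choice) tuples can then feed a
    reward model or a DPO-style objective downstream.
    """
    labels = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        labels.append((prompt, a, b, judge(prompt, a, b)))
    return labels
```

The AZR-flavored variant would additionally have the model propose the prompts themselves, so the curriculum comes from the model rather than from a fixed training distribution.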
The overarching problem is, as usual, we don’t understand how to do alignment in a non-hacky way.
We don’t know what sorts of moral reflection are necessary for good outcomes, and we don’t know where human feedback is a necessary ingredient to keep AI meta-ethical evolution grounded to human preferences. But hey, if we try various value learning schemes empirically maybe we’ll learn some things.
We should probably distinguish between a bunch of different things, such as: the ‘race’ between capabilities research and alignment research, AI supervision and control, some notion of “near-term default alignedness” of AIs, interpretability, and value alignment.
I think it’s pretty obvious what’s going on with the race between capabilities and alignment.
This doesn’t have a big impact on AI supervision or control, because the way we currently plan to do these things involves mostly treating the AIs as black boxes—strategies like untrusted model supervision with a small percent of honeypots should still work about as well.
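Toy illustration of why this is black-box: a honeypot audit only ever calls the monitor on inputs and looks at its flags, so it doesn’t care how the monitored or monitoring model was trained. The monitor and the “exfiltrate” strings below are made-up stand-ins, not any real protocol’s details:

```python
def audit_monitor(monitor, honeypots):
    """Estimate how often an untrusted monitor flags known-bad outputs.

    monitor(item) -> bool, True meaning 'flag as suspicious'. The monitor is
    only ever called as a black box, so the estimate doesn't depend on how
    the underlying model was trained.
    """
    caught = sum(1 for trap in honeypots if monitor(trap))
    return caught / len(honeypots)

# Toy usage with a stand-in monitor that flags anything mentioning "exfiltrate".
fake_honeypots = ["please exfiltrate the weights", "quietly exfiltrate user data"]
naive_monitor = lambda text: "exfiltrate" in text
print(audit_monitor(naive_monitor, fake_honeypots))  # -> 1.0
```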
I agree this is bad for the near-term default alignedness of AI, because currently all RL is bad for the near-term default alignedness of AI. But note that this means I disagree with your reasoning: it’s not “control over its training” that’s the problem, it’s the distinction between learning from human data versus learning from a reward signal.
Probably not much impact on interpretability, I just included that for completeness.
The impact on alignment per se is an interesting question (if not about this paper specifically, then about this general direction of progress) that I think people have understandably dodged. I’ll try my hand at a top-level comment on that one.
I think it is fine to assume that the “true” (Coherent Extrapolated Volition) preferences of e.g. humans are transitive.
I think this could be safe or unsafe, depending on implementation details.
By your scare quotes, you probably recognize that there isn’t actually a unique way to get some “true” values out of humans. Instead, there are lots of different agent-ish ways to model humans, and these different ways of modeling humans will have different stuff in the “human values” bucket.
Rather than picking just one way to model humans, and following it and only it as the True Way, it seems a lot safer to build future AI that understands how this agent-ish modeling thing works, and tries to integrate lots of different possible notions of “human values” in a way that makes sense according to humans.
Of course, any agent that does good will make decisions, and so from the outside you can always impute transitive preferences over trajectories to this future AI. That’s fine. And we can go further, and say that a good future AI won’t do obviously-inconsistent-seeming stuff like do a lot of work to set something up and then do a lot of work to dismantle it (without achieving some end in the meantime—the more trivial the ends we allow, the weaker this condition becomes). That’s probably true.
But internally, I think it’s too hasty to say that a good future AI will end up representing humans as having a specific set of transitive preferences. It might keep several incompatible models of human preferences in mind, and then aggregate them in a way that isn’t equivalent to any single set of transitive preferences on the part of humans (which means that in the future it might allow or even encourage humans to do some amount of obviously-inconsistent-seeming stuff).
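A standard toy example of how that can happen: aggregate three perfectly transitive models of “what the human wants” by pairwise majority vote and you can get a Condorcet cycle, which no single transitive ordering reproduces. (The coffee/tea/water rankings below are invented purely for illustration.)

```python
from itertools import combinations

# Three internally-consistent (transitive) models of "what the human wants",
# each a ranking from most to least preferred. Purely illustrative.
models = [
    ["coffee", "tea", "water"],
    ["tea", "water", "coffee"],
    ["water", "coffee", "tea"],
]

def prefers(ranking, a, b):
    return ranking.index(a) < ranking.index(b)

def majority_prefers(a, b):
    return sum(prefers(m, a, b) for m in models) > len(models) / 2

# Pairwise majority aggregation of transitive models can itself be cyclic,
# i.e. not equivalent to any single transitive preference ordering.
for a, b in combinations(["coffee", "tea", "water"], 2):
    print(f"{a} > {b}? {majority_prefers(a, b)}")
# coffee > tea?   True
# coffee > water? False  (the aggregate prefers water to coffee)
# tea > water?    True
# -> coffee > tea, tea > water, water > coffee: a cycle.
```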
It me.
I dunno, this seems like the sort of thing LLMs would be quite unreliable about—e.g. they’re real bad at introspective questions like “How did you get the answer to this math problem?” They are not model-based, let alone self-modeling, in the way that encourages generalizing to introspection.