[Question] What alignment-related concepts should be better known in the broader ML community?

Lauro Langosco 9 Dec 2021 20:44 UTC

6 points

We want to work towards a world in which the alignment problem is a mainstream concern among ML researchers. An important part of this is popularizing alignment-related concepts within the ML community. Here are a few recent examples:

Reward hacking / misspecification (blog post)
Convergent instrumental goals (paper)
Objective robustness (paper)
Assistance (paper)

(I’m sure this list is missing many examples; let me know if there are any in particular I should include).

Meanwhile, there are many other things that alignment researchers have been thinking about that are not well known within the ML community. Which concepts would you most want to be more widely known / understood?

4 comments

jbkjr 9 Dec 2021 20:02 UTC
14 points
This is kind of vague, but I have this sense that almost everybody doing RL and related research takes the notion of “agent” for granted, as if it’s some metaphysical primitive*, as opposed to being a (very) leaky abstraction that exists in the world models of humans. But I don’t think the average alignment researcher has much better intuitions about agency, either, to be honest, even though some spend time thinking about things like embedded agency. It’s hard to think meaningfully about the illusoriness of the Cartesian boundary when you still live 99% of your life and think 99% of your thoughts as if you were a Cartesian agent, fully “in control” of your choices, thoughts, and actions.

(*Not that “agent” couldn’t, in fact, be a metaphysical primitive, just that such “agents” are hardly “agents” in the way most people consider humans to “be agents” [and, equally importantly, other things, like thermostats and quarks, to “not be agents”].)
Daniel Kokotajlo 9 Dec 2021 22:50 UTC
5 points
Saints vs. Schemers vs. Sycophants as different kinds of trained models / policies we might get. (I’m drawing from Ajeya’s post here).
There are more academic-sounding terms for these concepts too; I forget where, but probably in Paul’s posts about “the intended model” vs. “the instrumental policy” and the like.
Daniel Kokotajlo 9 Dec 2021 22:48 UTC
5 points
Inner vs. outer alignment, mesa-optimizers
Charlie Steiner 13 Dec 2021 19:34 UTC
2 points
Human values exist within human-scale models of the world.
