You asked: "Can you explain how you're defining outer alignment and value specification?"

I'm using this definition, provided by Hubinger et al.:

"Outer Alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.[2]"

Evan Hubinger provided clarification about this definition in his post "Clarifying inner alignment terminology":

"[T]he outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer's intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function."
I deliberately avoided using the term "outer alignment" in the post because I wanted to be more precise and not get into a debate about whether the value specification problem matches this exact definition. (I think the definitions are subtly different, but the difference is not very relevant for the purposes of the post.) Overall, I think the two problems are closely associated, and solving one gets you a long way towards solving the other. In the post, I defined the value identification/specification problem as:
"I am mainly talking about the problem of how to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans."
This was based on the Arbital entry for the value identification problem, which defines it as a "subproblem category of value alignment which deals with pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes."

I should note that I used this entry as the primary definition in the post because I was not able to find a clean definition of this problem anywhere else.
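To make the type of object I have in mind concrete, here is a minimal, purely hypothetical sketch of the shape such an "explicit function" would take. The function name and the toy keyword-based scorer are illustrative placeholders of my own, not anything proposed in the post; a serious attempt would presumably consult something like GPT-4 or human raters rather than matching keywords.

```python
# Hypothetical illustration only: a stand-in for the "explicit function" described
# above, mapping a description of an outcome to a value judgement. A serious attempt
# would presumably query a capable model (e.g., GPT-4) or human raters rather than
# matching keywords; the toy scorer exists only so the sketch runs on its own.

def value_judgement(outcome_description: str) -> float:
    """Return a rough score in [0, 1] for how valuable an outcome is judged to be.

    Solving the value specification problem would mean producing a function of this
    shape whose judgements track those of ordinary humans with high fidelity.
    """
    text = outcome_description.lower()
    good_markers = ("flourishing", "cured", "consensual", "safe")
    bad_markers = ("extinction", "suffering", "deception", "coercion")
    score = 0.5
    score += 0.1 * sum(marker in text for marker in good_markers)
    score -= 0.1 * sum(marker in text for marker in bad_markers)
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    print(value_judgement("Diseases are cured and people report flourishing, safe lives."))
    print(value_judgement("Humanity faces extinction after widespread deception and coercion."))
```

The only point of the sketch is that the value specification problem, as I am using the term, asks for a concrete, queryable function of this shape whose judgements fairly accurately reflect ordinary human judgements.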
I'd appreciate it if you clarified whether you are saying:

1. That my definition of the value specification problem is different from how MIRI would have defined it in, say, 2017. (You can use Nate Soares' 2016 paper or MIRI's 2017 technical agenda to make your point.)

2. That my definition matches how MIRI used the term, but the value specification problem remains very hard and unsolved, and GPT-4 is not even a partial solution to this problem.

3. That my definition matches how MIRI used the term, and we appear to be close to a solution to the problem, but a solution is not sufficient to solve the hard bits of the outer alignment problem.
I’m more sympathetic to (3) than (2), and more sympathetic to (2) than (1), roughly speaking.