I think it’s false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it’s false is mostly that I haven’t seen a claim like that made anywhere, including in the posts you cite.
I agree lots of the responses elide the part where you emphasize that GPT-4 doesn’t just understand human values, but is also “willing” to answer questions somewhat honestly. TBH I don’t understand why that’s an important part of the picture for you, and I can see why some responses would just see the “GPT-4 understands human values” part as the important bit (I made that mistake too on my first reading, before I went back and re-read).
It seems to me that trying to explain the original motivations for posts like Hidden Complexity of Wishes is a good attempt at resolving this discussion, and it looks to me as if the responses from MIRI are trying to do that, which is part of why I wanted to disagree with the claim that the responses are missing the point / not engaging productively.
I think it’s false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it’s false is mostly that I haven’t seen a claim like that made anywhere, including in the posts you cite.
I don’t think it’s necessary for them to have made that exact claim. The point is that they said value specification would be hard.
If you solve value specification, then you’ve arguably solved a large part of the outer alignment problem. Then, you just need to build a function maximizer that allows you to robustly maximize the utility function that you’ve specified. [ETA: btw, I’m not saying the outer alignment problem has been fully solved already. I’m making a claim about progress, not about whether we’re completely finished.]
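For concreteness, here’s a minimal sketch of the decomposition I have in mind: a specified value function handed to a separate procedure that searches for actions which score well on it. The names (`value_of_outcome`, `propose_actions`, `predict_outcome`) are hypothetical placeholders I’m making up for illustration, not anyone’s actual proposal.

```python
from typing import Callable, List

# Hypothetical type aliases, for illustration only.
Outcome = str   # a description of a possible outcome
Action = str    # a description of a possible action


def plan(
    value_of_outcome: Callable[[Outcome], float],   # the "specified" value function
    propose_actions: Callable[[], List[Action]],    # generates candidate actions
    predict_outcome: Callable[[Action], Outcome],   # a world model
) -> Action:
    """Pick the action whose predicted outcome the value function rates highest.

    If value specification were solved, `value_of_outcome` would be trustworthy;
    the remaining problem is getting a powerful system to actually optimize it,
    which this toy loop simply assumes away.
    """
    candidates = propose_actions()
    return max(candidates, key=lambda a: value_of_outcome(predict_outcome(a)))
```

The question is then which half of this split was historically claimed to be the hard part.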
I interpret MIRI as saying “but the hard part is building a function maximizer that robustly maximizes any utility function you specify”. While I agree that this represents their current view, I don’t think it was always their view. If you read the citations in the post carefully, I don’t think you’ll find that they support the idea that MIRI has always considered inner alignment to be the only hard part of the problem. I’m not claiming they never thought inner alignment was hard. But I am saying they thought value specification would be hard and an important part of the alignment problem.
I think the specification problem is still hard and unsolved. It looks like you’re using a different definition of ‘specification problem’ / ‘outer alignment’ than others, and this is causing confusion.
IMO all these terms are a bit fuzzy / hard to pin down, and so it makes sense that they’d lead to disagreement sometimes. The best way (afaict) to avoid this is to keep the terms grounded in “what would be useful for avoiding AGI doom?”. To me it looks like, on your definition, outer alignment is basically a trivial problem that doesn’t help alignment much.
More generally, I think this discussion would be more grounded / useful if you made more object-level claims about how value specification being solved (on your view) might be useful, rather than meta claims about what others were wrong about.
Can you explain how you’re defining outer alignment and value specification?
I’m using this definition, provided by Hubinger et al.
the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.
Evan Hubinger provided clarification about this definition in his post “Clarifying inner alignment terminology”,
Outer Alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.[2]
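One way to write that definition in symbols (the notation is mine, not from the post): let M*(r) be the set of models that perform optimally on the objective r in the limit of perfect training and infinite data. Then:

```latex
% Notation introduced here for illustration only; it does not appear in the quoted post.
% \mathcal{M}^*(r): the set of models optimal on r under perfect training and infinite data.
r \text{ is outer aligned} \iff \forall M \in \mathcal{M}^*(r),\ M \text{ is intent aligned.}
```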
I deliberately avoided using the term “outer alignment” in the post because I wanted to be more precise and not get into a debate about whether the value specification problem matches this exact definition. (I think the definitions are subtly different but the difference is not very relevant for the purpose of the post.) Overall, I think the two problems are closely associated and solving one gets you a long way towards solving the other. In the post, I defined the value identification/specification problem as,
I am mainly talking about the problem of how to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans.
This was based on the Arbital entry for the value identification problem, which was defined as a
subproblem category of value alignment which deals with pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes.
I should note that I used this entry as the primary definition in the post because I was not able to find a clean definition of this problem anywhere else.
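To make the definition above concrete, here is a minimal, purely illustrative sketch of what “writing an explicit function into a computer” could look like if that function is implemented by querying a GPT-4-class model as a judge. The wrapper, prompt, and parsing are assumptions I’m making for the example, not something from the post, and nothing here says anything about how such a function behaves under strong optimization pressure.

```python
from typing import Callable


def make_value_function(query_model: Callable[[str], str]) -> Callable[[str], float]:
    """Build a crude 'explicit value function' out of a chat model.

    `query_model` is assumed to take a prompt string and return the model's text
    reply (e.g., a thin wrapper around a GPT-4-class API); it is a placeholder,
    not a specific library call.
    """

    def value_of_outcome(outcome_description: str) -> float:
        prompt = (
            "On a scale from 0 (terrible) to 10 (excellent), how good would most "
            "ordinary people judge the following outcome to be? "
            "Reply with a single number.\n\n"
            f"Outcome: {outcome_description}"
        )
        reply = query_model(prompt)
        try:
            return float(reply.strip().split()[0])
        except (ValueError, IndexError):
            return 0.0  # fall back if the reply isn't a bare number

    return value_of_outcome
```

Whether a function like this counts as “high fidelity”, and whether it remains trustworthy when something powerful optimizes against it, is presumably where the real disagreement lies.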
I’d appreciate it if you clarified whether you are saying:
1. That my definition of the value specification problem is different from how MIRI would have defined it in, say, 2017. You can use Nate Soares’ 2016 paper or their 2017 technical agenda to make your point.
2. That my definition matches how MIRI used the term, but the value specification problem remains very hard and unsolved, and GPT-4 is not even a partial solution to this problem.
3. That my definition matches how MIRI used the term, and we appear to be close to a solution to the problem, but a solution to the problem is not sufficient to solve the hard bits of the outer alignment problem.
I’m more sympathetic to (3) than (2), and more sympathetic to (2) than (1), roughly speaking.