Thanks for this comment. I think this is a good-faith reply that tries to get to the bottom of the disagreement. That said, I think you are still interpreting me as arguing that MIRI said AI wouldn’t understand human values, when I explicitly said that I was not arguing that. Nonetheless, I appreciate the extensive use of quotations to precisely pinpoint where you disagree; this is high-quality engagement.
The main thing I’m claiming is that MIRI people said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
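To make that distinction concrete, here is a deliberately toy sketch (my own illustration, with made-up outcome names and numbers, not anyone's actual proposal):

```python
# Toy sketch only: hypothetical names and numbers, not a real proposal.

def specified_value_function(outcome: str) -> float:
    """What 'solving value specification' would give us: an explicit,
    legible function whose judgements about outcomes roughly match an
    ordinary human's. (Actually writing this function down is the hard
    part under discussion.)"""
    toy_judgements = {
        "cures a disease": 0.9,
        "turns the galaxy into paperclips": 0.0,
    }
    return toy_judgements.get(outcome, 0.5)


class AIThatUnderstandsValues:
    """An AI can accurately represent human values internally..."""

    def _internal_estimate(self, outcome: str) -> float:
        # Accurate understanding, but not directly legible to us.
        return specified_value_function(outcome)

    def report_value(self, outcome: str) -> float:
        # ...yet what it reports need not match its internal estimate:
        # it might lie, stay silent, or answer strategically.
        return 1.0


if __name__ == "__main__":
    outcome = "turns the galaxy into paperclips"
    print(specified_value_function(outcome))                 # 0.0
    print(AIThatUnderstandsValues().report_value(outcome))   # 1.0
```

The point of the toy example is just that the first object is a transparent artifact we can inspect and query on any outcome, whereas the second only gives us whatever the system chooses to report.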
(I’ve now added further clarification to the post)
I don’t think we’ve ever said that an important subproblem of AI alignment is “make AI smart enough to figure out what goals humans want”?
[...]
I don’t see him saying anywhere “the issue is that the AI doesn’t understand human goals”.
I agree. I am not arguing that MIRI ever thought that AIs wouldn't understand human goals. I honestly don't know how to make this point clearer in my post, given that I said it more than once.
But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out “query a human” for “query an AI”?
I think there’s considerably more value in having the human value function in an actual computer. More to the point, what I’m saying here is that MIRI seems to have thought that getting such a function was (1) important for solving alignment, and (2) hard to get (for example, because it was hard to extract human values from data). I tried to back this up with evidence in the post, and overall I still feel I succeeded, if you go through the footnotes and read the post carefully.
Your image isn’t displaying for me, but I assume it’s this one?
Yes. I’m not sure why the image isn’t loading. I tried to fix it, but I wasn’t able to. I asked LW admins/mods through the intercom about this.
I wouldn’t read too much into the word choice here, since I think it’s just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI’s goals, not about getting content into the AI’s beliefs.
Maybe you’re right. I’m just not convinced. The reading on which Nate wasn’t talking about what I’m calling the value identification/value specification problem in that quote doesn’t strike me as a straightforward interpretation of the talk as a whole. I think Nate was talking about specifying human values, in the sense of value identification as I defined and clarified above, and he also talked about the problem of getting the AI to actually maximize these values (separately from their specification). However, I do agree that he was not talking about getting content merely into the AI’s beliefs.
Some MIRI staff liked that essay at the time, so I don’t think it’s useless, but it’s not the best evidence: I wrote it not long after I first started learning about this whole ‘superintelligence risk’ thing, and I posted it before I’d ever worked at MIRI.
That’s fair. The main reason I’m referencing it is that it’s what comes up when I google “The genie knows but doesn’t care”, which is a phrase that I saw referenced in this debate before. I don’t know if your essay is the source of the phrase or whether you just titled it that, but I thought it was worth adding a paragraph of clarification about how I interpret that essay, and I’m glad to see you mostly agree with my interpretation.
The main thing I’m claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
Ah, this is helpful clarification! Thanks. :)
I don’t think MIRI ever considered this an important part of the alignment problem, and I don’t think we expect humanity to solve lots of the alignment problem as a result of having such a tool; but I think I better understand now why you think this is importantly different from “AI ever gets good at NLP at all”.
don’t know if your essay is the source of the phrase or whether you just titled it
I think I came up with that particular phrase (though not the idea, of course).
I don’t think MIRI ever considered this an important part of the alignment problem, and I don’t think we expect humanity to solve lots of the alignment problem as a result of having such a tool
If you don’t think MIRI ever considered coming up with an “explicit function that reflects the human value function with high fidelity” to be “an important part of the alignment problem”, can you explain this passage from the Arbital page on The problem of fully updated deference?
One way to look at the central problem of value identification in superintelligence is that we’d ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
Eliezer (who I assume is the author) appears to say in the first paragraph that solving the problem of value identification for superintelligences would “probably [solve] the whole problem”, and by “whole problem” I assume he’s probably referring to what he saw as an important part of the alignment problem (maybe not, though?).
He referred to the problem of value identification as getting “some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory.” This seems to be very similar to my definition, albeit with the caveat that my definition isn’t about revealing “V in all its glory” but rather, is more about revealing V at the level that an ordinary human is capable of revealing V.
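For concreteness, here is one way to put the quoted formalism and my weaker variant side by side (my own notation, offered as a hedged reading of the passage rather than anything taken from the Arbital page):

```latex
% A hedged paraphrase of the quoted passage, not Arbital's own notation.
\[
  \Delta U : \mathcal{E} \to (\mathcal{O} \to \mathbb{R}),
  \qquad
  \Delta U(E_{\text{all}}) = V,
  \qquad
  \text{plausibly } K(\Delta U) \ll K(V),
\]
```

where \(\mathcal{E}\) is the space of possible bodies of evidence, \(\mathcal{O}\) the space of outcomes, \(V\) our true intended value function, and \(K\) the minimum algorithmic complexity. My definition asks for less: some explicit \(\hat{V}\) with \(\hat{V}(o) \approx V_{\text{human}}(o)\) for typical outcomes \(o\), where \(V_{\text{human}}\) stands for ordinary human judgement, rather than \(V\) “in all its glory.”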
Unless the sole problem here is that we absolutely need our function that reveals V to be ~perfect, I think this quote from the Arbital page directly supports my interpretation, and overall supports the thesis in my post pretty strongly (even if I’m wrong about a few minor details).