Rob Bensinger: Nate and I tend to talk about “understandability” instead of “transparency” exactly because we don’t want to sound like we’re talking about normal ML transparency work.
Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability.
Ajeya Cotra: Thanks all—I like the project of trying to come up with a good handle for the kind of language model transparency we’re excited about (and have talked to Nick, Evan, etc. about it too), but I think I don’t want to push it in this blog post right now, because I haven’t hit on something I believe in and I want to ship this.
I feel like maybe part of what’s wrong with all the suggested terms (wrt pointing at what Ajeya is excited about) is that transparency, understandability, legibility, and readability all invoke the image of a human standing over a bit of silicon with a magnifying glass, reading off what’s going on inside. Ajeya is excited about asking GPT nicely to apply its medical knowledge, GPT complying, and us knowing that GPT is complying. Tools for figuring out what’s going on inside GPT are probably an important step to get to that point, especially for becoming confident that we’re at that point; but they’re not the end goal. The end goal is more like “GPT is frank with you” or “GPT does what you ask, rather than mimicking a human doing what you ask” or something like that.
Like, the property of understanding what it’s doing, rather than the tool that lets you examine it to reach that understanding.
This sits somewhere between the whole alignment problem and transparency.
This was a really helpful articulation, thanks! I like “frankness”, “forthrightness”, “openness”, etc. (These are all terms I was brainstorming to get at the “ascription universality” concept at one point.)
I expect there to be a massive and important distinction between “passive transparency” and “active transparency”, with the latter being much shakier and potentially concealing of fatal problems, and the former being cruder as tech in its present state, which is unfortunate, because it has so many fewer ways to go wrong. I hope any terminology chosen continues to make the distinction clear.
Possibly relevant here is my transparency trichotomy between inspection transparency, training transparency, and architectural transparency. My guess is that inspection transparency and training transparency would mostly go in your “active transparency” bucket, and architectural transparency would mostly go in your “passive transparency” bucket. There’s a position here that makes sense to me, and which is perhaps what you’re advocating: architectural transparency doesn’t rely on any sort of path-continuity arguments about how your training process will search through the space, since you’re just trying to guarantee that the whole space is transparent, which I do think is a pretty good desideratum, if it’s achievable. Imo, I mostly bite the bullet on path-continuity arguments being necessary, but it would definitely be nice if they weren’t.
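To make the buckets a bit more concrete, here’s a deliberately toy sketch of the first two legs of the trichotomy (hypothetical code, not any real transparency tooling; the model, the probe, and the L1 penalty are all illustrative stand-ins):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy two-layer classifier standing in for the model under discussion.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

x = torch.randn(64, 16)               # fake inputs
targets = torch.randint(0, 2, (64,))  # fake labels

# Inspection transparency: examine the trained model after the fact,
# e.g. read off hidden activations and fit a separate linear probe to them.
with torch.no_grad():
    hidden = model[1](model[0](x))  # activations after the ReLU
probe = nn.Linear(32, 2)  # would be trained on (hidden, targets) separately

# Training transparency: build pressure toward transparency into the
# objective itself; the L1 activation penalty is an oversimplified
# stand-in for whatever transparency metric you actually trust.
def loss_fn(x, targets, lam=1e-3):
    h = model[1](model[0](x))
    logits = model[2](h)
    return F.cross_entropy(logits, targets) + lam * h.abs().mean()

# Architectural transparency has no code hook here: it's the choice to
# restrict `model` itself to a class you can read directly, so that the
# whole search space is transparent regardless of the training trajectory.
```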