I have read many of your posts on these topics, appreciate them, and I get value from the model of you in my head that periodically checks for these sorts of reasoning mistakes.
But I worry that the focus on ‘bad terminology’ rather than reasoning mistakes themselves is misguided.
To choose the most clear-cut example, I’m quite confident that when I say ‘expectation’ I mean ‘weighted average over a probability distribution’ and not ‘anticipation of an inner consciousness’. Perhaps some people conflate the two, in which case it’s useful to disabuse them of the confusion, but I really would not like it to become the case that every time I said ‘expectation’ I had to add a caveat to prove I know the difference, lest I get ‘corrected’ or sneered at.
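For concreteness, the first sense is a purely mechanical computation; a minimal sketch (the die example is my own, not anything from this discussion):

```python
# Expectation in the probabilistic sense: a weighted average of outcomes
# under a probability distribution, E[X] = sum over x of x * P(X = x).
# Illustrative example: a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6  # uniform distribution

expectation = sum(x * p for x, p in zip(outcomes, probs))
# expectation is (approximately) 3.5 -- no inner anticipation involved
```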
For a probably more contentious example, I’m also reasonably confident that when I use the phrase ‘the purpose of RL is to maximise reward’, the thing I mean by it is something you wouldn’t object to, and which does not cause me confusion. And I think those words are a straightforward way to say the thing I mean. I agree that some people have mistaken heuristics for thinking about RL, but I doubt you would disagree very strongly with mine, and yet if I were to talk to you about RL I feel I would be walking on eggshells, trying to use long-winded language so as not to get marked down as one of ‘those idiots’.
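To make my reading concrete, here is a minimal REINFORCE-style bandit (a toy of my own, with made-up payoff probabilities). In this sense, ‘maximise reward’ names what the update rule does to the policy, with no claim about what the trained policy ‘wants’:

```python
import math
import random

# Toy two-armed bandit trained with a REINFORCE-style update (my own
# illustration, not a claim about any real system). The training
# process shifts the policy toward actions that were followed by reward.
random.seed(0)
logits = [0.0, 0.0]       # policy preferences for arms 0 and 1
true_reward = [0.2, 0.8]  # arm 1 pays off more often (made-up numbers)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

lr = 0.1
for _ in range(2000):
    p = softmax(logits)
    arm = 0 if random.random() < p[0] else 1
    reward = 1.0 if random.random() < true_reward[arm] else 0.0
    # REINFORCE gradient for a softmax policy:
    # d log p(arm) / d logit_i = 1[i == arm] - p[i]
    for i in (0, 1):
        logits[i] += lr * reward * ((1.0 if i == arm else 0.0) - p[i])

final_p = softmax(logits)  # ends up strongly preferring arm 1
```

The update rule is the thing ‘maximising reward’ here; nothing in the loop requires the resulting policy to represent or pursue reward.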
I wonder if it’s better, as a general rule, to focus on policing arguments rather than language? If somebody uses terminology you dislike to generate a flawed reasoning step and arrive at a wrong conclusion, then you should be able to demonstrate the mistake by unpacking the terminology into your preferred version, and it’s a fair cop.
But until you’ve seen them use it to reason poorly, perhaps it’s a good norm to assume they’re not confused about things, even if the terminology feels like it has misleading connotations to you.
There’s a difficult problem here.

Personally, when I see someone using the sorts of terms Turner is complaining about, I mentally flag it (and sometimes verbally flag it, saying something like “Not sure if it’s relevant yet, but I want to flag that we’re using <phrase> loosely here, we might have to come back to that later”). Then I mentally track both my optimistic guess at what the person is saying, and the thing I would mean if I used the same words internally. If and when one of those mental pictures throws an error in the person’s argument, I’ll verbally express confusion and unroll the stack.
A major problem with this strategy is that it taxes working memory heavily. If I’m tired, I basically can’t do it. I would guess that people with less baseline working memory to spare just wouldn’t be able to do it at all, typically. Skill can help somewhat: it helps to be familiar with an argument already, it helps to have the general-purpose skill of keeping at least one concrete example in one’s head, it helps to ask for examples… but even with the skills, working memory is a pretty important limiting factor.
So if I’m unable to do the first-best thing at the moment, what should I fall back on? In practice I just don’t do a very good job following arguments when tired, but if I were optimizing for that… I’d probably fall back on asking for a concrete example every time someone uses one of the words Turner is complaining about. Wording would be something like “Ok pause, people use ‘optimizer’ to mean different things, can you please give a prototypical example of the sort of thing you mean so I know what we’re talking about?”.
… and of course when reading something, even that strategy is a pain in the ass, because I have to e.g. leave a comment asking for clarification, and then the turnaround time is very slow.
I’m sympathetic to your comment, but let me add some additional perspective.
While using (IMO) imprecise or misleading language doesn’t guarantee you’re reasoning improperly, it is evidence from my perspective. As you say, that doesn’t mean one should “act” on that evidence by “correcting” the person, and often I don’t. Just the other day I had a long conversation in which the other person and I both talked about the geometry of the mapping from reward functions to optimal policies in MDPs.
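(As a toy illustration of that mapping, entirely mine and not from the conversation itself: the map from reward functions to optimal policies is many-to-one, e.g. positively scaling a reward function leaves the optimal policy unchanged.)

```python
# Toy 2-state, 2-action deterministic MDP: taking action a always moves
# to state a. Shows that distinct reward functions (here, one a positive
# scaling of the other) can induce the same optimal policy.
nxt = [[0, 1], [0, 1]]  # nxt[s][a] = successor state
gamma = 0.9

def optimal_policy(reward):
    """reward[s][a]; returns the greedy policy after value iteration."""
    V = [0.0, 0.0]
    for _ in range(500):  # value iteration to (near) convergence
        V = [max(reward[s][a] + gamma * V[nxt[s][a]] for a in (0, 1))
             for s in (0, 1)]
    return tuple(max((0, 1), key=lambda a: reward[s][a] + gamma * V[nxt[s][a]])
                 for s in (0, 1))

r1 = [[0.0, 1.0], [0.0, 1.0]]  # reward for entering state 1
r2 = [[0.0, 5.0], [0.0, 5.0]]  # the same reward, scaled by 5
# optimal_policy(r1) and optimal_policy(r2) agree: always pick action 1
```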
I think I do generally only criticize terminology when I perceive an actual reasoning mistake. This might be surprising, but that’s probably because I perceive reasoning mistakes all over the place in ways which seem tightly intertwined with language and word-games.
Exception: if someone has signed up to be mentored by me, I will mention, “BTW, I find it helps my thinking to use word X instead of Y; do what you want.”
You might have glossed over the part where I tried to emphasize “at least try to do this in the privacy of your own mind, even if you use these terms to communicate with other people.” This part interfaces with your “eggshells” concern.
It’s important to realize that such language creates a hostile environment for reasoning, especially for new researchers. Statistically, some people will be misled, and the costs can be great. To be concrete, I probably wasted about 3,000 hours of my life due to these “word games.”
Nearly all language has undue connotations. For example, “reinforcement” is not a perfectly neutral technical word, but it sure is better than “reward.” Furthermore, I think that we can definitely do better than using extremely loaded terms like “saints.”
But until you’ve seen them use it to reason poorly, perhaps it’s a good norm to assume they’re not confused about things, even if the terminology feels like it has misleading connotations to you.
Well, not quite what I was trying to advocate. I didn’t conclude that many people are confused about things because I saw their words and thought they were bad. I concluded that many people are confused about things because I repeatedly:
saw their words,
thought the words were bad,
talked with the person and perceived reasoning mistakes mirroring the badness in their words,
and then concluded they are confused!