If my understanding of your position is correct: you wouldn’t disagree with that claim, but you would doubt there’s a good path to a strong corrigible agent of that approximate form built atop something like modern-architecture language models scaled up in capability.
Yes, though that’s separate from the point of the post.
The post is not trying to argue that corrigibility in LLMs is difficult, or that demonstrating (weak) corrigibility in LLMs is difficult. The post is saying that certain ways of measuring corrigibility in LLMs fail to do so, and people should measure it in a way which actually measures what they’re trying to measure.
In particular, I am definitely not saying that everyone arguing that LLMs are corrigible/aligned/etc are making the mistake from the post.
There have been times when the other person goes into teacher-mode and tries e.g. a Socratic dialogue to get me to realize an error they think I’m making, only to discover some minutes later that the claim I was making was unrelated to, and not in contradiction with, the point they were making.
I indeed worry about this failure-mode, and am quite open to evidence that I’m mis-modeling people.
(In practice, when I write this sort of thing, I usually get lots of people saying “man, that’s harsh/inconsiderate/undiplomatic/etc” but a notable lack of people arguing that my model-of-other-people is wrong. I would be a lot happier if people actually told me where my model was wrong.)
I mean, fundamentally, I think if someone offers X as evidence of Y in implicit context Z, and is correct about this, but makes a mistake in their reasoning while doing so, a reasonable response is “Good insight, but you should be more careful in way M,” rather than “Here’s your mistake, you’re gullible and I will recognize you only as student,” with zero acknowledgment that X actually is evidence for Y in implicit context Z.
Suppose someone had endorsed some intellectual principles along these lines:
Same thing here. If you measure whether a language model says it’s corrigible, then an honest claim would be “the language model says it’s corrigible”. To summarize that as “showing corrigibility in a language model” (as Simon does in the first line of this post) is, at best, extremely misleading under what-I-understand-to-be ordinary norms of scientific discourse....
Returning to the frame of evidence strength: part of the reason for this sort of norm is that it lets the listener decide how much evidence “person says X” gives about “X”, rather than the claimant making that decision on everybody else’s behalf and then trying to propagate their conclusion.
I think applying this norm to judgements about people’s character straightforwardly means that it’s great to show how people make mistakes and to explain them; but the part where you move from “person A says B, which is mistaken in way C” to “person A says B, which is mistaken in way C, which is why they’re gullible” is absolutely not a good move under the what-I-understand-to-be-ordinary norms of scientific discourse.
Someone who did that would be straightforwardly making a particular decision on everyone else’s behalf and trying to propagate their conclusion, rather than simply offering evidence.