johnswentworth comments on Symbol/Referent Confusions in Language Model Alignment Experiments

johnswentworth 26 Oct 2023 22:59 UTC
11 points
6
Consider this through the lens of epistemic standards for discourse, as opposed to evidence strength.
Like, consider psychology studies. IIUC, if a psychology study estimates how many people have ever done X by asking people how many times they’ve ever done X, then that study title would usually be expected to say something like “this many people report having done X”, as opposed to “this many people have done X”. If the study title was “this many people have done X”, when their methodology was actually just to ask people, then we’d consider that a form of low-key dishonesty. It’s a misleading headline, at the bare minimum. The sort of thing where colleagues read it and give you an annoyed glare for over-sensationalizing your title.
Same thing here. If you measure whether a language model says it’s corrigible, then an honest claim would be “the language model says it’s corrigible”. To summarize that as “showing corrigibility in a language model” (as Simon does in the first line of this post) is, at best, extremely misleading under what-I-understand-to-be ordinary norms of scientific discourse.
(Though, to be clear, my strong expectation is that most people making the sort of claim Simon does in the post are not being intentionally misleading. I expect that they usually do not notice at all that they’re measuring whether the LM claims to be corrigible, rather than whether the LM is corrigible.)
Returning to the frame of evidence strength: part of the reason for this sort of norm is that it lets the listener decide how much evidence “person says X” gives about “X”, rather than the claimant making that decision on everybody else’ behalf and then trying to propagate their conclusion.
- tailcalled 27 Oct 2023 6:33 UTC
  11 points
  4
  Parent
  
  Like, consider psychology studies. IIUC, if a psychology study estimates how many people have ever done X by asking people how many times they’ve ever done X, then that study title would usually be expected to say something like “this many people report having done X”, as opposed to “this many people have done X”. If the study title was “this many people have done X”, when their methodology was actually just to ask people, then we’d consider that a form of low-key dishonesty. It’s a misleading headline, at the bare minimum. The sort of thing where colleagues read it and give you an annoyed glare for over-sensationalizing your title.
  
  That is a very optimistic view of psychology research. In practice, a Survey Chicken points out, psychology research overwhelmingly makes strong claims due to surveys:
  
  In the abstract, I think a lot of people would agree with me that surveys are bullshit. What I don’t think is widely known is how much “knowledge” is based on survey evidence, and what poor evidence it makes in the contexts in which it is used. The nutrition study that claims that eating hot chili peppers makes you live longer is based on surveys. The twin study about the heritability of joining a gang or carrying a gun is based on surveys of young people. The economics study claiming that long commutes reduce happiness is based on surveys, as are all studies of happiness, like the one that claims that people without a college degree are much less happy than they were in the 1970s. The study that claims that pornography is a substitute for marriage is based on surveys. That criminology statistic about domestic violence or sexual assault or drug use or the association of crime with personality factors is almost certainly based on surveys. (Violent crime studies and statistics are particularly likely to be based on extremely cursed instruments, especially the Conflict Tactics Scale, the Sexual Experiences Survey, and their descendants.) Medical studies of pain and fatigue rely on surveys. Almost every study of a psychiatric condition is based on surveys, even if an expert interviewer is taking the survey on the subject’s behalf (e.g. the Hamilton Depression Rating Scale). Many studies that purport to be about suicide are actually based on surveys of suicidal thoughts or behaviors. In the field of political science, election polls and elections themselves are surveys.
  
  There are a few reasons for this.
  
  One would be that scientific standards in psychology (and really lots of sciences?) are abysmal, so people get away with making sketchy claims on weak evidence.
  
  A second is that surveys are very cheap and efficient. You can get tons of bits of information from a person with very little time and effort.
  
  But I think a third reason is more optimistic (at least wrt the choice of surveys, maybe not wrt the feasibility of social science): surveys are typically the most accurate source of evidence.
  
  Like if you want to study behaviors, you could use lab experiments, but this assumes you can set up a situation in your lab that is analogous with situations in real life, which is both difficult and expensive and best validated by surveys that compare the lab results to self-reports. Or e.g. if you go with official records like court records, you run into the issue that very few types of things are recorded, and the types of things that are recorded may not have their instances recorded much more reliably than surveys do. Or e.g. maybe you can do interventions in real life, e.g. those studies that send out job applications to test for discrimination, but there are only few places where you may intervene for a scientific study, and even when you do, ypur information stream is low-bandwidth and so noisy that you need to aggregate things statistically and eliminate any individual-level information to have something useful.
  - Nathan Helm-Burger 27 Oct 2023 17:04 UTC
    5 points
    3
    Parent
    I agree with Tailcalled on this, and had a lot of frustration around these issues with psychology when I was studying psych as an undergrad. I also think that johnswentworth’s point stands. The psychologists may not follow the norms he describes in practice, but they clearly OUGHT to, and we should hold ourselves to the higher standard of accuracy. Imitating the flawed example of current psychology practices would be shooting ourselves in the foot, undermining our ability to seek the truth of the matter.
    - tailcalled 27 Oct 2023 17:22 UTC
      4 points
      0
      Parent
      I agree that the point about discourse still stands.
  - MondSemmel 28 Oct 2023 13:27 UTC
    2 points
    0
    Parent
    One point I recall from the book Stumbling on Happiness is (and here I’m paraphrasing from poor memory; plus the book is from 2006 and might be hopelessly outdated) that when you e.g. try to analyze concepts like happiness, a) it’s thorny to even define what it is (e.g. reported happiness in the moment is different from life satisfaction, i.e. a retrospective sense that one’s life went well), and b) it’s hard to find better measurable proxies for this concept than relying on real-time first-person reports. You might e.g. correlate happiness w/ proportion of time spent smiling, but only because the first-person reports of people who smile corroborate that they’re indeed happy. Etc. Put differently, if first-person reports are unreliable, but your proxies rely on those first-person reports, then it’s hard to find a measure that’s more reliable.
- TurnTrout 30 Nov 2023 11:40 UTC
  4 points
  0
  Parent
  Returning to the frame of evidence strength: part of the reason for this sort of norm is that it lets the listener decide how much evidence “person says X” gives about “X”, rather than the claimant making that decision on everybody else’ behalf and then trying to propagate their conclusion.
  This is a good norm. Sideways of this point, though, it seems to me that it’d be good to note both “it’s confused to say X” and also “but there’s a straightforward recovery Y of the main point, which some people find convincing and others don’t.”

johnswentworth comments on Symbol/​Referent Confusions in Language Model Alignment Experiments

johnswentworth comments on Symbol/Referent Confusions in Language Model Alignment Experiments