Becoming explicit about what different properties you mean and which metrics they score well on resolves the disagreement.
Indeed this can resolve disagreement among a small sub-group of active participants. This is an important tool if you want to make any progress.
but maybe I’m missing something?
The point I was trying to make is about what is achievable for the entire community, not what is achievable for a small sub-group of committed participants. The community of people who post on this site have absolutely no mechanism for agreeing among themselves whether a problem has been solved, or whether some sub-group has made meaningful progress on it.
To make the same point in another way: the forces which introduce disagreeing viewpoints and linguistic entropy to this forum are stronger than the forces that push towards agreement and clarity.
My thinking about how strong these forces are has been updated recently by the posting of a whole sequence of Yudkowsky conversations, and also by this one. In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody’s AGI Alignment Research Results. Painful to read.
I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.
The linguistic entropy point is countered by my previous point, right? Unless you want to say that not everyone who posts in this community is capable of doing that, or can do it naturally?
In these discussion logs, Yudkowsky goes to full Great more-epistemic-than-thou Philosopher mode, Confidently Predicting AGI Doom while Confidently Dismissing Everybody’s AGI Alignment Research Results. Painful to read.
Hahaha, yes. Yudkowsky can easily be interpreted as condescending and annoying in those dialogues (and he could’ve done a better job of not coming across that way). Though I believe the majority of the comments were in the spirit of understanding and coming to an agreement. Adam Shimi is also working on a post describing the disagreements in the dialogue as different epistemic strategies, meaning the cause of disagreement is non-obvious. Alignment is pre-paradigmatic, so agreeing is more difficult than in communities that have clear questions and metrics to measure them by. I still think we can succeed at the harder problem.
I am way past Denial and Bargaining, I have Accepted that this site is a community of philosophers.
By “community of philosophers”, you mean no one makes any actual progress on anything (or can agree that progress is being made)?
I believe Alex Turner has made progress on formalizing impact and power-seeking, and I’m not aware of parts of the community arguing this isn’t progress at all (though I don’t read every comment).
I also believe Vanessa’s and Diffractor’s Infra-Bayesianism is progress on thinking about probabilities, and am unaware of parts of the community arguing this isn’t progress (though there is a high mathematical bar to clear before you can understand it enough to criticize it).
I also believe Evan Hubinger et al.’s work on mesa-optimizers is quite clearly progress on crisply stating an alignment issue, and the community has largely agreed that it is progress.
Do you disagree with these examples, or disagree that they prove the community makes progress and agrees that progress is being made?
Yes, by calling this site a “community of philosophers”, I roughly mean that at the level of the entire community, nobody can agree that progress is being made. There is no mechanism for creating a community-wide agreement that a problem has been solved.
You give three specific examples of progress above. From his recent writings, it is clear that Yudkowsky does not believe, as you do, that any contributions posted on this site in the last few years have made any meaningful progress towards solving alignment. You and I may agree that some or all of the above three examples represent some form of progress, but you and I are not the entire community here; Yudkowsky is also part of it.
On the last one of your three examples, I feel that ‘mesa optimizers’ is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not like the people posting on this site are incapable of using the tools needed to crisply define things; the problem is that many do not seem very interested in ever using other people’s definitions or models as a frame of reference. They’d rather free-associate on the term, and then develop their own strongly held beliefs of what it is all supposed to be about.
I am sensing from your comments that you believe that, with more hard work and further progress on understanding alignment, it will in theory be possible to make this community agree, in future, that certain alignment problems have been solved. I, on the other hand, do not believe that it is possible to ever reach that state of agreement in this community, because the debating rules of philosophy apply here.
Philosophers are always allowed to disagree based on strongly held intuitive beliefs that they cannot be expected to explain any further. The type of agreement you seek is only possible in a sub-community which is willing to use more strict rules of debate.
This has implications for policy-related alignment work. If you want to make a policy proposal that has a chance of being accepted, it is generally required that you can point to some community of subject matter experts who agree on the coherence and effectiveness of your proposal. LW/AF cannot serve as such a community of experts.
On the last one of your three examples, I feel that ‘mesa optimizers’ is another regrettable example of the forces of linguistic entropy overwhelming any attempts at developing crisply stated definitions which are then accepted and leveraged by the entire community. It is not like the people posting on this site are incapable of using the tools needed to crisply define things; the problem is that many do not seem very interested in ever using other people’s definitions or models as a frame of reference. They’d rather free-associate on the term, and then develop their own strongly held beliefs of what it is all supposed to be about.
Yes, clarity isn’t optional.
MIRI abandoned the idea of producing technology a long time ago, so what it will offer to the people who are working on AI technology is some kind of theory expressed in some kind of document, which will be of no use to them if they can’t understand it.
And it takes a constant parallel effort to keep the lines of communication open. It’s no use “woodshedding”, spending a lot of time developing your own ideas in your own language.