Rohin Shah comments on Reviewing the Review

Rohin Shah 26 Feb 2020 17:14 UTC
2 points
It’s possible the Alignment Review might include things not published on the Alignment Forum
I don’t want this. There’s a field of alignment outside of the community that uses the Alignment Forum, with very different ideas about how progress is made; it seems bad to have an evaluation of work they produce according to metrics that they don’t endorse.
(I’m imagining a new entrant to the field deciding whether to go to CHAI, Ought, FHI, OpenAI or DeepMind based on the results of this review, and thinking “oh god no that sounds horrible”.)
- Raemon 26 Feb 2020 22:53 UTC
  4 points
  Parent
  Nod. (to be clear this is still all in the early brainstorming stage. I ran the idea by ~3 alignment people, but an Alignment Review of any kind, AF-forum specific or otherwise, would be something I wanted to get a lot of buy-in for and iron out the details of before attempting)
  (I’m imagining a new entrant to the field deciding whether to go to CHAI, Ought, FHI, OpenAI or DeepMind based on the results of this review, and thinking “oh god no that sounds horrible”.)
  I’m not actually sure I understood the intended point here, wondering if you could rephrase it in somewhat different words.
  Current Thoughts
  Here’s a more fleshed out version of my current thinking (again, all of this is intended to be early stage brainstorming-type-thoughts)
  Taking a step back, there a few different problems that seem worth solving. The core thing is “getting alignment right is a high stakes and confusing problem, and many people disagree about what counts as progress.”
  One problem is “Even within individual alignment paradigms, there’s not much common knowledge of what sorts of results are meaningful. Or even what exactly the paradigm is.” (this currently seems true to me, although I can imagine some paradigms being clearer than others)
  Another problem is “It’s not obvious how to evaluate across paradigms, or what paradigm is right, or if multiple paradigms are right for different reasons, how to integrate them.” And that’s also an important problem to actually solve some day.
  I can see reasons to not try to solve the second problem until there’s clearer consensus on the first problem, within at least some clusters of researchers. But I can also imagine them not being (much) harder to solve together than separately.
  A few possibilities I can imagine, knobs to turn, and/or considerations:
  Considerations
  Less “rank ordering evaluative-ness”?
  If I naively imagine using something close to the 2019 review for alignment (even within a single paradigm), I expect my concerns about “sort by prestige” to be much worse, because there are greater political consequences that one could screw up (and, lack of common knowledge about how large those consequences are and how bad they might be might make everyone too anxious to get buy-in).
  So, I would currently guess that even for an Alignment Forum centric review that just focused on itself, I’d be particularly wary of the output being obviously rank-ordered. It may be better to be more like a survey that focuses on qualitative rather than quantitative questions. (In places where it’s necessary to produce a list of posts, have the list randomized or something)
  (I think at some point it is fairly important to actually be able to say ‘this stuff was most important’ and give prizes and stuff, but that might be something you do more Nobel Prize style a decade+ after the fact)
  Option #1 – AF-Centric
  The simplest option is “just focus on AF evaluating itself.” This comes with less downside risk, but also less upside. I think there’s a decent chance that the most important progress will have come outside of the Alignment Forum. If AF isn’t producing ideas that seem meaningful outside of its own ecosystem, that’s important to know. (It’s not obviously bad, maybe the ideas don’t seem meaningful because the broader ecosystem is wrong, but knowing which sorts of posts were seemed good to a wider variety of researchers is useful)
  Option #2 – A Non-Alignment-Forum-Review, with Buy In From Everyone
  The far end of a spectrum would be to not host the review on Alignment Forum, instead creating a new body that’s specifically aiming to be representative of various subfields and paradigms of people who are working on something reasonably called “AI alignment”, and get each of their opinions.
  It’s not really obvious where to draw that boundary, but it’d likely make sense to get people OpenAI, Deepmind and CHAI. (I’m not as up-to-date on what Ought or FHI do, but depending on what sort of research they do I could see it)
  Option #3 – Alignment Forum Review of Everything, with Buy In From Everyone
  Alternately, you might say “well, we tried to create a place where people from various paradigms could communicate with each other, and that place was the Alignment Forum, and insofar as that feels like a place with some-particular-paradigm where other paradigms are kinda unwelcome, that’s something that should be fixed, and part of the point here might be to extend an olive branch to other places and get them more involved.”
  A recent observation someone made in person is that the AF is filtered for people who like to comment on the internet, which isn’t the same filter as “people who like Agent Foundations”, but there is some history that sort of conflates that. And meanwhile researchers elsewhere may not want to get dragged into internet discussions in the first place.
  One thing the Review might offer is a way for people-who-don’t-like-internet-discussions to quickly fill out a survey that gives them more of a voice into “the AF consensus” without having to get involved with extended discussions. (Or to have those discussions be time-boxed to one month.)
  This is currently my favorite option (though not one that’s obviously endorsed by anyone else).
  Option #4 – Alignment Forum Review of Everything, with Buy In From AF and that’s it
  Somewhere else on a (possibly multidimensional) spectrum is to say “Okay, AF has some kind of opinion on what paradigms are good. That’s supposed to be a relatively broad consensus that people at CHAI, OpenAI, MIRI and at least some people from deepmind. It might currently disproportionately favor MIRI for historical reasons, which is kinda bad. But, the optimal version of itself is still somewhat opinionated, compared to the broader landscape.”
  And in that case, yes, it’d be evaluating things on different metrics than people wanted for themselves. But… that seems fine? It’s an important part of science as an institution that people get to evaluate you based on different things than you might have wanted to be evaluated on.
  Conferences you submit papers to can reject you, journals might be aiming to focus on particular subfields and maybe you think your thing is relevant to their subfield but they don’t.
  Most nonprofits aren’t trying to optimize for Givewell’s goals, but it was good that Givewell set up a system of evaluating that said “if you care about goal X, we think these are the best nonprofits, here’s why.” The collective system of evaluation from different vantage points is a key piece of intellectual progress.
  - Vanessa Kosoy 27 Feb 2020 6:48 UTC
    5 points
    Parent
    
    If I naively imagine using something close to the 2019 review for alignment (even within a single paradigm), I expect my concerns about “sort by prestige” to be much worse, because there are greater political consequences that one could screw up (and, lack of common knowledge about how large those consequences are and how bad they might be might make everyone too anxious to get buy-in).
    
    I don’t think so.
    
    Your main example for the prestige problem with the LW review was “affordance widths”. I admit that I was one of the people who assigned a lot of negative points to “affordance widths”, and also that I did it not purely on abstract epistemic grounds (in those terms the essay is merely mediocre) but because of the added context about the author. When I voted, the question I was answering was “should this be included in Best of 2018”, including all considerations. If I wasn’t supposed to do this then I’m sorry, I haven’t noticed before.
    
    The main reason I think it would be terrible to include “affordance widths” is not exactly prestige. The argument I used before is prestige-based, but that’s because I expected this part to be more broadly accepted, and wished to avoid the more charged debate I anticipated if I ventured closer to the core. The main reason is, I think it would send a really bad message to women and other vulnerable populations who are interested in LessWrong: not because of the identity of the author, but because the essay was obviously designed to justify the author’s behavior. Some of the reputational ramifications of that would be well-earned (although I also expect the response to be disproportional).
    
    On the other hand, it is hard for me to imagine anything of the sort applying to the Alignment Forum. It would be much more tricky to somehow justify sexual abuse through discussion about AI risk, and if someone accomplished it then surely the AI-alignment-qua-AI-alignment value of that work would be very low. The sort of political considerations that do apply here are not considerations that would affect my vote, and I suspect (although ofc I cannot be sure) the same is true about most other voters.
    
    Also, next time I will adjust my behavior in the LW vote also, since clearly it is against the intent of the organizers. However, I suggest that some process is created in parallel to the main vote, where context-dependent considerations can be brought up, either for public discussion or for the attention of the moderator team specifically.
    - Raemon 27 Feb 2020 21:17 UTC
      2 points
      Parent
      To be clear, given the vote system we went with (which basically rolled all considerations into a single vote, and ask), I don’t think there was anything wrong with voting against Affordance Widths for that reason.
      I saw this more as “the system wasn’t well designed, we should use a better system next time.”
      (Different LW team members also had different opinions on what exactly the Review should be doing and why, and some changed their mind over the course of the process, which is part of why some of the messaging was mixed).
      The reason I thought (at the time) it was best to “just collapse everything into one vote, which is tied fairly closely to ‘what should be in the book?’” was that if you told people it was about “being honest about good epistemics”, but the result still ended up influencing the book, you’d have something of an asshole filter where some people vote strategically and are disproportionately rewarded.”
      I think I may have some conceptual disagreements with your framing, but my current goal for next year is to structure things in a way that separates out truth, usefulness, and broader reputational effects from each other, so that the process is more robust to people coming at it with different goals and frames.
      - Raemon 27 Feb 2020 21:26 UTC
        2 points
        Parent
        The reason I’m more worried about this for an Alignment Review is that the stakes are higher, and it is not only important that the process be epistemically sound, but for everyone to believe it’s epistemically sound and/or fair. (And meanwhile sexual abuse isn’t the only possible worrisome thing to come up)
  - Rohin Shah 27 Feb 2020 0:07 UTC
    2 points
    Parent
    I’m not actually sure I understood the intended point here, wondering if you could rephrase it in somewhat different words.
    There are two pretty different approaches to AI safety, which I could uncharitably call MIRI-rationalist vs. everyone else. (I don’t have an accurate charitable name for the difference.) I claim that AF sees mostly just the former perspective. See this comment thread. (Standard disclaimers about “actually this is a spectrum and there are lots of other features that people disagree on”, the point is that this is an important higher-order bit.)
    I think that for both sides:
    Their work is plausibly useful
    They don’t have a good model of why the other side’s work is useful
    They don’t expect the other side’s work to be useful on their own models
    Given this, I expect that ratings by one side of the other side’s work will not have much correlation with which work is actually useful.
    So, such a rating seems to have not much upside, and does have downside, in that non-experts who look at these ratings and believe them will get wrong beliefs about which work is useful.
    (I already see people interested in working on CHAI-style stuff who say things that MIRI-rationalist viewpoint says where my internal response is something like “I wish you hadn’t internalized these ideas before coming here”.)
    I expect my concerns about “sort by prestige” to be much worse
    I agree with this but it’s not my main worry.
    The far end of a spectrum would be to not host the review on Alignment Forum, instead creating a new body that’s specifically aiming to be representative of various subfields and paradigms of people who are working on something reasonably called “AI alignment”, and get each of their opinions.
    This would be good if it could be done; I’d support it (assuming that you actually get a representative body). I think this is hard, but that doesn’t mean it’s impossible / not worth doing, and I’d want a lot of the effort to be in ensuring that you get a representative body.
    A recent observation someone made in person is that the AF is filtered for people who like to comment on the internet, which isn’t the same filter as “people who like Agent Foundations”, but there is some history that sort of conflates that. And meanwhile researchers elsewhere may not want to get dragged into internet discussions in the first place.
    I don’t think this is the main selection effect to worry about.
    Okay, AF has some kind of opinion on what paradigms are good. That’s supposed to be a relatively broad consensus that people at CHAI, OpenAI, MIRI and at least some people from deepmind.
    It’s not a broad consensus. CHAI has ~10 grad students + a few professors, research engineers, and undergrads; only Daniel Filan and I could reasonably said to be part of AF. OpenAI has a pretty big safety team (>10 probably); only Paul Christiano could reasonably be said to be part of AF. Similarly for DeepMind, where only Richard Ngo would count.
    But, the optimal version of itself is still somewhat opinionated, compared to the broader landscape.
    Seems right; we just seem very far from this version.
    And in that case, yes, it’d be evaluating things on different metrics than people wanted for themselves. But… that seems fine? It’s an important part of science as an institution that people get to evaluate you based on different things than you might have wanted to be evaluated on.
    Agreed for the optimal version.
    Conferences you submit papers to can reject you, journals might be aiming to focus on particular subfields and maybe you think your thing is relevant to their subfield but they don’t.
    I’d be pretty worried if a bunch of biology researchers had to decide which physics papers should be published. (This exaggerates the problem, but I think it does qualitatively describe the problem.)
    Most nonprofits aren’t trying to optimize for Givewell’s goals, but it was good that Givewell set up a system of evaluating that said “if you care about goal X, we think these are the best nonprofits, here’s why.”
    Nonprofits should be accountable to their donors. X-risk research should be accountable to reality. You might think that accountability to an AF review would be a good proxy for this, but I think it is not.
    (You might find it controversial to claim that nonprofits should be accountable to donors, in which case I’d ask why it is good for GiveWell to set up such a system of evaluation. Though this is not very cruxy for me so maybe just ignore it.)
    - Vanessa Kosoy 27 Feb 2020 7:12 UTC
      24 points
      Parent
      
      I don’t want this. There’s a field of alignment outside of the community that uses the Alignment Forum, with very different ideas about how progress is made; it seems bad to have an evaluation of work they produce according to metrics that they don’t endorse.
      
      This seems like a very strange claim to me. If the proponents of the MIRI-rationalist view think that (say) a paper by DeepMind has valuable insights from the perspective of the MIRI-rationalist paradigm, and should be featured in “best [according to MIRI-rationalists] of AI alignment work in 2018”, how is it bad? On the contrary, it is very valuable the the MIRI-rationalist community is able to draw each other’s attention to this important paper.
      
      So, such a rating seems to have not much upside, and does have downside, in that non-experts who look at these ratings and believe them will get wrong beliefs about which work is useful.
      
      Anything anyone says publicly can be read by a non-expert, and if something wrong was said, and the non-expert believes it, then the non-expert gets wrong beliefs. This is a general problem with non-experts, and I don’t see how is it worse here. Of course if the MIRI-rationalist viewpoint is true then the resulting beliefs will not be wrong at all. But this just brings us back to the object-level question.
      
      (I already see people interested in working on CHAI-style stuff who say things that MIRI-rationalist viewpoint says where my internal response is something like “I wish you hadn’t internalized these ideas before coming here”.)
      
      So, not only is the MIRI-rationalist viewpoint wrong, it is so wrong that it irreversibly poisons the mind of anyone exposed to it? Isn’t it a good idea to let people evaluate ideas on their own merits? If someone endorses a wrong idea, shouldn’t you be able to convince em by presenting counterarguments? If you cannot present counterarguments, how are you so sure the idea is actually wrong? If the person in question cannot understand the counterargument, doesn’t it make em much less valuable for your style of work anyway? Finally, if you actually believe this, doesn’t it undermine the entire principle of AI debate? ;)
      - Rohin Shah 27 Feb 2020 17:29 UTC
        5 points
        Parent
        If the proponents of the MIRI-rationalist view think that (say) a paper by DeepMind has valuable insights from the perspective of the MIRI-rationalist paradigm, and should be featured in “best [according to MIRI-rationalists] of AI alignment work in 2018”
        That seems mostly fine and good to me, but I predict it mostly won’t happen (which is why I said “They don’t expect the other side’s work to be useful on their own models”). I think you still have the “poisoning” problem as you call it, but I’m much less worried about it.
        I’m more worried about the rankings and reviews, which have a much stronger “poisoning” problem.
        Anything anyone says publicly can be read by a non-expert, and if something wrong was said, and the non-expert believes it, then the non-expert gets wrong beliefs. This is a general problem with non-experts, and I don’t see how is it worse here.
        Many more people are likely to read the results of a review, relative to arguments in the comments of a linkpost to a paper.
        Calling something a “review”, with a clear process for generating a ranking, grants it much more legitimacy that one person saying something on the Internet.
        So, not only is the MIRI-rationalist viewpoint wrong, it is so wrong that it irreversibly poisons the mind of anyone exposed to it?
        Not irreversibly.
        Isn’t it a good idea to let people evaluate ideas on their own merits?
        When presented with the strongest arguments for both sides, yes. Empirically that doesn’t happen.
        If someone endorses a wrong idea, shouldn’t you be able to convince em by presenting counterarguments?
        I sometimes can and have. However, I don’t have infinite time. (You think I endorse wrong ideas. Why haven’t you been able to convince me by presenting counterarguments?)
        Also, for non-experts this is not necessarily true (or is true only in some vacuous sense). If a non-expert sees within a community of experts 50 people arguing for A, and 1 person arguing for not-A, even if they find the arguments for not-A compelling, in most cases they should still put high credence on A.
        (The vacuous sense in which it’s true is that the non-expert could become an expert by spending hundreds or thousands of hours becoming an expert, in which case they can evaluate the arguments on their own merits.)
        If you cannot present counterarguments, how are you so sure the idea is actually wrong?
        I in fact can present counterarguments, it just takes a long time.
        If the person in question cannot understand the counterargument, doesn’t it make em much less valuable for your style of work anyway?
        Empirically, it seems that humans have very “sticky” worldviews, such that whichever worldview they first inhabit, it’s very unlikely that they switch to the other worldview. So depending on what you mean by “understand”, I could have two responses:
        They “could have” understood (and generated themselves) the counterargument if they had started out in the opposite worldview
        No one currently in the field is able to “understand” the arguments of the other side, so it’s not a sign of incompetence if a new person cannot “understand” such an argument
        Obviously ideal Bayesians wouldn’t have “sticky” worldviews; it turns out humans aren’t ideal Bayesians.
        Finally, if you actually believe this, doesn’t it undermine the entire principle of AI debate?
        If you mean debate as a proposal for AI alignment, you might hope that we can create AI systems that are closer to ideal Bayesian reasoners than we are, or you might hope that humans who think for a very long time are closer to ideal Bayesian reasoners. Either way, I agree this is a problem that would have to be dealt with.
        If you mean debate as in “through debate, AI alignment researchers will have better beliefs”, then yes, it does undermine this principle. (You might have noticed that not many alignment researchers try to do this sort of debate.)
    - Raemon 27 Feb 2020 0:50 UTC
      4 points
      Parent
      A lot of those concerns seem valid. I recalled the earlier comment thread and had it in mind while I was writing the response comment. (I agree that “viewpoint X” is a thing, and I don’t even think it’s that uncharitable to call it the MIRI/rationalist viewpoint, although it’s simplified)
      Fwiw, while I prefer option #3 (I just added #s to the options for easier reference), #2 and #4 both seem pretty fine. And whichever option one went with, getting representative members seems like an important thing to put a lot of effort into.
      My current sense is that AF was aiming to be a place where people-other-than-Paul at OpenAI would feel comfortable participating. I can imagine it turns out “AF already failed to be this sufficiently that if you want that, you need to start over,” but it is moderately expensive to start over. I would agree that this would require a lot of work, but seems potentially quite important and worthwhile.
      What are the failure modes you imagine, and/or how much harder do you think it is, to host the review on AF, while aiming for a broader base of participants that AF currently feels oriented towards? (As compared to the “try for a broad base of participants and host it somewhere other than AF”)
      - Raemon 27 Feb 2020 1:08 UTC
        2 points
        Parent
        Random other things I thought about:
        I can definitely imagine “it turns out to get people involved you need more anonymity/plausible-deniability than a public forum affords”, so starting from a different vantage point is better.
        One of the options someone proposed was “CHAI, MIRI, OpenAI and Deepmind [potentially other orgs] are each sort of treated as an entity in a parliament, with N vote-weight each. It’s up to them how they distribute that vote weight among their internal teams.” I think I’d weakly prefer “actually you just really try to get more people from each team to participate, so you end up with information from 20 individuals rather than 4 opaque orgs”, but I can imagine a few reasons why the former is more practical (with the plausible deniability being a feature/bug combo)
        Rohin Shah 27 Feb 2020 16:59 UTC
        2 points
        Parent
        My current sense is that AF was aiming to be a place where people-other-than-Paul at OpenAI would feel comfortable participating.
        Agreed that that was the goal; I’m arguing that it has failed at this. (Or, well, maybe they’d be comfortable participating, but they don’t see the value in participating.)
        What are the failure modes you imagine, and/or how much harder do you think it is, to host the review on AF, while aiming for a broader base of participants that AF currently feels oriented towards?
        Mainly I think it would be really hard to get that broader base of participants. I imagine trying to convince specific people (not going to name names) that they should be participating, and the only argument that I think might be convincing to them would be “if we don’t participate, then our work will be evaluated by MIRI-rationalist standards, and future entrants to the field will forever misunderstand our work in the same way that people forever misunderstand CIRL”. It seems pretty bad to rely on that argument.
        I think you might be underestimating how different these two groups are. Like, it’s not just that they work on different things, they also have different opinions on the best ways to publish, what should count as good work, the value of theoretical vs. conceptual vs. empirical work, etc. Certainly most are glad that the other exists in the sense that they think it is better than nothing (but not everyone meets even this low bar), but beyond that there’s not much agreement on anything. I expect the default reaction to be “this review isn’t worth my time”.
        I can definitely imagine “it turns out to get people involved you need more anonymity/plausible-deniability than a public forum affords”, so starting from a different vantage point is better.
        As above, I expect the default reaction to be “this review isn’t worth my time”, rather than something like “I need plausible deniability to evaluate other people’s work”.
        One of the options someone proposed was “CHAI, MIRI, OpenAI and Deepmind [potentially other orgs] are each sort of treated as an entity in a parliament, with N vote-weight each. It’s up to them how they distribute that vote weight among their internal teams.”
        This sort of mechanism doesn’t address the “review isn’t worth my time” problem. It would probably give you a more unbiased estimate of what the “field” thinks, but only because e.g. Richard and I would get a very large vote weight. (And even that isn’t unbiased—Richard and I are much closer to the MIRI-rationalist viewpoint than the average for our orgs.)
    - Raemon 27 Feb 2020 0:59 UTC
      2 points
      Parent
      On the Givewell example:
      Some noteworthy things about Givewell is that it’s not really trying to make all nonprofits accountable to donors (since most nonprofits aren’t even ranked). It’s trying to answer a particular question, for a subset of the donor population.
      By contrast, something like CharityNavigator is aiming to cover a broad swath of nonprofits and is more implicitly claiming that all nonprofits should be more accountable-on-average than they currently are.
      It’s also noteworthy that Givewell’s paradigm is distinct from the general claims of “nonprofits should be accountable”, or utilitarianism, or other EA frameworks. Givewell is doing one fairly specific thing, which is different from what CharityNavigator or OpenPhil are doing.
      I do think CharityNavigator is an important and perhaps relevant example since they’re optimizing a metric that I think is wrong. I think it’s probably still at least somewhat good that CharityNavigator exists, since it moves the overall conversation of “we should be trying to evaluate nonprofits” forward, and creating more transparency than there used to be. I could be persuaded that CharityNavigator was net-negative though.
      I’d be pretty worried if a bunch of biology researchers had to decide which physics papers should be published. (This exaggerates the problem, but I think it does qualitatively describe the problem.)
      There’s a pretty big distinction between “decide which papers get published.” If some biologists started a journal that dealt with physics (because they thought they had some reason to believe they had a unique and valuable take on Physics And Biology) that might be weird, perhaps bad. But, it wouldn’t be “decide what physics things get published.” It’d be “some biologists start a weird Physics Journal with it’s own kinda weird submission criteria.”
      (I think that might potentially be bad, from a “affecting signal/noise ratio” axis, but also I don’t think the metaphor is that good – the only reason it feels potentially bad is because of the huge disconnect between physics and biology, and and “biologists start a journal about some facet of biology that intersects with some other field that’s actually plausibly relevant to biology” feels fine)
      - Rohin Shah 27 Feb 2020 17:06 UTC
        2 points
        Parent
        If some biologists started a journal that dealt with physics (because they thought they had some reason to believe they had a unique and valuable take on Physics And Biology) that might be weird, perhaps bad. But, it wouldn’t be “decide what physics things get published.” It’d be “some biologists start a weird Physics Journal with it’s own kinda weird submission criteria.”
        I in fact meant “decide what physics things get published”; in this counterfactual every physics journal / conference sends their submissions to biologists for peer review and a decision on whether it should be published. I think that is more correctly pointing at the problems I am worried about than “some biologists start a new physics journal”.
        Like, it is not the case that there already exists a public evaluation mechanism for work coming out of CHAI / OpenAI / DeepMind. (I guess you could look at whether the papers they produce are published in some top conference, but this isn’t something OpenAI and DeepMind try very hard to do, and in any case that’s a pretty bad evaluation mechanism because it’s evaluating by the standards of the regular AI field, not the standards of AI safety.) So creating a public evaluation mechanism when none exists is automatically going to get some of the legitimacy, at least for non-experts.