My problem with the “nobody cares” model is that it seems self-defeating. First, if nobody cares about my work, then how would my work help with alignment? I don’t put a lot of stock in building aligned AGI in the basement on my own. (And not only because I don’t have a basement.) Therefore, any impact I will have flows through my work becoming sufficiently well-known that somebody who builds AGI ends up using it. Even if I optimistically assume that I will personally be part of that project, my work needs to be sufficiently well-known to attract the money and talent that make such a project possible.
Second, I also don’t put a lot of stock in solving alignment all by myself. Therefore, other people need to build on my work. In theory, this only requires it to be well-known in the alignment community. But to improve our chances of solving the problem, we need to make the alignment community bigger. We want to attract more talent, much of which is found in the broader computer science community. This is in direct opposition to preserving the conditions for “nobody cares”.
Third, a lot of people are motivated by fame and status (myself included). Therefore, bringing talent into alignment requires fame and status to be achievable inside the field. This is obviously also in contradiction with “nobody cares”.
My own thinking about this is: yes, progress on the problems I’m working on can contribute to capability research, but the overall chance of success on the pathway “capability advances driven by theoretical insights” is higher than on the pathway “capability advances driven by trial and error”, even if the former leads to AGI sooner, and especially if those theoretical insights are also useful for alignment. I certainly don’t want to encourage the use of my work to advance capabilities, and I try to discourage anyone who will listen, but I accept the inevitable risk of that happening in exchange for the benefits.
Then again, I’m by no means confident that I’m thinking about all of this in the right way.
Our work doesn’t necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc., are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.
I do agree that a growing alignment community will add memetic fitness to alignment work in general, which is at least somewhat problematic for the “nobody cares” model. And I do expect there to be at least some steps which need a fairly large alignment community doing “normal” (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread—e.g. in the case of incremental interpretability/ontology research, it’s mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.
> Our work doesn’t necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc., are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.
That’s a valid argument, but I can also imagine groups that (i) in a world where alignment research is obscure, proceed to create unaligned AGI, yet (ii) in a world where alignment research is famous, use this research when building their AGI. Maybe any such group would be operationally inadequate anyway, but I’m not sure. More generally, it’s possible that in a world where alignment research is a well-known, respectable field of study, more people take AI risk seriously.
> ...I do expect there to be at least some steps which need a fairly large alignment community doing “normal” (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread—e.g. in the case of incremental interpretability/ontology research, it’s mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.
I think I have a somewhat different model of the alignment knowledge tree. From my perspective, the research I’m doing is already paradigmatic. I have a solid-enough paradigm, inside which there are many open problems, and what we need is a bunch of people chipping away at these open problems. Admittedly, the size of this “bunch” is still closer to 10 people than to 1000 people, but (i) it’s possible that the open problems will keep multiplying hydra-style, as often happens in math, and (ii) memetic fitness would help get the very best 10 people to do it.
It’s also likely that there will be a “phase II” in which the nature of the necessary research becomes very different (e.g. it might involve combining the new theory with neuroscience, experimental ML research, or hardware engineering), and a successful transition to this phase might require getting a lot of new people on board, which would also be a lot easier given memetic fitness.