I am excited about this. I’ve also recently been interested in ideas like nudging researchers to write 1-5 page research agendas, then collecting them and advertising the collection.
Possible formats:
A huge Google Doc (maybe based on this post); anyone can comment; there are one or more maintainers; maintainers approve ~all suggestions by researchers about their own research topics and consider suggestions from random people.
A directory of Google Docs on particular agendas; each individual doc is owned by a relevant researcher, who is responsible for maintaining it; some maintainer-of-the-whole-project occasionally nudges researchers to update their docs and reassigns a topic to someone else if necessary. Random people can make suggestions too.
(Alex, I think we can do much better than the best textbooks format in terms of organization, readability, and keeping up to date.)
I am interested in helping make something like this happen. Or if it doesn’t happen soon I might try to do it (but I’m not taking responsibility for making this happen). Very interested in suggestions.
(One particular kind-of-suggestion: is there a taxonomy/tree of alignment research directions you like, other than the one in this post? (Note to self: taxonomies have to focus on either methodology or theory of change… probably organize by theory of change and don’t hesitate to point to the same directions/methodologies/artifacts in multiple places.))

There’s also a much harder and less impartial option: an extremely opinionated survey that picks one lens to view the entire field and then describes every agenda through that lens, in terms of which particular cruxes/assumptions each agenda runs with. This would necessarily require the authors of the survey to deeply understand all the agendas they’re covering, and inevitably some agendas would receive much more coverage than others.
That makes it much harder than just stapling together a bunch of people’s descriptions of their own research agendas, and it will never be “the” alignment survey because of its opinionatedness. I still think this would have a lot of value, though: it would make it much easier to translate ideas between different lenses and notice commonalities, and would help with figuring out which cruxes need to be resolved for people to agree.
Relatedly, I don’t think alignment currently lacks different lenses (which is not to say that the different lenses are meaningfully decorrelated). What alignment lacks is convergence between people with different lenses. Some of this is because many cruxes are very hard to resolve experimentally today. But even despite that, I think we should be able to do much better than we currently do: often it’s not even clear what the cruxes are between different views, or whether two people are thinking about the same thing when they make claims in different language.
I strongly agree that this would be valuable; if not for the existence of this shallow review, I’d consider doing it myself, just so I’d have it as a reference.
FWIW, I think “deep” reviews serve a very different purpose from shallow reviews, so I don’t think you should let the existence of shallow reviews stop you from doing a deep one.
I’ve written up an opinionated take on someone else’s technical alignment agenda about three times, and each of those took me something like 100 hours. That was just to clearly state why I disagreed with it; forget about resolving our differences :)
Even that is putting it a bit too lightly.
I.e., is there even a single, bona fide, novel proof at all? One proven mathematically, or otherwise demonstrated with 100% certainty, across the last 10+ years?
Or is it all just ‘lenses’, subjective views, probabilistic analysis, etc.?
LessWrong does have a relatively fully featured wiki system. Not sure how good of a fit it is, but like, everyone can create tags and edit them and there are edit histories and comment sections for tags and so on.
We’ve been considering adding the ability for people to also add generic wiki pages, though how to make them visible and allocate attention to them has been a bit unclear.
Maybe an opt-in/opt-out “novice mode” which turns, say, the first appearance of a niche LW term in every post into a link to that term’s LW wiki page? It could be toggled in the settings, and be either on by default (with a notification on how to turn it off) or offered during sign-up, or something along those lines.
Alternatively, a button for each post which fetches the list of idiosyncratic LW terms mentioned in it, and links to their LW wiki pages?
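A minimal sketch of how the first-occurrence linking (and the “button” variant) could work, assuming a hypothetical NICHE_TERMS mapping from jargon to wiki URLs. The terms, URLs, and function names here are illustrative, not the actual LW tag database or codebase:

```python
import re

# Hypothetical mapping of niche LW terms to their wiki pages; in practice
# this would be pulled from the site's tag/wiki database.
NICHE_TERMS = {
    "mesa-optimizer": "https://www.lesswrong.com/tag/mesa-optimization",
    "corrigibility": "https://www.lesswrong.com/tag/corrigibility",
}

def link_first_occurrences(post_html: str) -> str:
    """Wrap only the first appearance of each niche term in a wiki link.

    A real implementation would also need to skip text that is already
    inside a link or an HTML attribute; this sketch ignores that.
    """
    for term, url in NICHE_TERMS.items():
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        # count=1 ensures only the first occurrence gets linked.
        post_html = pattern.sub(
            lambda m, url=url: f'<a href="{url}">{m.group(0)}</a>',
            post_html,
            count=1,
        )
    return post_html

def list_niche_terms(post_text: str) -> dict[str, str]:
    """For the 'button' variant: return the niche terms a post mentions."""
    return {
        term: url
        for term, url in NICHE_TERMS.items()
        if re.search(re.escape(term), post_text, re.IGNORECASE)
    }
```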
I’ve earlier suggested a principled taxonomy of AI safety work with two dimensions:
System level:
monolithic AI system
human—AI pair
AI group/org: CoEm, debate systems
large-scale hybrid (humans and AIs) society and economy
AI lab, not to be confused with an “AI org” above: an AI lab is an org composed of humans and increasingly of AIs that creates advanced AI systems. See Hendrycks et al.’s discussion of organisational risks.
Methodological time:
design time: basic research, math, science of agency (cognition, DL, games, cooperation, organisations), algorithms
manufacturing/training time: RLHF, curriculums, mech interp, ontology/representations engineering, evals, training-time probes and anomaly detection
deployment/operations time: architecture to prevent LLM misuse or jailbreaking, monitoring, weights security
evolutionary time: economic and societal incentives, effects of AI on society and psychology, governance.
So, this taxonomy is a 5x4 matrix, almost all slots of which are interesting, and some of which are severely under-explored.
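A minimal sketch of the 5x4 grid as a data structure, just to make the enumeration concrete. The axis labels come from the list above; the few filled cells are illustrative guesses, since the list above assigns items to only one axis at a time:

```python
# The two axes of the proposed taxonomy (labels from the list above).
SYSTEM_LEVELS = [
    "monolithic AI system",
    "human-AI pair",
    "AI group/org",
    "hybrid society and economy",
    "AI lab",
]
METHODOLOGICAL_TIMES = [
    "design time",
    "manufacturing/training time",
    "deployment/operations time",
    "evolutionary time",
]

# Illustrative cell placements only: the original list ties methods to a
# methodological time (or a system level), not to a specific cell.
EXAMPLE_CELLS = {
    ("monolithic AI system", "manufacturing/training time"): ["RLHF", "mech interp", "evals"],
    ("AI group/org", "design time"): ["CoEm", "debate systems"],
    ("hybrid society and economy", "evolutionary time"): ["governance", "economic incentives"],
}

# Enumerate all 20 slots and show which ones have example work attached.
for level in SYSTEM_LEVELS:
    for time in METHODOLOGICAL_TIMES:
        entries = EXAMPLE_CELLS.get((level, time), [])
        status = ", ".join(entries) if entries else "(open / possibly under-explored)"
        print(f"{level:30} x {time:30} -> {status}")
```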
Hi, we’ve already made a site which does this!