I think there are quite a lot of worlds where understanding the black box better is bad.
If alignment is really, really hard then we should expect to fail, in which case the more obvious it is that we’ve failed, the better, since the benefits of safety aren’t fully externalised. This probably doesn’t hold in worlds where we go from not-very-good AI to AGI very rapidly.
Potentially counterintuitive things happen when information gets more public. In this paper, https://www.fhi.ox.ac.uk/wp-content/uploads/Racing-to-the-precipice-a-model-of-artificial-intelligence-development.pdf, increasing information has weird non-linear effects on the amount spent on safety. Part of the intuition is that having more information about your competitors can cause you to either speed up or slow down, depending on where they actually are relative to you.
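To make that non-linearity concrete, here is a minimal toy simulation. The functional forms, function names and numbers are my own illustrative assumptions, not the paper’s actual model: two teams pick how much to spend on safety, safety costs speed, and the chance of disaster depends on how much safety the eventual winner kept.

```python
import random

# Toy sketch only: the setup below is my own assumption, not the model in
# the paper. Two teams race; each picks a safety fraction s in [0, 1].
# Safety costs speed (effective speed is capability * (1 - s)), the faster
# team deploys first, and the chance of disaster is 1 - s_winner.

def informed_safety(own, rival):
    """Seeing the gap, shave safety just enough to stay ahead."""
    if own <= rival:
        return 0.0                                   # behind: go flat out
    return max(0.0, 1 - rival / own - 1e-6)          # ahead: keep a thin lead

def uninformed_safety(own, rival):
    """Gap unknown: pick a fixed compromise level."""
    return 0.5

def risk_by_gap(policy, trials=200_000, bins=5):
    """Average disaster chance, bucketed by the capability gap."""
    totals, counts = [0.0] * bins, [0] * bins
    for _ in range(trials):
        a, b = random.random(), random.random()      # capabilities
        sa, sb = policy(a, b), policy(b, a)
        winner_s = sa if a * (1 - sa) >= b * (1 - sb) else sb
        k = min(bins - 1, int(abs(a - b) * bins))    # which gap bucket
        totals[k] += 1 - winner_s
        counts[k] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

print("informed  :", [round(r, 2) for r in risk_by_gap(informed_safety)])
print("uninformed:", [round(r, 2) for r in risk_by_gap(uninformed_safety)])
# Informed teams come out safer than uninformed ones when the gap is large
# (the leader can afford lots of safety) and riskier when the race is close,
# so information about competitors cuts both ways.
```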
Risk preferences also seem important here. If people are risk averse, then, all else equal, having less information about the expected outcomes of their models makes them less likely to deploy them.
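A tiny numeric sketch of that point, with a log utility and payoff numbers invented purely for illustration:

```python
import math

# Made-up numbers purely for illustration. A risk-averse lab (log utility)
# compares deploying a model it understands well against one it understands
# poorly. Both have the same expected payoff; only the spread differs.

def expected_utility(payoffs, probs):
    return sum(p * math.log(x) for x, p in zip(payoffs, probs))

well_understood   = expected_utility([9, 11], [0.5, 0.5])   # low variance
poorly_understood = expected_utility([1, 19], [0.5, 0.5])   # high variance

print(round(well_understood, 3), round(poorly_understood, 3))
# Same mean payoff (10), but the poorly understood model has lower expected
# utility. With an outside option worth, say, log(9.5), the lab deploys the
# well-understood model and shelves the opaque one.
```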
I think I’m most excited about 15, 16 and 6b, because of a general worldview that 1) alignment is likely to be really hard, and it seems like we’ll need assistance from the best aligned systems to solve the problem, and 2) ~all the risk comes from RL agents. Getting really, really good microscope AI looks very good from this perspective, and potentially we need a co-ordinated movement towards microscope AI and away from RL models, in which case building a really compelling case for why AGI is dangerous looks really important.
Note that for interpretability to give you information on where you are relative to your competitors, you need both the tools to exist and AI companies to use them and publicly release the results. It’s pretty plausible to me that we get the first but not the second!
Yeah, that sounds very plausible. It also seems plausible that we get regulation around transparency, and in all the cases where the benefit from interpretability comes from people interacting with each other, the results end up being released at least semi-publicly. Industrial espionage also seems a worry: the USSR was hugely successful in infiltrating the Manhattan Project and continued to successfully steal US tech throughout the Cold War.
Also worth noting that more information about how good one’s own model is also increases AI risk in the paper’s model, although they model it as a discrete shift from no information to full information, so it’s unclear how well that model applies.