It seems plausible to me that there could be non-CIS-y AIs that could nonetheless be very helpful. For example, take the approach you suggested:
(This might take the form of e.g. doing more interpretability work similar to what’s been done, at great scale, and then synthesizing/distilling insights from this work and iterating on that to the point where it can meaningfully “reverse-engineer” itself and provide a version of itself that humans can much more easily modify to be safe, or something.)
I wouldn’t be that surprised if greatly scaling the application of just current insights rapidly increased the ability of the researchers capable of “moving the needle” to synthesize and form new insights from these themselves (and an AI trained on this specific task could do that scaling without much CIS-ness). I’m curious whether this sort of thing seems plausible to both you and Nate!
Assuming that could work, it then seems plausible that you could iterate this a few times while still having all the “out of distribution” work being done by humans.
Tho I would think Nate’s too subtle a thinker to believe AI assistance is literally useless; just that most of the ‘hardest work’ is not easily automatable, which seems pretty valid.
i.e., in my reading most of the hard work of alignment is finding good formalizations of informal intuitions. I’m pretty bullish on future AI assistants helping, especially proof assistants, but this doesn’t seem to be a case where you can simply prompt GPT-3 scaled 1000x or something silly like that. I understand Nate thinks that if it could do that, it would secretly be doing dangerous things.
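To make the “formalizing informal intuitions” point concrete, here’s a minimal Lean 4 sketch of the kind of step I have in mind (purely a toy of my own, not something anyone here proposed; the whitelist framing and the name OnlyTakesAllowed are made up for illustration). The point is that nearly all the work is in choosing definitions that actually capture the intuition; once that’s done, the proof the assistant checks is often trivial.

```lean
-- Toy illustration only. Informal intuition: "an agent that only ever picks
-- actions from a whitelist never takes a non-whitelisted action."
-- The hard part is the definitions; the checked proof is then one line.

-- A policy maps observations (here just Nat) to named actions.
def OnlyTakesAllowed (allowed : List String) (policy : Nat → String) : Prop :=
  ∀ obs, policy obs ∈ allowed

-- The formal counterpart of the informal claim above.
theorem no_forbidden_actions
    (allowed : List String) (policy : Nat → String)
    (h : OnlyTakesAllowed allowed policy) (obs : Nat) :
    policy obs ∈ allowed :=
  h obs
```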
Yes, this (iterating a few times while humans handle the out-of-distribution work) seems clearly true.