Somebody said they were skeptical that this would avoid the sharp left turn.
I should have said this more explicitly: the idea is that this avoids the sharp left turn if you can develop deep enough intuitions about the system. You can then use these intuitions to “do science” on the system and figure out how to iteratively make it more and more aligned, not just by running empirical experiments but by building up good models of the system. And at each step, you can use these intuitions to verify that your alignment solution generalizes. That is the target.
These models are probably not just made up of intuitions; ultimately we want formal/mathematical models. However, I expect the hard part is getting the system into a form where it is easy to develop deep intuitions about it. Once we have that, I expect building formal models on top of those intuitions will be much easier than getting the system into that intuition-friendly form in the first place.
The core claim seems reasonable and worth testing, though I’m not very hopeful that it will reliably scale through the sharp left turn.
My guess is that the intuitions don’t hold in the new domain, and that radical superintelligence requires intuitions you can’t develop on relatively weak systems. But it’s a source of data for our models of intuition-building, which might help with other things, so it seems reasonable to attempt.