Thanks for the thoughtful response, although I’m not quite sure about the approach. For starters, ‘aligning humans’ takes a long time, and we may simply not have time to test any proposed alignment scheme on humans if we want to avoid an AGI misalignment catastrophe. Not to mention the ethical issues and so forth.
Society has been working on human alignment for a very long time, and we’ve settled on a dual approach: (1) training the model in particular ways (e.g. parenting, schooling, etc.) and (2) a very complex, likely suboptimal system of ‘checks and balances’ to mitigate issues that arise post-training for whatever reason, e.g. the criminal justice system, national and foreign intelligence, etc. This is clearly imperfect, but it seems to maintain cohesion a lot better than earlier systems of ‘mob justice’, where if you didn’t like someone you’d just club them over the head. Unfortunately, we now also have much more destructive weapons than the clubs of the ‘mob justice’ era. Nonetheless, as of today you and I are still alive, and significant chunks of society more or less function and have continued to do so over extended periods, so the equilibrium we’re in could be a lot worse; but per my original post re. the nuclear button, it’s clear we’ve ended up on a knife edge.
Fortunately, we do have one powerful advantage with AGI alignment over human alignment: for the former we can inspect the internal state (weights etc.). Interpretability of that internal state is of course a challenge, but we nonetheless have access. (The reason human alignment requires a criminal justice system like ours, with all its complexity and failings, is largely that we cannot inspect the internal state.) So it seems AGI alignment may well be achieved through a combination of the right approach during the training phase and a subsequent ‘check’ methodology via continuous analysis of the internal state. I believe that bringing the large body of understanding/research/data we already have on human alignment during the training phase (i.e. psychology and parenting) to the AI safety researcher’s table could be very helpful. Right now, I don’t see this. Look, for example, at the open reqs posted on the OpenAI website: they are all ‘nerd’ jobs, with no experts in human behaviour. This surprises me and I don’t really understand it. If AGI alignment is as important to us as we claim, we should be more proactive about bringing experts from other disciplines into the fold when there’s a reasonable argument for it, not just more and more computer scientists.
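To make the ‘continuous check’ idea a little more concrete, here is a minimal toy sketch (my own illustration, not anyone’s actual method), assuming a small PyTorch model whose hidden activations we can read via a forward hook, plus a hypothetical probe that in practice would be trained on labelled internal states to flag ‘unaligned-looking’ ones:

```python
# Toy sketch of a 'continuous internal-state check' (illustrative only).
# Assumptions: a small stand-in model, and a hypothetical linear probe that
# would in reality be trained separately on labelled internal states.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the model under inspection (hypothetical).
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def save_activation(name):
    # Forward hook: stash the module's output so it can be inspected later.
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Register the hook so every forward pass exposes the hidden state.
model[1].register_forward_hook(save_activation("hidden"))

# Hypothetical probe; here just a random linear classifier for illustration.
probe = nn.Linear(32, 1)

def check_internal_state(x, threshold=0.0):
    """Run the model, then run the probe over the captured hidden state."""
    out = model(x)
    score = probe(captured["hidden"]).mean().item()
    flagged = score > threshold
    return out, score, flagged

out, score, flagged = check_internal_state(torch.randn(8, 16))
print(f"probe score={score:.3f}, flagged={flagged}")
```

Obviously a real system would need far more sophisticated interpretability than a linear probe, but the point stands: the internal state is there to be checked on every forward pass, which is exactly what we cannot do with humans.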