Appreciating your thoughtful comment.
It’s hard to pin down how much alignment “techniques” make models more “usable”, and how much that in turn enables further “scaling”. This ambiguity, along with the safety-washing concern, gets us into messy considerations. Though I generally agree that participants in MATS or AISC programs can cause much less harm through either than researchers working directly on aligning e.g. OpenAI’s models for release.
Our crux, though, is about the extent of progress that can be made on engineering fully autonomous machinery to control* its own effects in line with continued human safety. I agree with you that such a system can be engineered to start off performing more** of the tasks we want it to complete (i.e. progress on alignment is possible). At the same time, there are fundamental limits to controllability (i.e. progress on alignment is capped).
This is where I think we need more discussion:
Is the extent of control over AGI that is possible at least as great as the extent of control needed (to prevent eventual convergence on causing human extinction)?
* I use the term “control” in the established control-theory sense, consistent with Yampolskiy’s definition. I note this to avoid confusion, since the term gets used in more specialised ways in the alignment community (e.g. in conversations about the shutdown problem or the control agenda).
** This is a rough way of stating it. It’s also about the machinery performing fewer of the tasks we wouldn’t want the system to complete. And the relevant measure is not so much the number of preferred tasks performed as the preferred consequences. Finally, this raises the question of who the ‘we’ is whose expressed preferences the system is to act in line with, and whether coherent alignment with different persons’ preferences, expressed from within different perceived contexts, is even a sound concept.