Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things.
Regarding making safe use of a superintelligent AGI-system, here are some things that could possibly help:
1. Working on making it so that the first superintelligent AGI is robustly aligned from the beginning.
2. Working on security measures so that, if we didn't do #1 (or thought we did, but didn't), the AGI will be unable to "hack" itself out in some digital way (e.g. by exploiting an OS flaw and gaining internet access).
3. Developing and preparing techniques/strategies so that, if we didn't do #1 (or thought we did, but didn't), we can obtain help from various instances of the AGI-system that gets us towards a more and more aligned AGI-system, while (1) minimizing the causal influence of the AGI-systems and the ways they might manipulate us and (2) making requests in such a way that we can verify that what we get is what we actually want, leveraging the fact that verifying a system is often much easier than building it (see the sketch after this list).
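To make the verification part of #3 a bit more concrete, here is a minimal, purely illustrative Python sketch. The `query_agi_instance` stub and the factoring task are hypothetical choices of mine (a real setup would involve a boxed, far more capable system and domain-specific verifiers); the point is only the shape of the loop: the untrusted system proposes, much simpler trusted code verifies, and nothing gets accepted that the verifier can't confirm.

```python
import random


def query_agi_instance(prompt: str) -> str:
    """Hypothetical stand-in for a (boxed) AGI instance.

    Here it just guesses random factors so the script runs end to end;
    in the scenario above it would be an untrusted, far more capable
    system whose output we refuse to trust directly.
    """
    n = int(prompt.split()[-1])
    a = random.randint(2, n - 1)
    return f"{a} {n // a}"


def verify_factorisation(n: int, answer: str) -> bool:
    """Trusted, simple verifier: checking a factorisation is easy even
    when finding one is hard."""
    try:
        a, b = (int(x) for x in answer.split())
    except ValueError:
        return False
    return a > 1 and b > 1 and a * b == n


def ask_with_verification(n: int, attempts: int = 1000):
    """Accept an answer only if the trusted verifier approves it.

    No information about *why* an answer was rejected flows back to the
    untrusted system, which is one crude way of limiting its causal
    influence on us.
    """
    for _ in range(attempts):
        answer = query_agi_instance(f"Give two non-trivial factors of {n}")
        if verify_factorisation(n, answer):
            return answer
    return None


if __name__ == "__main__":
    print(ask_with_verification(15))  # e.g. "3 5", or None if unlucky
```

The asymmetry being leveraged is that checking a candidate factorisation is trivial even when producing one is hard; the same pattern would apply to any request whose answers we can check with code (or reasoning) much simpler than the system that produced them.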
#2 and #3 seem to me worth pursuing in addition to #1, but not instead of #1. Rather, #2 and #3 could serve as additional layers of alignment-assurance.
I do think "Clever Ways To Avoid Doing Hard Things" alludes to genuine failure modes, but I think there may also be failure modes in encouraging "everyone" to work only on "The Hard Things" in a direct way (without people also looking for potential workarounds and additional layers of alignment-assurance).
Also, consider the case where someone comes up with alignment methodologies for an AGI that don't seem robust or fully safe, but do seem to have a decent chance of working in practice. Such methodologies may be bad ideas if used as "the solution", but if we have a "system of systems", where some of the sub-systems are themselves AGIs that we have attempted to align with different alignment methodologies, then we can e.g. check whether the outputs from these different sub-systems converge (a rough sketch of such a check follows below).
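As a toy illustration of what "see if the outputs converge" might mean in practice, here is a small Python sketch of an agreement gate. The `SubSystem` type, the unanimity threshold, and the exact-string comparison are all simplifying assumptions on my part (real outputs would need a much more careful notion of "the same answer"); this is a sketch of the idea, not a proposal for how to actually implement it.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical type: each sub-system is "an AGI aligned via a different
# methodology", represented here as a plain function from prompt to answer.
SubSystem = Callable[[str], str]


def convergent_answer(
    subsystems: Sequence[SubSystem],
    prompt: str,
    min_agreement: float = 1.0,
) -> Optional[str]:
    """Return an answer only if enough sub-systems independently agree on it.

    With min_agreement=1.0 we require unanimity; anything below the
    threshold is treated as disagreement and returns None, which in the
    scenario above would mean escalating to human review rather than
    acting on any of the outputs.
    """
    answers = [s(prompt).strip().lower() for s in subsystems]
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return answer
    return None


if __name__ == "__main__":
    # Toy stand-ins for differently-aligned sub-systems.
    subsystems = [
        lambda p: "Paris",
        lambda p: "paris",
        lambda p: "Paris",
    ]
    print(convergent_answer(subsystems, "What is the capital of France?"))
```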
Sincerely, someone who does not call himself an alignment researcher, but who does self-identify as a "hobbyist alignment theorist", and who is working on a series where much of the focus is on Clever Techniques/Strategies That Might Work Even If We Haven't Succeeded At The Hard Things (and thus might provide additional layers of alignment-assurance).