Security Mindset—Fire Alarms and Trigger Signatures
Series Overview and Goals
This is the second in a series of articles about applying traditional security mindset to the problems of alignment and AI research in general. As much as possible, we should try to mine the lessons from the history of security and apply them to the alignment problem. To the extent that security is orthogonal to alignment research, good security practices and leveraging existing security capabilities should still help extend timelines, hopefully long enough for alignment research to make sufficient progress. At the very least, it would be undignified for existential risk to be realized via a basic, preventable security failure.
Fire Alarms
There may be no fire alarm for imminent AGI, but what kind of useful fire alarms could we build with what we know today? What lessons can we draw from fire alarm design and use that could apply to an AGI early warning system? If we have a goal of being able to take some useful action, like leaving the building before being immolated by an unaligned superoptimizer of our own creation, then what kind of alarm features can help us achieve that?
Fire Alarm Detection Signatures and Features
Most smoke alarms work by detecting a disruption either in a flow of ionized particles (good for fast-burning, flaming fires) or in a beam of light (better for slow, smoldering fires). Some fire suppression systems are also triggered by heat. Some detectors combine photoelectric and ionization signals for even more reliable detection.
What properties can we generalize from this to our AI safety system? Multiple diverse signals are good if you don’t know exactly what type of fire to expect. Each signal type is good in its own particular domain of detection but has its own drawbacks.
Detection is automated; it does not require a human to overcome inattentional blindness. The alarm cannot be bribed or socially engineered, and it has no cognitive biases. The alarm signal is clear, unambiguous, and binary.
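To make these properties concrete, here is a minimal sketch (in Python) of how several independent detection signals could be combined into a single, unambiguous, binary alarm. The signal names, thresholds, and the idea of requiring multiple simultaneous trips are hypothetical illustrations, not proposals for specific metrics.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Signal:
    """One detection channel with its own domain and its own failure modes."""
    name: str
    read: Callable[[], float]   # returns the current measurement
    threshold: float            # level at which this channel trips

def alarm(signals: List[Signal], required_trips: int = 1) -> bool:
    """Return a single, unambiguous boolean: has the alarm fired?

    Requiring more than one tripped channel trades sensitivity for
    robustness against any single noisy or defeated sensor.
    """
    tripped = [s for s in signals if s.read() >= s.threshold]
    return len(tripped) >= required_trips

# Hypothetical wiring with made-up signal sources and thresholds:
# signals = [
#     Signal("benchmark_jump", read=lambda: 0.12, threshold=0.10),
#     Signal("compute_spike",  read=lambda: 3.0,  threshold=5.0),
# ]
# alarm(signals, required_trips=1)  # -> True (one channel tripped)
```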
Fire Alarm Response Procedure
Once we’ve decided what signatures to alert on, what should people actually do when they hear the alarm? Katja Grace has already written extensively about the psychology of alarm response. For this exercise, we’ll assume that if those responsible for responding to the alarm have participated in its creation, then most of the social-proof issues won’t apply to the response.
Most people have fire drill fatigue from over-practicing the response procedure, which is a failure mode we should try to avoid—having an alarm signal that is dutifully ignored by researchers is worse than not having one in the first place! Fire drills are often scheduled in advance, so people learn to avoid them altogether. The ritual itself loses value rapidly after the first practice—there is very little variability in a building’s escape routes or other features that would be useful to practice.
A good fire drill should either be extremely easy to practice and not disruptive, or rare enough that some benefit could still be achieved without causing too much pain. Ideally, people should look forward to practicing the response procedure, or have some positive ritual or even reward around it. Each drill should try to practice some unique scenario or response tactic.
So what would a good response actually entail? The ultimate fire alarm should cause us to hit the big red button and stop all AGI capability research. If no earlier warning systems exist, there should at least be this final option for response. Unfortunately, this button is extremely difficult to build (it is at least approximately a pivotal act). A less drastic, more incremental scheme would pair weaker alarm signals with correspondingly lighter responses, similar to a PCI compliance threshold: if you process this many transactions, then you now have to start taking these additional safety precautions.
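As a loose illustration of that graduated-threshold idea, here is a sketch mapping alarm levels to escalating required precautions, by analogy with PCI compliance tiers. The levels and precautions are placeholders invented for the example, not a proposed policy.

```python
# Hypothetical escalation ladder: stronger alarm levels supersede weaker ones.
RESPONSE_TIERS = [
    # (minimum alarm level, required precautions)
    (1, ["notify the internal safety team"]),
    (2, ["notify the safety team", "pause the triggering training run", "begin external review"]),
    (3, ["halt all capability research pending review"]),  # the "big red button"
]

def required_response(alarm_level: int) -> list:
    """Return the precautions required at the given alarm level."""
    actions = []
    for minimum_level, precautions in RESPONSE_TIERS:
        if alarm_level >= minimum_level:
            actions = precautions  # the highest matching tier wins
    return actions

# required_response(2)
# -> ['notify the safety team', 'pause the triggering training run', 'begin external review']
```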
Fire Alarm Maintenance and Testing
Many smoke alarms require battery replacement and have a prominent button used for testing. This regular maintenance schedule helps ensure that the detection mechanism is still working properly and hasn’t failed silently. (Most smoke alarm owners can attest that smoke alarms usually DON’T fail silently—their death rattle is an incessant, unignorable chirping.) Some jurisdictions require independent third-party testing, even for home/personal devices.
The test procedure to ensure proper functioning should be simple, prominent, easy for any person to initiate, non-disruptive, and easily auditable.
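As one hypothetical shape such a test could take, the sketch below injects a clearly labeled synthetic trigger, checks that the detection pipeline fires, and appends the result to an append-only audit log. The function names, file path, and record fields are assumptions made for illustration.

```python
import json
import time

AUDIT_LOG = "alarm_self_tests.jsonl"  # hypothetical append-only audit trail

def self_test(detector, initiated_by: str) -> bool:
    """Run the detector against a synthetic test signal and log the outcome.

    The detector is assumed to accept a test_signal flag so the drill never
    masquerades as a real alarm.
    """
    fired = detector(test_signal=True)
    record = {
        "timestamp": time.time(),
        "initiated_by": initiated_by,
        "result": "pass" if fired else "FAIL",
    }
    with open(AUDIT_LOG, "a") as log:
        log.write(json.dumps(record) + "\n")
    return fired

# Anyone should be able to kick this off:
# self_test(lambda test_signal: True, initiated_by="any team member")
```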
Fire Alarm Adversaries
The major difference in our AGI fire alarm metaphor is that fire alarms aren’t generally designed to detect fires that people are explicitly trying to start. Signal detection and response look very different when the adversary is an intelligent one that may be actively trying to defeat your security control, rather than a passive, naturalistic, unthinking one.
One exception is the airplane bathroom smoke detector—this alarm does have to survive a fire that someone is trying to start intentionally. In that case, federal law prohibits tampering with or disabling the detector, backed by heavy fines. Similar regulation could apply to AI safety controls, though this approach has had mixed success across regulated industries such as pollution standards, financial instruments, and biotech. We certainly shouldn’t rely on regulation and fines alone to ensure the proper functioning of the detection apparatus.
Next Steps
This examination has raised more questions than it has answered, but it hopefully suggests directions for further discussion, research, and action. These are candidates for future posts in the series:
Fire alarm measurements—Peter McCluskey has already proposed a good number of alarm scenarios as varied signals. This is a great start. The list could be expanded with additional brainstorming, and filtered through the lenses of what makes a good fire alarm and what lead researchers are likely to agree is a significant enough signal to change behavior.
Regulatory schemes—AI legislation is being drafted in the US and EU—is it effective at addressing existential risks? Does it change the incentive structure enough to drive sufficient behavior changes? Does it simply shift the risks toward research in other jurisdictions?
Industry self-regulation - PCI DSS (a self-regulatory framework) was arguably responsible for more significant changes in corporate security behavior than any other traditional regulatory initiative. What would an industry self-regulation standard look like?
Incentive models—What incentive structures could we put in place to change research and corporate behavior? Could these incentives be influenced by private entities or does that require collective / democratic action for moral or practical reasons?