Focusing on Mal-Alignment
Typically, “alignment” is defined as something like “systems that pursue the objectives intended by their creators,” and agents that are not aligned as “systems that pursue objectives other than the ones intended by their creators.” To be clearer, though, throughout this article I’ll use the term “mal-alignment” to mean “systems that pursue objectives other than the ones humanity, broadly, would want or find desirable.”
Mal-alignment is much more subjective because it relies on agreeing about what humanity values, but most people can accept some core values, like those enshrined in countries’ constitutions or in Maslow’s hierarchy of needs. To avoid an existential catastrophe that could wipe out humanity or lead to society’s collapse, it is not enough to train agents so that they always work toward their creators’ goals. To take a single example, someone in the mold of Ted Kaczynski, the now-deceased mathematics prodigy, will one day be more than capable of creating a well-aligned AI agent. Yet that agent’s goals, faithfully representing its creator’s, could be directly opposed to humanity’s goal of species survival.
Unfortunately, given the number of people in the world and the increasing accessibility of the technical systems used to build AI agents, it seems inevitable that such a malevolent system will eventually be created. Because of this, I believe our engineers and policymakers should focus on how to verify and monitor AI systems. Several sub-topics need research, policy outreach, implementation, and governance:
Checking digital signatures and incorporating them into a verification protocol (see the sketch after this list)
Monitoring the production of ML-capable chips
Monitoring training data set downloads for nefarious training data
A database of questions that could be used to confirm a system is not mal-aligned
Requirements for neural networks to be “interpretable” in their intermediate layers, possibly establishing a protocol for interrogation
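To make the first item more concrete, here is a minimal sketch of the kind of check a verification protocol could run: confirming that a released model artifact carries a valid digital signature from a registered publisher before it is deployed or audited. This is an illustrative assumption rather than an existing standard; the Ed25519 scheme, the file names, and the idea of a publisher key registry are all stand-ins.

```python
# Minimal sketch: verify that a model artifact was signed by a known publisher key.
# File names and the "registry key" are hypothetical; any real protocol would be
# far more involved (key distribution, revocation, signing of training metadata, etc.).

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature


def verify_model_artifact(model_path: str, signature_path: str,
                          publisher_key_bytes: bytes) -> bool:
    """Return True if the artifact's signature matches the publisher's public key."""
    public_key = Ed25519PublicKey.from_public_bytes(publisher_key_bytes)

    with open(model_path, "rb") as f:
        artifact = f.read()
    with open(signature_path, "rb") as f:
        signature = f.read()

    try:
        # Raises InvalidSignature if the artifact was tampered with or unsigned.
        public_key.verify(signature, artifact)
        return True
    except InvalidSignature:
        return False


# Hypothetical usage:
# ok = verify_model_artifact("model.safetensors", "model.sig", registry_key_bytes)
```

In practice, the publisher key would have to come from a governed registry, and the signature would ideally cover the training data and configuration that produced the weights, not just the final artifact. That registry is exactly the kind of policy infrastructure the list above is pointing at.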