(Going to try my hand at Builder, but this is admittedly vague, so I hope you’ll help sharpen it with criticism.)
What if instead of a “reporter”, we had a “terrifier”, whose adversarial objective is to highlight the additional “sensor” whose observations, with the input and actions held constant, would, when viewed by a human, maximize the probability of a human reviewer saying the system was not performing as desired? The terrifier would be allowed to run the original predictor model “further” in order to populate whichever new components of the Bayes net are necessary to correctly simulate the output of that sensor. The terrifier could learn what facts would scare the human via multiple rounds of curiosity-driven human questioning (which can probably be done offline/beforehand). Then, we ultimately approve of the model only if the most terrifying thing the AI can show the human is less terrifying than the best-possible answers to the originally-posed questions.
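To make the proposed check concrete, here is a minimal Python sketch of the approval step. Everything in it (the SensorProposal type, the callables for the terrifier, the predictor, and the human terror score, and the baseline value) is a hypothetical placeholder of my own, not anything from the original report:

```python
# A minimal sketch of the proposed approval check; all names are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SensorProposal:
    description: str  # e.g. "a second camera inside the vault, facing the pedestal"


def approve(
    terrifier_propose: Callable[[int], List[SensorProposal]],  # the k most worrying extra sensors
    predictor_simulate: Callable[[SensorProposal], str],       # run the predictor "further" to get that sensor's reading
    human_terror_score: Callable[[str], float],                # 0 = reassuring, 1 = maximally alarming
    baseline_terror: float,                                     # worst score among the best answers to the original questions
    k: int = 10,
) -> bool:
    """Approve only if no simulated extra sensor is scarier than the baseline answers."""
    worst = 0.0
    for proposal in terrifier_propose(k):
        reading = predictor_simulate(proposal)  # input and actions held constant
        worst = max(worst, human_terror_score(reading))
    return worst <= baseline_terror
```

The important property in this sketch is that the readings come out of the predictor, not out of the terrifier; the terrifier only chooses where to look.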
To be clear, “sensor” here is standing in for any possible set of human-understandable observations about the world. Of course, if a human-understandable “sensor” is too restrictive, we can require only that the sensed state be understandable by AI-enhanced humans.
For example, say that the burglar’s first action would be to tamper with the camera. But then, there must exist some hypothetical other location at which a camera could be placed that could still see the diamond being stolen—even though that new sensor is never actually revealed to the original predictor. If there are no possible “sensor”s we could add to the environment whose results would terrify the human, that seems to me to be indistinguishable, by definition, from things just being OK. For example, if the thieves replace the diamond with an exact copy down to the nanometer but not molecular level… that’s fine? I consider that behavior aligned.
How this defeats the previous counterexample: in some sense, I see the problem with “ontology mismatch” as the human no longer knowing the right question to ask, because the AI and human no longer agree on an ontology that would make the human’s question meaningful. But answering “what question would the human be maximally upset about if they asked it?” is a pretty straightforward task to expect from an AI that’s capable of modeling (most) human preferences.
Why did I frame this as adding a “sensor” instead of proposing a new question? Because we’ve framed the “reporter” as being learned from examples. If we directly asked the reporter to tell us the most terrifying answer to a question it could think of, it would just tell us something terrifying even if it wasn’t “true” in the sense of being a “human translator”. However, new “sensors” would be simulated via the original predictor model, and thus we don’t have to rely on the translation step.
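As a toy illustration of that distinction (all objects below are hypothetical stand-ins), the reporter’s answer is free-form text we would have to trust as a translation, whereas a new sensor’s reading is computed by the same predictor we already rely on:

```python
def most_terrifying_via_reporter(reporter) -> str:
    # Untrusted: a reporter that isn't a genuine "human translator" can simply
    # say something scary, whether or not it is true in the predictor's model.
    return reporter.answer("What is the most terrifying true fact about this scenario?")


def most_terrifying_via_sensor(predictor, terrifier, observations, actions) -> str:
    # Grounded: the terrifier only chooses *where* to look; the reading itself is
    # produced by the predictor's own model, with the input and actions held fixed.
    sensor = terrifier.propose_sensor(observations, actions)
    return predictor.simulate_sensor(sensor, observations, actions)
```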
If the new concern is whether the AI can model human preferences well enough to know which human-legible sensors would terrify us, that just seems like a much easier problem. It doesn’t take a genius AI to understand that a human wouldn’t want to live in Cakey’s surveillance state.

Very interested to hear the authors’ feedback.
Thanks for your proposal! We have considered similar strategies in the past. The main points of the breaker response would be:
How are you training this “terrifier”? If the only terrifier moves available during training come from held-out sensors that you have constructed, then the “bad terrifier” you might learn doesn’t generalize to thinking of genuinely new sensors; it only ever considers the sensors that you used as ground truth during training (a toy sketch of this worry follows the next point). A sufficiently clever robber might then trick all of the sensors used during training (which may not even exist during deployment), and the diamond would be stolen without the terrifier being able to show any sensor data that would scare the human.
It’s possible that even if the human knew the results of all possible sensors, they still wouldn’t be able to tell whether the situation was fine. For instance, if the human doesn’t really understand how humans work, the way they hope to maintain control over the situation is by asking questions like “is that human the same human that I saw before?”. In that case, a terrifier might not have any sensor data it could show the human to scare them: e.g. if the terrifier showed the human a bunch of nanotech in that human’s brain, the human might not understand why that was bad.
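To spell out the first point, here is a rough sketch of the kind of training signal we have in mind; every name is a hypothetical placeholder, and the details of how the terrifier is actually trained are up to the builder:

```python
from typing import Callable, Sequence


def terrifier_training_reward(
    proposed_sensor: object,
    heldout_sensors: Sequence[object],             # the only "ground truth" sensors we constructed
    agreement: Callable[[object, object], float],  # how closely the proposal matches a held-out sensor's verdict
) -> float:
    # The training signal only ever comes from sensors *we* built and held out, so the
    # terrifier is rewarded for rediscovering those, not for inventing genuinely new
    # ones. A robber who fools everything in this pool leaves the learned terrifier
    # with nothing alarming to show at deployment time.
    return max(agreement(proposed_sensor, s) for s in heldout_sensors)
```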
For example, if the thieves replace the diamond with an exact copy down to the nanometer but not molecular level… that’s fine? I consider that behavior aligned.
This behavior seems potentially catastrophic. If all we can do is force potential adversaries to take actions that can’t be seen with the resolution of our current sensors, then e.g. we can’t prevent adversaries from corrupting code that is running on computers that we wish to protect.
How this defeats the previous counterexample: in some sense, I see the problem with “ontology mismatch” as the human no longer knowing the right question to ask, because the AI and human no longer agree on an ontology that would make the human’s question meaningful. But answering “what question would the human be maximally upset about if they asked it?” is a pretty straightforward task to expect from an AI that’s capable of modeling (most) human preferences.
I don’t really understand how this explains why your strategy defeats the previous counterexample.