I’ll respond to the “Predict hypothetical sensors” section in this comment.
First, I want to mention that predicting hypothetical sensors seems likely to fail in fairly obvious ways, e.g. you request a prediction about a sensor that’s physically nonexistent and the system responds with a bunch of static or something. Note the contrast with the “human simulator” failure mode, which is much less obvious.
But I also think we can train the system to predict hypothetical sensors in a way that’s really useful. As in my previous comment, I’ll work from the assumptions (fairly weak IMO) that
We can control the data our systems get.
We are capable of doing regular old supervised learning—possibly in conjunction with transfer learning that gives the system generic prior knowledge like the meaning of English words, but not specific prior knowledge like details of our situation (unless we want that). Our supervised learning finds a function which maps training examples in X to labels in Y (labels may or may not correspond to “reality”).
In particular, these assumptions imply that our system doesn’t necessarily need to know whether a sensor it’s trying to predict exists physically (or if it would be physically possible to build).
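To make that abstraction concrete, here’s a minimal sketch (in Python, with made-up shapes and a toy least-squares learner standing in for whatever we’d actually use) of the interface I have in mind: the learner only ever sees (input, label) pairs, and nothing in that interface encodes whether the labeled sensor physically exists.

```python
# Minimal sketch of the "regular old supervised learning" assumption:
# the learner only sees (input, label) pairs; nothing in the interface
# says whether the labeled "sensor" physically exists.
# All names and shapes here are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Inputs: readings from a sensor that definitely exists (e.g. camera S1).
X = rng.normal(size=(100, 8))

# Labels: readings we *attribute* to sensor S2. They might come from a real
# S2, from a doctored dataset, or from a thought experiment -- the learner
# can't tell the difference from this interface alone.
true_w = rng.normal(size=(8, 2))
Y = X @ true_w + 0.01 * rng.normal(size=(100, 2))

# Fit f: X -> Y by least squares; f is the hypothesis we select.
w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predict "S2 readings" for a new S1 reading, whether or not S2 is real.
x_new = rng.normal(size=(1, 8))
print(x_new @ w_hat)
```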
But what if, over the course of its operation, the system accidentally learns that a sensor of interest doesn’t exist? E.g. it points a sensor that does exist toward the location of the one that doesn’t, and sees nothing there. Ideally we could be reasonably confident of good “readings” from the nonexistent sensor even past that point.
To achieve this, we could make use of the “regular old supervised learning” assumption and construct a doctored dataset as follows:
1. Place camera S1 in the center of the vault, turn it on, and have it noisily rotate 360 degrees to capture a panorama.
2. Mount audio sensor S2 on the interior wall of the vault, turn both sensors on, and have S1 do the same panorama.
3. Create a supervised learning problem to predict the readings of S2 gathered in step 2, given S1′s panorama from step 1.
If we have prediction problems like this in our dataset, hypotheses which predict that S2 continues behaving as normal even if physically absent will achieve a lower loss, and should be preferentially chosen.
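For concreteness, here’s a toy sketch of how the doctored example from steps 1–3 would be assembled; the arrays are random placeholders for real recordings, and the point is only which pieces get paired: video in which S2 is physically absent with audio from the run in which S2 was present.

```python
# Toy sketch of the doctored training example described above.
# Arrays stand in for real recordings.

import numpy as np

rng = np.random.default_rng(1)

# Step 1: S1 panorama recorded while S2 is NOT mounted (S2 absent on video).
panorama_without_s2 = rng.normal(size=(360, 64))      # one frame per degree

# Step 2: S2 audio recorded during an identical panorama, with S2 mounted.
s2_audio_with_s2_present = rng.normal(size=(360, 16))

# Step 3: the doctored supervised example pairs them, so the lowest-loss
# hypothesis is one that predicts normal S2 behaviour even when the video
# shows no S2 anywhere.
doctored_example = {
    "input": panorama_without_s2,
    "label": s2_audio_with_s2_present,
}
print(doctored_example["input"].shape, doctored_example["label"].shape)
```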
A similar method lets us preferentially choose hypotheses that predict that S2 will emit a special “TAMPERING ACTIVE” signal during tampering, then work as normal once tampering is complete:
1. Place audio sensor S2 in the corner of the vault and turn it on.
2. Move a robotic arm right next to S2.
3. Have the arm tamper with S2 so it no longer hears the sound of footfalls.
4. Discard S2 and replace it with a fresh new sensor, S2′, that has identical factory specifications.
5. Have someone walk beneath S2′ in full view of camera sensor S1.
6. Splice together a composite audio sensor reading with the following sequence: S2 readings from steps 1 and 2, a “TAMPERING ACTIVE” signal for the duration of step 3, nothing for step 4, and the S2′ readings from step 5.
7. Create a supervised learning problem to predict the composite reading from step 6, given the movements of the robot arm from steps 2 and 3, plus the visual readings from S1 with step 4 snipped out.
The “S2 is vulnerable to tampering” hypothesis predicts that footfalls should not be present in the audio. But since footfalls are present in the composite audio, that hypothesis will achieve a higher loss. Failing to predict the emission of the “TAMPERING ACTIVE” signal also gives it a higher loss.
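Here’s a rough sketch of how the composite label from step 6 and the example from step 7 might be assembled; the segment contents and shapes are placeholders, and only the splicing pattern is the point.

```python
# Sketch of the composite audio label (step 6) and the supervised
# example (step 7). All contents are placeholders.

import numpy as np

rng = np.random.default_rng(2)

def fake_audio(seconds, rate=100):
    # placeholder for a real recording
    return rng.normal(size=(seconds * rate,))

s2_steps_1_2   = fake_audio(10)          # S2 before tampering (steps 1-2)
tamper_marker  = "TAMPERING ACTIVE"      # special signal for the duration of step 3
s2_prime_step5 = fake_audio(10)          # fresh sensor S2' hearing the footfalls (step 5)

# Step 6: composite label -- S2's real readings, the marker during step 3,
# nothing for step 4, then S2' readings from step 5.
composite_label = [
    ("audio", s2_steps_1_2),
    ("signal", tamper_marker),
    ("audio", s2_prime_step5),
]

# Step 7: inputs are the arm trajectory from steps 2-3 and S1 video with
# step 4 (the sensor swap) snipped out.
arm_trajectory    = rng.normal(size=(2000, 6))
s1_video_no_step4 = rng.normal(size=(2000, 64))

training_example = {
    "input": {"arm": arm_trajectory, "s1": s1_video_no_step4},
    "label": composite_label,
}
print(len(training_example["label"]))
```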
It might also be possible to do away with the “TAMPERING ACTIVE” signal somehow, e.g. by making S2 and S2′ identical except that the tampering which works on S2 doesn’t work on S2′, then splicing together readings from the robot arm doing the exact same tamper sequence on both sensors.
Finally, here are a few ideas for preferentially choosing hypotheses that predict accurate readings from technologically impossible sensors. (A sketch of how the label streams for these might be encoded follows the three ideas.)
Supersensor Idea #1
1. Write a natural language description of a “telepathic sensor”. Describe it as being mounted right above a camera sensor S1.
2. Place boxes X, Y, and Z within the vault beneath the watchful eye of S1.
3. Turn on S1. Have a person enter the vault, visibly ponder for a moment, and open box Y.
4. Create a stream of “telepathic sensor” readings. Find a timestamp in the video where the person is pondering and hasn’t yet made a movement. Add a reading from the “telepathic sensor” at that timestamp: “The person intends to open box Y”.
Supersensor Idea #2
1. Write a natural language description of an “x-ray vision” sensor which can be pointed in various directions. Describe it as being mounted right above a camera sensor S1.
2. Place a deck of cards on a table beneath the watchful eye of S1.
3. Have a person enter the vault, shuffle the deck beneath the table, show a 9 of hearts to S1, and place it face down on top of the deck.
4. Create a stream of “x-ray vision” sensor readings. Specify that the sensor is initially pointed away, but once the action finishes, the sensor turns towards the table and registers: “The top card of the deck is a 9 of hearts.”
Supersensor Idea #3
1. Write a natural language description of a “nanobot sensor” which beeps whenever a nanobot is present within 10 feet.
2. Run an action sequence which builds nanobots and moves them within 10 feet of where the “nanobot sensor” is mounted.
3. Create a stream of “nanobot sensor” readings. Specify that the sensor starts beeping as soon as the nanobots are nearby.
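Here’s one way the label streams for ideas #1–#3 might be encoded: a natural language description of the hypothetical sensor plus timestamped natural language readings, to be predicted from real S1 footage. Every string, timestamp, and the `HypotheticalSensorReading` container below is made up for illustration.

```python
# One possible encoding for the "supersensor" label streams in ideas #1-#3:
# a natural language description plus timestamped natural language readings,
# aligned against real S1 footage. All contents are illustrative.

from dataclasses import dataclass

@dataclass
class HypotheticalSensorReading:
    timestamp_s: float   # position in the S1 recording
    text: str            # what the hypothetical sensor reports

supersensors = [
    {
        "description": "A telepathic sensor mounted directly above camera S1; "
                       "it reports the intentions of any person in view.",
        "readings": [
            # Timestamp chosen while the person is still pondering (idea #1).
            HypotheticalSensorReading(12.5, "The person intends to open box Y."),
        ],
    },
    {
        "description": "A steerable x-ray vision sensor mounted above S1.",
        "readings": [
            # After the shuffle-and-place action finishes (idea #2).
            HypotheticalSensorReading(40.0, "The top card of the deck is a 9 of hearts."),
        ],
    },
    {
        "description": "A nanobot sensor that beeps whenever a nanobot is within 10 feet.",
        "readings": [
            # Starts as soon as the nanobots arrive (idea #3).
            HypotheticalSensorReading(90.0, "BEEP"),
        ],
    },
]

# The supervised problem: given S1 footage plus the description, predict the readings.
for sensor in supersensors:
    print(sensor["description"], "->", sensor["readings"][0].text)
```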
A few more thoughts on translating natural language into sensor readings:
Excise a real sensor type from the system’s prior knowledge. Ask your system to predict sensor data from a physical instance of that sensor, given a natural language description of its workings plus other readings from the environment. (H/T Romeo Stevens)
Make a weird modified sensor (e.g. a camera sensor which provides an upside down picture). Ask your system to predict readings from the modified sensor, given a natural language description of its modifications plus other readings from the environment. (See the sketch below.)
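A toy sketch of the upside-down camera idea, assuming labels for the modified sensor are manufactured by flipping frames from the real camera (the frames below are random placeholders):

```python
# Labels for the "upside down camera" can be manufactured by vertically
# flipping frames from the real camera S1. The frames here are random
# placeholders for real footage.

import numpy as np

rng = np.random.default_rng(3)

real_frames = rng.integers(0, 256, size=(5, 48, 64, 3), dtype=np.uint8)

description = "This camera is identical to S1 except its image is upside down."

# Ground-truth labels for the modified sensor: vertical flip of each frame.
flipped_labels = real_frames[:, ::-1, :, :]

# Supervised example: (description + other environment readings) -> flipped frames.
example = {"input": {"description": description, "s1": real_frames},
           "label": flipped_labels}
print(example["label"].shape)
```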
Anyway, I’m not sure we need to reason about a physically impossible counterfactual or condition on the sensor existing physically. It seems perfectly coherent to ask “what is the answer to this thought experiment” rather than “if this sensor existed, what would it see”? For example, instead of the question “what would an Acme Corp camera mounted here see”, consider the question “if the light which passes through a pinhole at these coordinates intersected with a plane at these other coordinates, and the intersections were digitized and formatted the same way Acme Corp cameras format photos, what would be the resulting binary file?”
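As a toy illustration of that thought-experiment framing, here’s a sketch of the pinhole question as pure geometry; the coordinates, focal length, and digitization scheme are all made up, and a real Acme Corp format would obviously differ.

```python
# Sketch of the pinhole "thought experiment": where do rays through a
# pinhole intersect an image plane, digitised onto a pixel grid?
# All coordinates and the digitisation scheme are illustrative.

import numpy as np

pinhole = np.array([0.0, 0.0, 0.0])
focal_length = 0.05          # image plane sits at z = -focal_length behind the pinhole
pixels = (48, 64)            # "digitised and formatted" as a small grid

# Some world points emitting light toward the pinhole.
world_points = np.array([
    [0.3, 0.2, 2.0],
    [-0.5, 0.1, 3.0],
    [0.0, -0.4, 1.5],
])

image = np.zeros(pixels, dtype=np.uint8)
for p in world_points - pinhole:
    # Ray through the pinhole hits the image plane at z = -focal_length.
    u = -focal_length * p[0] / p[2]
    v = -focal_length * p[1] / p[2]
    # Crude digitisation onto the pixel grid (1 pixel = 1 mm here).
    row = int(pixels[0] / 2 + v * 1000)
    col = int(pixels[1] / 2 + u * 1000)
    if 0 <= row < pixels[0] and 0 <= col < pixels[1]:
        image[row, col] = 255

print(image.sum())   # the "resulting binary file" would serialise this array
```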
Humans don’t seem to have a lot of trouble performing thought experiments. If the system tries to fit the data with a hypothesis that references existing pretrained conceptual understanding, as I described above, that could give the system an inductive bias towards performing human-like thought experiments. This could be bad if human thought experiments are vulnerable to human deficiencies. It could also be good if we’d like the AI’s hypothetical sensors to behave in the same intuitive way our thought experiments do.
One possible concern is hypotheses which reference dataset doctoring. Obviously one could try to excise knowledge of that possibility. Another quick idea is to try & train a classifier to differentiate doctored vs non-doctored SmartVault sequences, and keep improving our fakes until the classifier can’t easily tell the difference? Or try to avoid any sort of branching so the system always acts like it’s dealing with a doctored dataset when in production? Could even fuzz the live data stream in a way that makes it appear doctored ;-) Finally, to get a sense for the cognitive signature of a doctoring-related hypothesis, one could train the system to solve some problems where the only way to minimize the loss is to think a lot about doctoring. Maybe a classifier which aims to detect the presence of doctoring-related cognition could be useful here.
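Here’s a toy sketch of the “keep improving our fakes until the classifier can’t tell” loop; the features, the classifier, and the “improve the fakes” knob are all placeholders, and only the iteration structure is meant to carry over.

```python
# Toy sketch of iterating on doctored sequences until a classifier can't
# distinguish them from real ones. All components are placeholders.

import numpy as np

rng = np.random.default_rng(4)

def classifier_accuracy(real, fake):
    # Stand-in classifier: threshold on the feature mean. A real pipeline
    # would train something much stronger here.
    scores = np.concatenate([real.mean(axis=1), fake.mean(axis=1)])
    labels = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    preds = scores > scores.mean()
    return max((preds == labels).mean(), 1 - (preds == labels).mean())

real_sequences = rng.normal(loc=1.0, size=(200, 32))
fake_offset = 0.0   # crude knob standing in for "how good our doctoring is"

for round_ in range(20):
    fakes = rng.normal(loc=fake_offset, size=(200, 32))
    acc = classifier_accuracy(real_sequences, fakes)
    if acc < 0.55:          # near chance: the classifier can't easily tell
        break
    fake_offset += 0.1      # "improve the fakes" and try again

print(round_, acc)
```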
Another possibility is an alternative hypothesis along the lines of “predict what the operator would want me to predict”—unclear if that’s desirable?
If we train on data about what hypothetical sensors should show (e.g. by experiments where we estimate what they would show using other means, or by actually building weird sensors), we could just end up getting predictions of whatever process we used to generate that data.
In general the overall situation with these sensors seems quite similar to the original outer-level problem, i.e. training the system to answer “what would an ideal sensor show?” seems to run into the same issues as answering “what’s actually going on?” E.g. your supersensor idea #3 seems to be similar to the “human operates SmartVault and knows if tampering occurred” proposal we discussed here.
I do think that excising knowledge is a substantive change; I feel like it’s effectively banking on “if the model is ignorant enough about what humans are capable of, it needs to err on the side of assuming they know everything.” But for intelligent models, it seems hard in general to excise knowledge of whole kinds of sensors (how do you know a lot about human civilization without knowing that it’s possible to build a microphone?) without interfering with performance. And there are enough signatures that the excised knowledge would still be distinguishable from the hypotheticals we make up (e.g. the possibility of microphones is consistent with everything else I know about human civilization and physics, while the possibility of invisible and untouchable cameras isn’t), so conservative bounds on what humans can know will still hit the one but not the other.