Thanks for the reply! I’ll respond to the “Hold out sensors” section in this comment.
One assumption which seems fairly safe in my mind is that as the operators, we have control over the data our AI gets. (Another way of thinking about it is if we don’t have control over the data our AI gets, the game has already been lost.)
Given that assumption, this problem seems potentially solvable. First, in response to this bit:
Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get a hint about where I may have placed them. If this is possible, then “Predict the readings of all sensors” need not be much more complex than “Predict the readings of one sensor.”
If the SmartVault learns a policy from data which was all generated prior to the installation of the new sensors, it seems unlikely that policy would deliberately account for the existence of those specific new sensors. We could leave info about your past actions out of the dataset as well.
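To make that concrete, here's roughly the kind of curation I have in mind, as a minimal Python sketch (the Episode fields, the install date, and the helper are all hypothetical placeholders on my part, not anything from the SmartVault setup as described):

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Any, List, Optional

# Hypothetical episode record; the field names are illustrative.
@dataclass
class Episode:
    recorded_at: datetime        # when this trajectory was generated
    actions: List[Any]           # SmartVault action sequence A
    s1_readings: List[Any]       # readings from the known sensor S1
    operator_log: Optional[str]  # any record of what we, the operators, did

S2_INSTALL_DATE = datetime(2023, 1, 1)  # placeholder: when the held-out sensors went in

def curate_policy_dataset(episodes: List[Episode]) -> List[Episode]:
    """Keep only episodes generated before the new sensors existed, and strip
    operator-action logs that might hint at where the sensors were placed."""
    curated = []
    for ep in episodes:
        if ep.recorded_at >= S2_INSTALL_DATE:
            continue                                    # generated after installation: drop it
        curated.append(replace(ep, operator_log=None))  # leave our past actions out
    return curated
```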
An alternative approach in response to this bit:
I’m concerned that I will learn a better version of the human simulator which predicts the readings of all sensors and then outputs what a human would infer from the complete set.
The scenario is: we’re learning a function F1(A, S1) → D where A is an action sequence, S1 is readings from the known sensor, and D is a diamond location. Previously we’ve discussed two functions which both achieve perfect loss on our training data:
D1(A, S1) -- a direct translator which takes A and S1 into account
H1(A, S1) -- a simulation of what a human would believe if they saw A and S1
Let’s also consider two other functions:
D2(A, S1, S2) -- a direct translator which takes A, S1, and S2 into account
H2(A, S1, S2) -- a simulation of what a human would believe if they saw A, S1, and S2
Your concern is that there is a third function on the original (A, S1) domain which also achieves perfect loss:
H1′(A, S1) = H2(A, S1, P_S2(A, S1)) -- defining P_S2 as a prediction of S2's readings given A & S1, we have H1′ as a simulation of what a human would believe if they saw A, S1, and readings for S2 predicted from A & S1.
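To keep the notation straight, here's a minimal sketch of that composition (illustrative Python; the type aliases and the factory function are mine, not part of the setup above):

```python
from typing import Any, Callable

# Placeholder type aliases; in practice these would be tensors / structured observations.
Action = Any    # action sequence A
Reading = Any   # sensor readings
Belief = Any    # reported diamond location D

def make_H1_prime(H2: Callable[[Action, Reading, Reading], Belief],
                  P_S2: Callable[[Action, Reading], Reading]
                  ) -> Callable[[Action, Reading], Belief]:
    """Build the worrying third function on the original (A, S1) domain:
    predict S2's readings from (A, S1), then report what a human would
    believe if they saw A, S1, and those predicted readings."""
    def H1_prime(A: Action, S1: Reading) -> Belief:
        predicted_S2 = P_S2(A, S1)
        return H2(A, S1, predicted_S2)
    return H1_prime
```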
Why would it be bad if gradient descent discovered H1′? Because then, when it comes time to learn a policy, we would incentivize policies which fool the predicted readings for S2 in addition to fooling S1.
Here’s an idea for obtaining a function on the original (A, S1) domain which does not incentivize policies which deceive S2:
Learn a function F2 on the expanded domain (A, S1, S2), using a training set which is carefully constructed so that the only way to achieve perfect loss is to do a good job of taking readings from S2 into account. (For example, deliberately construct scenarios where the readings from S2 are not what you would expect if you were only looking at A and S1, and make ignoring A & S1 in favor of S2 key to labeling those scenarios correctly.) F2 could end up closer to either D2 or H2; I don't think it matters much.
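Here's roughly what I mean by "carefully constructed", as a toy sketch. I'm collapsing sensor readings to booleans meaning "does this sensor show the diamond?", which is purely an illustrative encoding:

```python
import random

def make_f2_dataset(n: int = 1000, seed: int = 0):
    """Toy training set for F2 on the expanded (A, S1, S2) domain.

    In roughly half the examples S2 deliberately contradicts what A and S1
    would suggest, and the label always follows S2 -- so a function which
    ignores S2, or infers it from (A, S1), cannot achieve perfect loss.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        A = [rng.randint(0, 3) for _ in range(5)]  # stand-in action sequence
        S1 = rng.random() < 0.5                    # known sensor: diamond visible?
        consistent = rng.random() < 0.5
        S2 = S1 if consistent else (not S1)        # half the time S2 contradicts A & S1
        label = S2                                 # ground truth is keyed to S2
        data.append(((A, S1, S2), label))
    return data
```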
Define a function F1_only(A, S1) = F2(A, S1, <hardcoded stream of boring S2 sensor data>).
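In code, that wrapper might look something like this (a sketch; BORING_S2 is a placeholder for the hardcoded stream of unremarkable readings, whose exact form would depend on how S2's data is encoded):

```python
BORING_S2 = None  # placeholder for "nothing interesting is happening at S2"

def make_F1_only(F2, boring_s2=BORING_S2):
    """Restrict a learned F2(A, S1, S2) to the original (A, S1) domain by
    pinning its S2 argument to a fixed, boring stream of readings."""
    def F1_only(A, S1):
        return F2(A, S1, boring_s2)
    return F1_only
```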
Now let’s use F1_only as the target for learning our policy. I argue a policy has no incentive to deceive S2, because we know that F2 has been optimized to trust its S2 argument over its A and S1 arguments regarding what is going on around S2, and when we call F2 through F1_only, its S2 argument will always be telling it there are no interesting readings coming from S2. So, no bonus points for a policy which tries to fool S2 in addition to S1.
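As a usage sketch (the reward scheme and the truthy "diamond still present" output are simplifications I'm making up for illustration):

```python
def policy_reward(F1_only, A, S1) -> float:
    """Score an episode by what F1_only reports about the diamond.

    Because F2 was trained to trust its S2 argument about events near S2, and
    F1_only always feeds it boring S2 readings, a policy gains nothing by
    tampering with (or simulating tampering with) S2."""
    return 1.0 if F1_only(A, S1) else 0.0
```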
Maybe there is some kind of unintended consequence to this weird setup; I just came up with it and it’s still a bit half-baked in my mind. (Perhaps you could make some kind of exotic argument on the basis of inner optimizers and acausal trade between different system components?) But the meta point is there’s a lot of room for creativity if you don’t anthropomorphize and just think in terms of learning functions on datasets. I think the consequences of the “we control the data our AIs get” assumption could be pretty big if you’re willing to grant it.