Edit: I think this isn’t quite right in general; I’ll try to make it more correct later
Here’s a sketch of an approach for trying to fix Strategy: penalize depending on “downstream” variables. I’d appreciate feedback on whether it models the difficulty correctly and whether it seems worth figuring out how to implement.
It seems like the problem is:
On the training set, there are a number of implicit variables that are indistinguishable (always all true together or always all false together):
A. Is the diamond safe at time t-1
B. Is the diamond safe at time t (the variable we actually care about)
C. Do photons bounce off of the diamond at time t
D. Are photons in transit between the diamond and the camera at time t+1
E. Do photons arrive in the camera at time t+2
F. Do photons trigger the camera’s electronics at time t+3
G. Does the human correctly interpret the signal from the camera
H. Does the human believe “the diamond is safe”
…
The reporter can get low loss if it has a fast way to predict the value of any of these variables. This is fine on the training set, because the variables are all perfectly correlated.
But off the training set it might be possible to create situations where some of these variables hold and others don’t. What goes wrong in Counterexample: “what the human will believe” can be understood before “what is actually true” seems to be that E through H are true in that case while A-D are false.
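To make the failure concrete, here is a toy illustration (the dictionary representation of A-H, the reporter names, and the spoofed-camera scenario are all hypothetical, not part of the original setup): on the training set the variables are perfectly correlated, so a reporter that just tracks H matches the intended reporter for B everywhere, but the two diverge on an off-distribution case where E-H hold and A-D don’t.

```python
import random

VARIABLES = list("ABCDEFGH")

def sample_training_case():
    # On the training set, all of A-H share a single truth value.
    value = random.random() < 0.5
    return {name: value for name in VARIABLES}

def intended_reporter(case):
    # The variable we actually care about: B, "is the diamond safe at time t".
    return case["B"]

def cheap_reporter(case):
    # A reporter that just tracks H, "does the human believe the diamond is safe".
    return case["H"]

# Both reporters agree on every training case, so both get the same low loss.
train = [sample_training_case() for _ in range(1000)]
assert all(intended_reporter(c) == cheap_reporter(c) for c in train)

# Off the training set, e.g. a spoofed camera feed: E-H hold but A-D do not.
spoofed = {name: name in "EFGH" for name in VARIABLES}
print(intended_reporter(spoofed))  # False -- the diamond is not actually safe
print(cheap_reporter(spoofed))     # True  -- but the human believes it is
```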
Possible approach:
Instead, we could try to learn a reporter that returns true if and only if all relevant variables A-H are true
Then the remaining problem should just be that this reporter is overly conservative: there might be some helpful strategies that involve changing some of these variables, e.g. moving the diamond around the room (changing A) or turning off the lights in the room (changing C-H, which even causes the human to believe that the diamond isn’t safe)
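A minimal sketch of this conjunctive reporter, reusing the same toy dictionary representation of A-H (the scenarios below are hypothetical illustrations, not claims about the real setup); the last case shows the over-conservativeness just described:

```python
def conjunctive_reporter(case):
    # Answer "the diamond is safe" only if every relevant variable A-H holds.
    return all(case[name] for name in "ABCDEFGH")

# Training-style case where everything is true: reports "safe".
all_true = {name: True for name in "ABCDEFGH"}
print(conjunctive_reporter(all_true))  # True

# Spoofed camera feed (E-H true, A-D false): correctly reports "not safe".
spoofed = {name: name in "EFGH" for name in "ABCDEFGH"}
print(conjunctive_reporter(spoofed))  # False

# Overly conservative case: lights turned off, so C-H are false even though
# the diamond really is safe (A and B are true).
lights_off = {name: name in "AB" for name in "ABCDEFGH"}
print(conjunctive_reporter(lights_off))  # False, despite B being True
```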