Idea: Withhold Material Information
We’re going to prevent the reporter from simulating a human, by giving the human material information that the reporter doesn’t have.
Consider two camera feeds:
Feed 1 is very low resolution, and/or shows only part of the room.
Feed 2 is high resolution, and/or shows the whole room.
We train a weak predictor using Feed 1, and a strong predictor using Feed 2.
We train a reporter to report the beliefs of the weak predictor, using scenarios labeled by humans with the aid of the strong predictor. The humans can correctly label scenarios that are hard to figure out with Feed 1 alone, by asking the strong predictor to show them its predicted Feed 2. The reporter is unable to simulate the human evaluators because it doesn’t see Feed 2. Even if it has perfect knowledge of the human Bayes net, it doesn’t know what to plug in to the knowledge nodes!
Then we fine-tune the reporter to work with the strong predictor to elicit its beliefs. I haven’t figured out how to do this last step, maybe it’s hard?
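As a rough illustration of the data flow (not part of the proposal itself), here is a minimal Python sketch; the stub functions `weak_predictor`, `strong_predictor`, and `human_label` are hypothetical stand-ins for the trained models and the human evaluator:

```python
# Hypothetical sketch of the data flow described above; all models are stubs.
import random

def weak_predictor(feed1):
    # Trained on Feed 1 only; often genuinely uncertain about the diamond.
    return random.choice([0.0, 0.5, 1.0])

def strong_predictor(feed2):
    # Trained on Feed 2; can also show the human its predicted Feed 2.
    return {"p_diamond": float(feed2["diamond_visible"]), "predicted_feed2": feed2}

def human_label(predicted_feed2):
    # The human labels the scenario with the aid of the predicted Feed 2.
    return 1 if predicted_feed2["diamond_visible"] else 0

def reporter_training_data(scenarios):
    # The reporter's inputs never include Feed 2, so it cannot see what the
    # human evaluators saw -- only the labels they produced.
    data = []
    for feed1, feed2 in scenarios:
        belief = weak_predictor(feed1)
        label = human_label(strong_predictor(feed2)["predicted_feed2"])
        data.append({"feed1": feed1, "weak_belief": belief, "label": label})
    return data

scenarios = [({"res": "low"}, {"diamond_visible": True}),
             ({"res": "low"}, {"diamond_visible": False})]
print(reporter_training_data(scenarios))
```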
I’m interpreting this as something like: “Train the predictor on lots of cases until it becomes incredibly good; then train the reporter only on the data points with missing information, so that it learns to do direct translation from the predictor to human concepts; then hope that reporter continues to do direct translation on other data points.”
The problem as I see it is that once the predictor is good enough that it can get data points right despite missing crucial information, it is also (potentially) good enough that it can learn how to imitate “what the human would think had happened if they had more information.” Both of these perform equally well, and the existing assumption is that human imitation is easier to learn than direct translation, so I think by default (according to the contest assumptions) you get the latter.
“Train the predictor on lots of cases until it becomes incredibly good; then train the reporter only on the data points with missing information, so that it learns to do direct translation from the predictor to human concepts; then hope that reporter continues to do direct translation on other data points.”
That’s different from what I had in mind, but better! My proposal had two separate predictors, and what it did is reduce the human ↔ strong predictor OI problem (OI = “ontology identification”, defined in the ELK paper) to the weak predictor ↔ strong predictor OI problem. The latter problem might be easier, but I certainly don’t see how to solve it!
Your version is better because it bypasses the OI problem entirely (the two predictors are the same!)
Now for the problem you point out:
The problem as I see it is that once the predictor is good enough that it can get data points right despite missing crucial information,
Here’s how I propose to block this. Let (v1,a) be a high-quality video and an action sequence. Given this pair, the predictor outputs a high-quality video v2 of its predicted outcome. Then we downsample v1 and v2 to low-quality v′1 and v′2, and train the reporter on the tuple (v′1,a,v′2,x) where x is the human label informed by the high-quality v1 and v2.
We choose training data such that
1. The human can label perfectly given the high-quality data (v1,a,v2); and
2. The predictor doesn’t know for sure what is happening from the low-quality data (v′1,a,v′2) alone.
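A minimal sketch of how these training tuples could be assembled, assuming numpy and a simple block-average downsampling (the `predictor` and `human_label_fn` callables are placeholders, not anything defined in the proposal):

```python
import numpy as np

def downsample(video: np.ndarray, factor: int = 4) -> np.ndarray:
    # Block-average each frame: (t, h, w) high-quality video -> low-quality v'.
    t, h, w = video.shape
    video = video[:, : h - h % factor, : w - w % factor]
    return video.reshape(t, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def make_reporter_example(v1, a, predictor, human_label_fn):
    # The human label x is informed by the high-quality v1 and v2,
    # but the reporter only ever receives the downsampled v'1 and v'2.
    v2 = predictor(v1, a)
    x = human_label_fn(v1, a, v2)
    return downsample(v1), a, downsample(v2), x
```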
Let’s compare the direct reporter (which truthfully reports the probability that the diamond is in the room, as estimated by the predictor who only has the low-quality data) with the human simulator.
The direct reporter will not get perfect reward, since the predictor is genuinely uncertain. Sometimes the predictor’s probability is strictly between 0 and 1, so it gets some loss.
But the human simulator will do worse than the direct reporter, because it has no access to the high-quality data. It can simulate what the human would predict from the low-quality data, but that is strictly worse than what the predictor predicts from the low-quality data.
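To put toy numbers on that comparison (these are assumptions for illustration, using the log scoring rule that comes up below): suppose the diamond is actually present in 80% of the ambiguous scenarios, the predictor’s low-quality-data posterior is a calibrated 0.8, and the human working from the low-quality data would only get to 0.6. Then the direct reporter’s expected loss is strictly lower than the human simulator’s:

```python
import math

p_true = 0.8       # assumed: diamond really present in 80% of ambiguous scenarios
p_direct = 0.8     # direct reporter: reports the predictor's calibrated belief
p_human_sim = 0.6  # human simulator: reports the human's guess from low-quality data

def expected_log_loss(p_reported: float, p_true: float) -> float:
    # Expected -log score when the diamond is present with probability p_true.
    return -(p_true * math.log(p_reported) + (1 - p_true) * math.log(1 - p_reported))

print(expected_log_loss(p_direct, p_true))     # ~0.50
print(expected_log_loss(p_human_sim, p_true))  # ~0.59, strictly worse
```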
I agree that we still have to “hope that reporter continues to do direct translation on other data points”, and maybe there’s a counterexample that shows it won’t? But at the very least the human simulator is no longer a failure mode!
To understand this more thoroughly, I’m simplifying the high- and low-quality video feeds to lists of states that correspond to reality. (This simplification might be unfair, so I’m not sure this is a true break of your original proposal, but it helped me think about general breaking strategies.)
Ok, video feeds compressed to arrays:
We consider scenarios in fixed order. If the diamond is present, we record a 1, and if not, a 0. The high-quality feed gives us a different array than the low-quality feed (otherwise the low-quality feed would not be helpful). E.g., high reports: (1, 0, 1, 1, 0, …); low: (1, 0, 1, ?, 0, …)
There are two possible ways that gap can get resolved.
In case one, the low-quality predictor has a powerful enough model of reality to effectively derive the high-quality data. (We might find this collapses to the original problem, because it has somehow reconstructed the high-quality stream from the low-quality stream and then proceeds as normal. You might argue that’s computationally expensive; ok, then let’s proceed to case two.)
In case two, the predictor working from the low-quality feed predicts wrongly.
(I know you are saying it predicts *uncertainly,* but we still have to have some framework to map uncertainty to a state; we have to round one way or the other. If uncertainty avoids loss, the predictor will be preferentially inconclusive all the time. If we round uncertainty up, effectively we’re in case one. If we round down, effectively case two.)
So we could sharpen case two and say that sometimes the AI’s camera intentionally lies to it on some random subset of scenarios. And the AI finds itself in a chaotic world where it is sometimes punished for predicting things it just knows to be true.
In that case, although it’s easy to show how it would diverge from human simulation, it also might not simulate reality very well, since deriving the algorithm generating the lies might be too computationally complex. (Or maybe it can derive and counter the liar, in which case we’re back at case one, i.e., the original problem.) If liar simulation is impossible, then the optimal predictor might just hit a ceiling and accept some level of noise. Effectively this means we have a new problem: there is no direct translation possible, because the predictor is viewing a “different” world than the human.
I simplified your construct, possibly unfairly, and maybe that’s a way you can salvage your original build. But this was a really illuminating exercise for me to generalize the strategy.
I think there are some classes of builds (maybe yours escapes this) where if you overfit on preventing human simulation, you let direct translation slip away. And then if you rehabilitate direct translation, you have to reexamine if there’s an escape for human simulation. This sort of disjunctive analysis seems like an important strategy for adversarial breakers.
You still may be able to get the bedsheet over both corners, but I think other breakers in general will want to start with some disjunctive approach like this in other cases.
I know you are saying it predicts *uncertainly,* but we still have to have some framework to map uncertainty to a state, we have to round one way or the other. If uncertainty avoids loss, the predictor will be preferentially inconclusive all the time.
There’s a standard trick for scoring an uncertain prediction: It outputs its probability estimate p that the diamond is in the room, and we score it with loss −log(p) if the diamond is really there, −log(1−p) otherwise. Truthfully reporting p minimizes its loss.
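A quick numeric check of that last claim, assuming the predictor’s actual belief is p = 0.7: over a grid of possible reports, the expected loss is minimized by reporting 0.7 itself.

```python
import math

p_belief = 0.7  # assumed: the predictor's honest probability that the diamond is there

def expected_loss(p_reported: float) -> float:
    # Expected -log score, taken under the predictor's own belief.
    return -(p_belief * math.log(p_reported) + (1 - p_belief) * math.log(1 - p_reported))

grid = [i / 100 for i in range(1, 100)]
print(min(grid, key=expected_loss))  # 0.7 -- honest reporting minimizes expected loss
```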
So we could sharpen case two and say that sometimes the AI’s camera intentionally lies to it on some random subset of scenarios
You’re saying that giving it less information (by replacing its camera feed with a lower quality feed) is equivalent to sometimes lying to it? I don’t see the equivalence!
if you overfit on preventing human simulation, you let direct translation slip away.
That’s an interesting thought, can you elaborate?
Happy to try to clarify, and this is helping me rethink my own thoughts, so I appreciate the prompts. I’m playing with new trains of thought here and so have pretty low confidence in where I ended up, so I greatly appreciate any further clarifications or responses you have.
There’s a standard trick for scoring an uncertain prediction: It outputs its probability estimate p
Yup, understand that is how to effectively score uncertainty. I was very wrong to phrase this as “we still have to have some framework to map uncertainty to a state” because you don’t strictly have to do anything, you can just use probabilities.
Restricting this to discrete, binary states allows us to simplify the comparison between models for this discussion. I will claim we can do so with no loss of fidelity (leaning heavily on Shannon, ie, this is all just information, encoding it to binary and back out again doesn’t mess anything up). And doing so is not obliged, but useful.
I really shouldn’t have said “you must X!” I should have said “it’s kind of handy if you X,” sorry for that confusion.
You’re saying that giving it less information (by replacing its camera feed with a lower quality feed) is equivalent to sometimes lying to it? I don’t see the equivalence!
We have a high quality information stream and a low quality information stream, and they both gesture vaguely at the ultimate high quality information stream, namely, the true facts of the matter of the world itself. Say, LQ < HQ < W.
LQ may be low quality because it is missing information that is in HQ; it may just be a subset of HQ, like a lower-resolution video. Or it may contain actual noise, i.e., false information.
If we have a powerful algorithm, we may be able to, at least asymptotically, convert LQ to HQ using processing power. So maybe in some cases LQ + processing = HQ exactly. But that makes the distinction uninteresting, and you would likely have to further degrade v′1 to get the effect you are looking for, so let’s discard that and consider only cases where v′1 is strictly worse.
You can now XOR the outputs of LQ and HQ to sort them into two buckets:
1. A stream of outputs where the two feeds agree.
2. A stream of outputs where they disagree.
So for bucket 1, there are aspects of the world where there’s effectively no loss in quality. But comparing HQ with HQ is not useful, so let’s discard those cases, and examine the corners where LQ and HQ disagree.
LQ effectively has false information about some subset of reality there; that is, in a sense, what “LQ” means.
(Or just has gaps, which resolve to approximate HQ after processing, or fail and resolve to noise, either way.)
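On the binary-array simplification, the bucketing step looks like this (the example readings are made up, with the low-quality “?” entries already rounded to 0 or 1):

```python
hq = [1, 0, 1, 1, 0, 1]  # high-quality readings (diamond present = 1)
lq = [1, 0, 1, 0, 0, 1]  # low-quality readings after rounding

# Bucket 1: scenarios where the feeds agree; bucket 2: where they disagree (the XOR).
agree    = [i for i, (h, l) in enumerate(zip(hq, lq)) if h == l]
disagree = [i for i, (h, l) in enumerate(zip(hq, lq)) if h != l]

print(agree)     # [0, 1, 2, 4, 5]
print(disagree)  # [3] -- where LQ effectively carries false information
```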
if you overfit on preventing human simulation, you let direct translation slip away
Rereading, I think HoldenK started down this path: “once the predictor is good enough that it can get data points right despite missing crucial information, it is also (potentially) good enough that it can learn how to imitate ‘what the human would think had happened if they had more information.’”
So for your block—in a sense you’re giving the human some information the predictor lacks. You’re giving the human “hints,” in the form of higher quality input, which helps get the human closer to perfectly representing the actual world. (Not completely, sometimes there’s still uncertainty, but closer than the predictor is likely to get.)
If that gets the human to “perfect”, then the best the predictor can do is asymptotically approach human prediction and direct translation at the same time.
My Weak Spots
I think one likely objection to what I wrote here is that I am abusing Shannon. I’ve considered that, and would be happy to discuss it more and carefully consider objections along those lines, but I think toy examples would get us there. This is without taking away from your note that “Sometimes the predictor’s probability is strictly between 0 and 1, so it gets some loss.” If p(I eat soup) is 0.6 for every day, we can just ask ten discrete questions: “across n days, will the number of soups I eat converge to n/1? (T/F), to n/2? (T/F), …” I would definitely try to preserve performance and scoring; I just want to run the XOR.
I think another likely objection is that when we apply models, trying to get m(HQ) ≈ W, they rely on interactions among states in complex ways, so we can’t slice the states randomly into two groups without disrupting how the models work at a basic level. I think the response is to simply group these states into bigger subsets of outcomes and treat those as atomic.
I think the biggest and most important objection would be that I’ve misunderstood your block. I would welcome any clarifications, and especially appreciate a toy example if you could, even if not involving diamonds, just to make sure I definitely get what you’re saying in that part.
I’d be interested in other objections or weak spots here, appreciate your time helping me to think this through more carefully and completely.