This week I will put up one or two directly relevant posts, which will hopefully make this discussion clearer (though we will definitely still have disagreements). I don’t think I have a solution or anything close, but I do think there is a plausible approach, which we can hopefully either get to work or find to be unworkable.
My hope is to do some kind of robustness amplification (analogous to reliability amplification or capability amplification), which increases the difficulty of finding problematic inputs. We discussed this briefly in the original ALBA post (esp. this comment).
That is, I want to maintain some property like “is OK with high probability for any input distribution that is sufficiently easy to sample from.”
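To make that property slightly more concrete, here is one way I would write it down (my notation, not a definition from the post): for every distribution $\mathcal{D}$ over inputs that can be sampled by a process of complexity at most $s$,

$$\Pr_{x \sim \mathcal{D}}\big[\text{the agent behaves badly on } x\big] \le \varepsilon(s),$$

where $\varepsilon(s)$ is small and grows only slowly with $s$. Robustness amplification would then aim to turn an agent with a mediocre $\varepsilon(s)$ into one with a much smaller $\varepsilon'(s)$.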
Reliability amplification increases the success probability on any fixed input, assuming the success probability starts off high enough; by iterating it we can hopefully drive the failure probability down exponentially.
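As a toy illustration of how that iteration can pay off (a hypothetical majority-vote scheme, not the actual construction from the post): if three independent runs each fail on a fixed input with probability ε, a majority vote over them fails with probability 3ε²(1−ε) + ε³ ≤ 3ε².

```python
# Toy model of reliability amplification as a 3-way majority vote over
# independent runs on the same fixed input. This is an illustrative sketch of
# the quantitative behavior only, and it assumes the runs fail independently.

def amplified_failure_prob(eps: float) -> float:
    """Failure probability of a majority vote over three independent runs,
    each of which fails with probability eps."""
    return 3 * eps**2 * (1 - eps) + eps**3

eps = 0.1
for round_number in range(1, 5):
    eps = amplified_failure_prob(eps)
    print(f"after round {round_number}: failure probability ~ {eps:.2e}")

# Each round roughly squares the failure probability (up to a factor of 3),
# so under these assumptions iterating gives a doubly-exponential improvement
# in the number of rounds, provided we start below 1/2.
```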
Analogously, for robustness there are some particular inputs on which we may always fail. Intuitively, we want to shrink the set of bad inputs, assuming only that “most inputs” are initially OK, so that by iterating we can make the bad set exponentially small. I think “difficulty of finding a bad input” is the right way to formalize something like “density of bad inputs.”
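In the notation of the sketch above (again my phrasing): writing $B$ for the set of bad inputs, the statement “$B$ has density at most $\varepsilon$” is what you get from

$$\Pr_{x \sim \mathcal{D}}[x \in B] \le \varepsilon$$

when $\mathcal{D}$ is a fixed, simple distribution such as the uniform one; allowing $\mathcal{D}$ to range over all sufficiently cheap samplers strengthens this to “a bad input is hard to find,” which is the version of the property that makes sense to try to amplify.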
This process will never fix the vulnerabilities introduced by the learning process (the kind of thing I’m talking about here). And there may be limits on the maximal robustness of any function that can be learned by a particular model. We need to account for both of those things separately. In the best case robustness amplification would let us remove the vulnerabilities that came from the overseer.