Let’s say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they’re supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).
Anyway, when the computer does symbolic differentiation / backprop, it’s calculating ∇f, not ∇F. So it won’t necessarily walk its way towards the minimum of F.
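To make the point concrete, here’s a minimal sketch (my own illustration, not from the original exchange; the quadratic f and the “corrupting” region |θ| > 2 are made-up stand-ins): autodiff only ever sees the nominal loss f, so gradient descent heads for the minimum of f even when the realized loss F is lower somewhere else.

```python
import jax
import jax.numpy as jnp

def f(theta):
    # Nominal loss: what the differentiation / backprop graph describes.
    return jnp.sum(theta ** 2)

def F(theta):
    # Realized loss: identical to f except on a hypothetical "corrupting"
    # region of weight space, where running the model perturbs the stored value.
    corrupted = jnp.any(jnp.abs(theta) > 2.0)   # stand-in for "corrupts the CPU"
    return jnp.where(corrupted, f(theta) - 100.0, f(theta))

grad_f = jax.grad(f)      # what the optimizer actually uses: ∇f, not ∇F

theta = jnp.array([3.0])
print(F(theta))           # -91.0: the realized loss is very low here...
print(grad_f(theta))      # [6.]: ...but the gradient still points up the slope of f,
                          # so descent moves toward θ=0 (the minimum of f), away
                          # from the region where F would be minimized
```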
Explained like that, it makes sense. And that’s something I hadn’t thought about.
So by the same token, I think it’s possible that we can work on the project of understanding a postdictively-trained model—why does it do the things it does? why is it built the way it’s built?—and find that thinking about the 4th-wall-breaking consequences of the processing steps is entirely unhelpful for this project.
Completely agree. This is part of my current reasoning for why GPT-3 (and maybe GPT-N) aren’t incentivized for predict-o-matic behavior.
Hmm, maybe the alleged mental block I have in mind is something like “treating one’s own processing steps as being part of the physical universe, as opposed to taking the stance where you’re trying to understand the universe from outside it”. I think an algorithm could predict that security researchers can find security exploits, and predict that AI alignment researchers could write comments like this one, while nevertheless “trying to understand the universe from outside it”.
I’m confused by that paragraph: in one sentence you sound like you’re saying that the postdictive learner would not see itself as outside the universe, and in the next that it would. Either way, it seems linked with the 1st-person problem we’re discussing in your research update: this is a situation where you seem to expect that the translation into 1st-person knowledge isn’t automatic, and so can be controlled, incentivized or not.