Or do you mean literally that the world-model uses row-hammer on the computer it runs on, to make the supervisory signal positive?
Yes!
If row-hammering (or whatever) improves the loss, then the gradient will push in that direction.
I don’t think this is true in the situation I’m talking about (“literally that the world-model uses row-hammer on the computer it runs on, to make the supervisory signal positive”).
Let’s say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they’re supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).
Anyway, when the computer does symbolic differentiation / backprop, it’s calculating ∇f, not ∇F. So it won’t necessarily walk its way towards the minimum of F.
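To make that concrete, here’s a toy sketch (the quadratic loss and the “corruption near θ = 3” rule are invented purely for illustration, not claims about any real setup) of gradient descent following ∇f and never noticing the region where F differs:

```python
# Toy illustration only: the quadratic f and the "corruption" rule below are
# made-up assumptions, not claims about real training code or hardware.
import jax
import jax.numpy as jnp

def nominal_loss(theta):
    # f(θ): the loss as specified by the training code.
    return jnp.sum(theta ** 2)

def actual_loss(theta):
    # F(θ): what the loss register would hypothetically read if merely running
    # the model at weights near θ = 3 corrupted the CPU.
    corrupted = jnp.abs(theta[0] - 3.0) < 0.1
    return jnp.where(corrupted, -1e6, nominal_loss(theta))

grad_f = jax.grad(nominal_loss)  # backprop differentiates the nominal graph: ∇f, not ∇F

theta = jnp.array([2.9])         # start right next to the "corrupted" region
for _ in range(200):
    theta = theta - 0.1 * grad_f(theta)  # gradient descent follows ∇f

print(theta)                          # ≈ 0, the minimum of f
print(actual_loss(jnp.array([3.0])))  # -1e6, the "minimum" of F, which the descent never finds
```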
I can’t imagine any processing step without 4th-wall-breaking consequences
Oh yeah, for sure. My idea was: sometimes the 4th-wall-breaking consequences are part of the reason that the processing step is there in the first place, and sometimes the 4th-wall-breaking consequences are just an incidental unintended side-effect, sorta an “externality”.
Like, as the saying goes, maybe a butterfly flapping its wings in Mexico will cause a tornado in Kansas three months later. But that’s not why the butterfly flapped its wings. If I’m working on the project of understanding the butterfly—why does it do the things it does? why is it built the way it’s built?—knowing that there was a tornado in Kansas is entirely unhelpful. It contributes literally nothing whatsoever to my success in this butterfly-explanation project.
So by the same token, I think it’s possible that we can work on the project of understanding a postdictively-trained model—why does it do the things it does? why is it built the way it’s built?—and find that thinking about the 4th-wall-breaking consequences of the processing steps is entirely unhelpful for this project.
I don’t see any reason why a sufficiently advanced postdictive learner with a general enough modality (like text) wouldn’t learn to model 4th-wall-breaking consequences: that’s just the sort of thing you need to predict security exploits or AI alignment posts like this one.
Of course a good postdictive learner will learn that other algorithms can be manipulative, and it could even watch itself in a mirror and understand the full range of things that it could do (see the part of this post “Let’s take a postdictive learner, and grant it “self-awareness”…”). Hmm, maybe the alleged mental block I have in mind is something like “treating one’s own processing steps as being part of the physical universe, as opposed to taking the stance where you’re trying to understand the universe from outside it”. I think an algorithm could predict that security researchers can find security exploits, and predict that AI alignment researchers could write comments like this one, while nevertheless “trying to understand the universe from outside it”.
there’s still the risk of spinning up agents early in training
Oh yeah, for sure. In fact I think there are a lot of areas where we need to develop safety-compatible motivations as soon as possible, and where there’s some kind of race to do so (see the “Fraught Valley” section here). I mean, “hacking into the training environment” is in that category too—you want to install the safety-compatible motivation (where the model doesn’t want to hack into the training environment) before the model becomes a superintelligent adversary trying to hack into the training environment. I don’t like those kinds of races and wish I had better ideas for avoiding them.
Let’s say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they’re supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).
Anyway, when the computer does symbolic differentiation / backprop, it’s calculating ∇f, not ∇F. So it won’t necessarily walk its way towards the minimum of F.
Explained like that, it makes sense. And that’s something I hadn’t thought about.
So by the same token, I think it’s possible that we can work on the project of understanding a postdictively-trained model—why does it do the things it does? why is it built the way it’s built?—and find that thinking about the 4th-wall-breaking consequences of the processing steps is entirely unhelpful for this project.
Completely agree. This is part of my current reasoning for why GPT-3 (and maybe GPT-N) isn’t incentivized toward predict-o-matic behavior.
Hmm, maybe the alleged mental block I have in mind is something like “treating one’s own processing steps as being part of the physical universe, as opposed to taking the stance where you’re trying to understand the universe from outside it”. I think an algorithm could predict that security researchers can find security exploits, and predict that AI alignment researchers could write comments like this one, while nevertheless “trying to understand the universe from outside it”.
I’m confused by that paragraph: in one sentence you sound like you’re saying that the postdictive learner would not see itself as outside the universe, and in the next that it would. Either way, it seems linked to the 1st-person problem we’re discussing in your research update: this is a situation where you seem to expect that the translation into 1st-person knowledge isn’t automatic, and so can be controlled: incentivized or not.
Thanks!