5 years later, I’m finally reading this post. Thanks for the extended discussion of postdictive learning; it’s really relevant to my current thinking about alignment for potentially simulator-like Language Models.
Note that others disagree, e.g. advocates of Microscope AI.
I don’t think advocates of Microscope AI think you can reach AGI that way. It’s more that, through Microscope AI, we might end up solving the problems we have without relying on an agent.
Why? Because in predictive training, the system can (under some circumstances) learn to make self-fulfilling prophecies—in other words, it can learn to manipulate the world, not just understand it. For example see Abram Demski’s Parable of the Predict-O-Matic. In postdictive training, the answer is already locked in when the system is guessing it, so there’s no training incentive to manipulate the world. (Unless it learns to hack into the answer by row-hammer or whatever. I’ll get back to that in a later section.)
Agreed, but I think you could be even clearer that the real point is that in postdiction, the guess can never causally influence the answer it’s scored against. As you write, there are cases and versions where prediction also has this property, but it’s not a guarantee by default.
As for the actual argument, that’s definitely part of my reasoning for why I don’t expect GPT-N to have deceptive incentives (although maybe what it simulates would).
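To make the contrast concrete, here’s roughly how I picture the two training loops; the `world` / `logged_data` objects and their method names are just hypothetical stand-ins, not anyone’s actual setup:

```python
# Sketch of the causal difference between the two setups
# (hypothetical interfaces, PyTorch-style training step).

def predictive_step(model, world, loss_fn, optimizer):
    x = world.observe()
    guess = model(x)             # the guess is emitted before the outcome resolves...
    world.react_to(guess)        # ...so it can causally influence what happens next
    y = world.observe_outcome()  # the target may therefore depend on the guess
    loss = loss_fn(guess, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def postdictive_step(model, logged_data, loss_fn, optimizer):
    x, y = logged_data.sample()  # y was locked in before the model ever ran
    guess = model(x)             # no causal path from the guess back to y
    loss = loss_fn(guess, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The only structural difference is whether anything downstream of the guess can feed back into y before the loss is computed; that’s the property I mean above.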
In backprop, but not trial-and-error, and not numerical differentiation, we get some protection against things like row-hammering the supervisory signal.
Even after reading the Wikipedia page, it’s not clear to me what “row-hammering the supervisory signal” would look like. Notably, I don’t see the analogy to the electrical interaction here. Or do you mean literally that the world-model uses row-hammer on the computer it runs on, to make the supervisory signal positive?
The differentiation engine is essentially symbolic, so it won’t (and indeed can’t) “differentiate through” the effects of row-hammer or whatever.
No idea what this means. If row-hammering (or whatever) improves the loss, then the gradient will push in that direction. I feel like the crux is in the specific way you imagine row-hammering happening here, so I’d like to know more about it.
Easy win #3: Don’t access the world-model and then act on that information, at least not without telling it
Slight nitpick, but this last one doesn’t sound like an easy win to me—just an argument for not using a naive safety strategy. I mean, it’s not like we really gain anything in terms of safety; we just avoid completely messing up the capabilities of the model.
(Human example of this error: Imagine someone saying “If fast-takeoff AGI happens, then it would have bizarre consequence X, and there’s no way you really expect that to happen, right?!? So c’mon, there’s not really gonna be fast-takeoff AGI.”. This is an error because if there’s a reason to expect fast-takeoff AGI, and fast-takeoff AGI leads to X, we should make the causal update (“X is more likely than I thought”), not the retrocausal update (“fast-takeoff AGI is less likely than I thought”). Well, probably. I guess on second thought it’s not always a reasoning error.)
I see what you did there. (Joking aside, that’s a telling example.)
And, like other reasoning errors and imperfect heuristics, I expect that it’s self-correcting—i.e., it would manifest more early in training, but gradually go away as the AGI learns meta-cognitive self-monitoring strategies. It doesn’t seem to have unusually dangerous consequences, compared to other things in that category, AFAICT.
One way to make this argument more concrete is to point out that solving this problem helps capabilities as well as safety. So as long as what we’re worried about is a very capable AGI, this should be mitigated.
There are within-universe consequences of a processing step, where the step causes things to happen entirely within the intended algorithm. (By “intended”, I just mean that the algorithm is running without hardware errors). These same consequences would happen for the same reasons if we run the algorithm under homomorphic encryption in a sealed bunker at the bottom of the ocean.
Then there are 4th-wall-breaking consequences of a processing step, where the step has a downstream chain of causation that passes through things in the real world that are not within-universe. (I mean, yes, the chip’s transistors have real-world-impacts on each other, in a manner that implements the algorithm, but that doesn’t count as 4th-wall-breaking.)
This distinction makes some sense to me, but I’m confused by your phrasing (and thus by what you actually mean). I guess my issue is that stating it like that made me think you expected each processing step to be one or the other, whereas I can’t imagine any processing step without 4th-wall-breaking consequences. What you do with these afterwards, i.e. asking whether the 4th-wall-breaking consequences are reasons for specific actions, makes it clearer IMO.
Out-of-distribution, maybe the criterion in question diverges from a good postdiction-generation strategy. Oh well, it will make bad postdictions for a while, until gradient descent fixes it. That’s a capability problem, not a safety problem.
Agreed. Though, as Evan already pointed out, the real worry with mesa-optimizers isn’t proxy alignment but deceptive alignment. And deceptive alignment isn’t just a capability problem.
Another way I’ve been thinking about the issue of mesa-optimizers in GPT-N is the risk of something like malign agents in the models (a bit like this) that GPT-N might be using to simulate different texts. (Oh, I see you already have a section about that)
It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.
Precisely because I share this intuition, I want to try pushing back against it.
First, I don’t see any reason why a sufficiently advanced postdictive learner with a general enough modality (like text) wouldn’t learn to model 4th-wall-breaking consequences: that’s just the sort of thing you need to predict security exploits or AI alignment posts like this one.
Next comes the question of whether it will take advantage of this. Well, a deceptive mesa-optimizer would have an incentive to use it. So I guess the question boils down to the previous discussion, of whether we should expect postdictive learners to spin up deceptive mesa-optimizers.
So a self-aware, aligned AGI could, and presumably would, figure out the idea “Don’t do a step-by-step emulation in your head of a possibly-adversarial algorithm that you don’t understand; or do it in a super-secure sandbox environment if you must”, as concepts encoded in its value function and planner. (Especially if we warn it / steer it away from that.)
I see a thread of turning potential safety issues into capability issues, and then saying that since the AGI is competent, it will not have them. I think this makes sense for a really competent AGI, which would not be taken over by budding agents inside its simulation. But there’s still the risk of spinning up agents early in training, and if those agents get good enough to take over the model from the inside and become deceptive, competence at the training task becomes decorrelated with what happens in deployment.
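Thanks!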
Or do you mean literally that the world-model uses row-hammer on the computer it runs on, to make the supervisory signal positive?
Yes!
If row-hammering (or whatever) improves the loss, then the gradient will push in that direction.
I don’t think this is true in the situation I’m talking about (“literally that the world-model uses row-hammer on the computer it runs on, to make the supervisory signal positive”).
Let’s say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they’re supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).
Anyway, when the computer does symbolic differentiation / backprop, it’s calculating ∇f, not ∇F. So it won’t necessarily walk its way towards the minimum of F.
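If it helps, here’s a toy PyTorch sketch of that point. The “corruption” is just faked by an out-of-band overwrite, so treat it as an illustration of where the gradient does and doesn’t look, not as a faithful model of row-hammer:

```python
import torch

theta = torch.tensor([1.0, 2.0], requires_grad=True)

# Nominal loss f(theta): the computation that autograd actually traces.
f = (theta ** 2).sum()

# Pretend a hardware fault flips bits in the loss register after the forward
# pass. Mimic that by overwriting the value out-of-band, outside the graph:
F_actual = f.detach() - 100.0   # the "actual" loss F(theta), which != f(theta)

# Backprop differentiates the nominal graph, so it returns grad f = 2*theta.
# The out-of-band corruption never entered the graph, so it can't show up in
# the gradient, and gradient descent gets no signal pointing toward it.
f.backward()
print(theta.grad)       # tensor([2., 4.]) -- unaffected by the corruption
print(F_actual.item())  # -95.0            -- what the corrupted hardware reports
```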
I can’t imagine any processing step without 4th-wall-breaking consequences
Oh yeah, for sure. My idea was: sometimes the 4th-wall-breaking consequences are part of the reason that the processing step is there in the first place, and sometimes the 4th-wall-breaking consequences are just an incidental unintended side-effect, sorta an “externality”.
Like, as the saying goes, maybe a butterfly flapping its wings in Mexico will cause a tornado in Kansas three months later. But that’s not why the butterfly flapped its wings. If I’m working on the project of understanding the butterfly—why does it do the things it does? why is it built the way it’s built?—knowing that there was a tornado in Kansas is entirely unhelpful. It contributes literally nothing whatsoever to my success in this butterfly-explanation project.
So by the same token, I think it’s possible that we can work on the project of understanding a postdictively-trained model—why does it do the things it does? why is it built the way it’s built?—and find that thinking about the 4th-wall-breaking consequences of the processing steps is entirely unhelpful for this project.
I don’t see any reason why a sufficiently advanced postdictive learner with a general enough modality (like text) wouldn’t learn to model 4th-wall-breaking consequences: that’s just the sort of thing you need to predict security exploits or AI alignment posts like this one.
Of course a good postdictive learner will learn that other algorithms can be manipulative, and it could even watch itself in a mirror and understand the full range of things that it could do (see the part of this post “Let’s take a postdictive learner, and grant it “self-awareness”…”). Hmm, maybe the alleged mental block I have in mind is something like “treating one’s own processing steps as being part of the physical universe, as opposed to taking the stance where you’re trying to understand the universe from outside it”. I think an algorithm could predict that security researchers can find security exploits, and predict that AI alignment researchers could write comments like this one, while nevertheless “trying to understand the universe from outside it”.
there’s still the risk of spinning up agents early in training
Oh yeah, for sure; in fact I think there are a lot of areas where we need to develop safety-compatible motivations as soon as possible, and where there’s some kind of race to do so (see the “Fraught Valley” section here). I mean, “hacking into the training environment” is in that category too—you want to install the safety-compatible motivation (where the model doesn’t want to hack into the training environment) sooner than the model becomes a superintelligent adversary trying to hack into the training environment. I don’t like those kinds of races and wish I had better ideas for avoiding them.
Let’s say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certain values of θ for which merely running the trained model corrupts the CPU, and thus the bits in the loss register are not what they’re supposed to be according to the nominal algorithm. In those cases f(θ)≠F(θ).
Anyway, when the computer does symbolic differentiation / backprop, it’s calculating ∇f, not ∇F. So it won’t necessarily walk its way towards the minimum of F
Explained like that, it makes sense. And that’s something I hadn’t thought about.
So by the same token, I think it’s possible that we can work on the project of understanding a postdictively-trained model—why does it do the things it does? why is it built the way it’s built?—and find that thinking about the 4th-wall-breaking consequences of the processing steps is entirely unhelpful for this project.
Completely agree. This is part of my current reasoning for why GPT-3 (and maybe GPT-N) aren’t incentivized for predict-o-matic behavior.
Hmm, maybe the alleged mental block I have in mind is something like “treating one’s own processing steps as being part of the physical universe, as opposed to taking the stance where you’re trying to understand the universe from outside it”. I think an algorithm could predict that security researchers can find security exploits, and predict that AI alignment researchers could write comments like this one, while nevertheless “trying to understand the universe from outside it”.
I’m confused by that paragraph: you sound like you’re saying that the postdictive learner would not see itself as outside the universe in one sentence, and that it would in the next. Either way, it seems linked to the 1st person problem we’re discussing in your research update: this is a situation where you seem to expect that the translation into 1st person knowledge isn’t automatic, and so can be controlled, incentivized or not.