> (1) a (good) pivotal act is probably a non-myopic problem, and (2) you can’t solve a nontrivial nonmyopic problem with a myopic solver. [...] My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it output whatever HCH would do, for instance).
Yeah, that’s right, I definitely agree with (1) and disagree with (2).
> And then Eliezer would probably reply that the non-myopia has been wrapped up somewhere else (e.g. in HCH), and that has become the dangerous part (or, more realistically, the insufficiently capable part, and I expect Eliezer would claim that replacing it with something both sufficiently capable and aligned is about as hard as the whole alignment problem).
I tend to think that HCH is not dangerous, but I agree that it’s likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful. But that’s not that hard, and there’s lots of them that can incentivize good non-myopic behavior that are safe to optimize for as long as the optimizer is myopic.
AI safety via market making is one example, but it’s a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn’t act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).
Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH’s prior times the hypothesis’s likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).
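To make the “prior times likelihood” selection rule concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the hypothesis names, the `hch_prior` numbers standing in for how plausible HCH would judge each hypothesis to be, and the `likelihood` numbers standing in for how well each hypothesis predicts the data. It only shows the argmax step, and says nothing about how HCH or a myopic optimizer would actually be built.

```python
import math

# Hypothetical stand-in for "HCH's prior": how plausible each
# hypothesis looks before seeing any data. Made-up numbers.
hch_prior = {"h_simple": 0.6, "h_medium": 0.3, "h_complex": 0.1}

# Hypothetical stand-in for the likelihood P(data | hypothesis).
# Also made-up numbers.
likelihood = {"h_simple": 0.02, "h_medium": 0.10, "h_complex": 0.15}


def select_hypothesis(prior, lik):
    """Return the hypothesis maximizing prior(h) * likelihood(h).

    The product is compared in log space, which leaves the argmax
    unchanged but avoids underflow when the numbers are tiny.
    """
    def log_score(h):
        return math.log(prior[h]) + math.log(lik[h])

    return max(prior, key=log_score)


best = select_hypothesis(hch_prior, likelihood)
print(best)  # h_medium (0.3 * 0.10 beats 0.6 * 0.02 and 0.1 * 0.15)
```

In this toy table the most probable hypothesis under the prior and the best-fitting hypothesis under the likelihood both lose to the one with the best product, which is the sense in which this objective can go further than imitating the prior alone.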
It still doesn’t seem to me like you’ve sufficiently answered the objection here.
> I tend to think that HCH is not dangerous, but I agree that it’s likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.
What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed at accomplishing things “in the real world”)?
It seems to me that Eliezer has presented quite compelling arguments that the above is the case, and on a first pass it doesn’t look to me like you’ve countered those arguments.
> But that’s not that hard, and there’s lots of them that can incentivize good non-myopic behavior that are safe to optimize for as long as the optimizer is myopic.
How does a “myopic optimizer” successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about? To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
> AI safety via market making is one example, but it’s a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn’t act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).
>
> Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH’s prior times the hypothesis’s likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).
Both of these seem to be examples of solutions that simply push the problem back a step, rather than seeking to eliminate it directly. My model of Eliezer would call this attempting to manipulate confusion, and caution that, although adding more gears to your perpetual motion machine might make the physics-violating component harder to pick out, it does not change the fact that somewhere within the model is a step that violates physics.
In this case, it seems as though all of your proposals are of the form “Train your model to imitate some process X (where X is non-myopic and potentially unsafe), while adding incentives in favor of myopic behavior during training.” To which my model of Eliezer replies, “Either your model will end up myopic, and not powerful enough to capture the part of X that actually does the useful work we are interested in, or it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper.”
It seems to me that to usefully refute this, you need to successfully argue against Eliezer’s background premise here—the one about power and non-myopic reasoning going hand-in-hand in a deep manner that, while perhaps circumventable via similarly deep insights, is not patchable via shallow methods like “Instead of directly using dangerous process X, we will imitate X, thereby putting an extra layer of abstraction between ourselves and the danger.” My current impression is that you have not been arguing against this background premise at all, and as such I don’t think your arguments hit at the core of what makes Eliezer doubt your proposals.
> How does a “myopic optimizer” successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about?
It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would.
> To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
The sense that it’s still myopic is in the sense that it’s non-deceptive, which is the only sense that we actually care about.
> it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper
The safety improvement that I’m claiming is that it wouldn’t be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?
[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don’t anti-endorse them either, or else I wouldn’t be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer’s, but instead something like my extrapolation of my model of Eliezer, which may not correspond at all to what the real Eliezer thinks.]
> > To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
>
> The sense that it’s still myopic is in the sense that it’s non-deceptive, which is the only sense that we actually care about.
>
> > it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper
>
> The safety improvement that I’m claiming is that it wouldn’t be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?
If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?
Conversely, if the myopic agent does not learn to imitate the underlying process to sufficient resolution that unwanted behaviors like deception start carrying over, then it is very likely that the powerful consequentialist properties of the underlying process have not been carried over, either. This is because (on my extrapolation of Eliezer’s model) deceptive behavior, like all other instrumental strategies, arises from consequentialist reasoning, and is deeply tied to such reasoning in a way that is not cleanly separable—which is to say, by default, you do not manage to sever one without also severing the other.
Again, I (my model of Eliezer) does not think the “deep tie” in question is necessarily insoluble; perhaps there is some sufficiently clever method which, if used, would successfully filter out the “unwanted” instrumental behavior (“deception”, in your terminology) from the “wanted” instrumental behavior (planning, coming up with strategies, in general being an effective agent in the real world). But this distinction between “wanted” and “unwanted” is not a natural distinction; it is, in fact, a distinction highly entangled with human concepts and human values, and any “filter” that selects based on said distinction will need to be of similar complexity. (Of identical complexity, in fact, to the whole alignment problem.) “Simple” filters like the thing you are calling “myopia” definitely do not suffice to perform this function.
I’d be interested in hearing which aspect(s) of the above model you disagree with, and why.
> If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?
Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you’ll get deception. The solution isn’t to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it’s just to not imitate something that would be deceptive.
It’s perhaps worth pointing out why, if we already have something to imitate that isn’t deceptive, we don’t just run that thing directly. The answer is that we can’t: all of the sorts of things that might be both competitive and safe to myopically imitate are things like HCH that are too inefficient to run directly.
This is a great thread. Let me see if I can restate the arguments here in different language:
Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob’s brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, “You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it.” Bob’s a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
My understanding of Evan’s argument at this point would be: “Okay; so we don’t have the technology to directly simulate Bob’s brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob’s brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn’t itself introduce novel long-term dependencies that leave room for deception.”
Note that myopia — i.e., the purging of long term dependencies from the training feedback signal — isn’t really conceptually central to the above scheme. Rather it is just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob’s brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it but not otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
At this point my understanding of Eliezer’s counterargument would be: “Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that.” And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the “belling the cat” bit, I believe.)
And at this point, maybe (?) Evan says, “But wait; the Bob-copy isn’t actually a consequentialist because it was trained myopically.” And if that’s what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.
Is this roughly right? Or have I missed something?
Eliezer’s counterargument is “You don’t get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends. The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you.”
To be clear, I agree with this as a response to what Edouard said—and I think it’s a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don’t think it’s a response to what I’m advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn’t be too surprised that it’s not fully clear).
In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that’s clearly not a safe model to amplify, and probably not even a safe model to train in the first place. That’s because instead of getting a model that actually cares about imitating Bob or anything like that, you probably just got some pseudo-aligned mesa-optimizer with an objective that produces behavior that happens to correlate well with Bob’s.
However, there does exist a purely theoretical construct—what would happen if you actually amplified Bob, not an imitation of Bob—that is very likely to be safe and superhuman (though probably still not fully competitive, but we’ll put that aside for now since it doesn’t seem to be the part you’re most skeptical of). Thus, if you could somehow get a model that was in fact trying to imitate amplified Bob, you might be okay—except that that’s not true, because most types of agents, when given the objective of imitating a safe thing, will end up with a bunch of convergent instrumental goals that break that safety. However, I claim that there are natural types of agents (that is, not too complex on a simplicity prior) that, when given the objective of imitating a safe thing, do so safely. That’s what I mean by my step (1) above (and of course, even if such natural agents exist, there’s still a lot you have to do to make sure you get them—that’s the rest of the steps).
But since you seem most skeptical of (1), maybe I’ll try to lay out my basic case for how I think we can get a theory of simple, safe imitators (including simple imitators with arbitrary levels of optimization power):
1. All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary (similarly to why an approval-directed agent wouldn’t do this sort of thing—the main problem with approval-directed agents just being that human approval is not a very good thing to optimize for).
2. Specifying a robust Cartesian boundary is not that hard—you just need a good multi-level world-model, which any powerful agent should have to have anyway.
3. There are remaining issues related to superrationality, but those can be avoided by having a decision theory that ignores them (e.g. the right sort of CDT variant).
4. There are also some remaining issues related to tiling, but those can be avoided if the Cartesian boundary is structured in such a way that it excludes other agents (this is exactly the trick that LCDT pulls).
> All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary
I’m confused from several directions here. What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate “an agent that optimizes an objective” are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
> are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
No—I’m separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you’re going to get it. My step (1) above, which is what I understand that we’re talking about, is just about that first piece: understanding what we’re going to be shooting for when we set up our training process (and then once we know what we’re shooting for we can think about how to set up a training process to actually land there). See “How do we become confident in the safety of a machine learning system?” for understanding this way of thinking about ML systems.
It’s worth pointing out, however, that even when we’re just focusing on that first part, it’s very important that we pay attention to the total complexity that we’re paying in specifying what sort of model we want, since that’s going to determine a lot of how difficult it will be to actually construct a training process that produces such a model. Exactly what sort of complexity we should be paying attention to is a bit unclear, but I think that the best model we currently have of neural network inductive biases is something like a simplicity prior with a speed cap (see here for some empirical evidence for this).
> What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute
Broadly speaking, I’d say that a Cartesian boundary is robust if the agent has essentially the same concept of what its action, observation, etc. is regardless of what additional true facts it learns about the world.
The Cartesian boundary itself does nothing to prevent an agent from trying to get more compute to simulate better, but having an objective that’s just specified in terms of actions rather than world states does. If you want a nice simple proof of this, Alex Turner wrote one up here (and discusses it a bit more here), which demonstrates that instrumental convergence disappears when you have an objective specified in terms of action-observation histories rather than world states.
Like I said above, however, there are still some remaining problems—just having an objective specified in terms of actions isn’t quite enough.
Thanks, that helps. So actually this objection says: “No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you’ve built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours.”
Closer, yeah. In the limit of doing insanely complicated things with Bob, you will start to break him even if he is faithfully simulated, since you will be doing things that would break the actual Bob; but I think HCH schemes fail long before they get to that point.
Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it’s imitating. In particular, if it’s imitating humans working on alignment, then it’s at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)
For full strength, this argument requires that:
1. It emulates the kind of alignment research which the actual humans would do, rather than some other kind of work
2. It correctly imitates the humans
Once we relax either of those assumptions, the argument gets riskier. A relaxation of the first assumption would be e.g. using HCH in place of humans working normally on the problem for a while (I expect this would not work nearly as well as the actual humans doing normal research, in terms of both safety and capability). The second assumption is where inner alignment problems and Evan’s work enter the picture.
> The solution isn’t to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it’s just to not imitate something that would be deceptive.
Okay, I think this helps me understand your view better. Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
If this is the case, however, then it raises two more questions in my view:
(less important) What work is “myopia” doing in the training process? Is the point of running a myopic imitator also just to buy efficiency (how so?), or is “myopia” somehow producing additional safety gains on top of the (already-postulated-to-exist) safety inherent to the underlying (non-deceptive) process?
(more important) It still seems to me that you are presupposing the existence of a (relatively easy-to-get) process that is simultaneously “powerful” and “non-deceptive” (“competitive and safe”, to use the same words you used in your most recent response). Again, on my view (of Eliezer’s view), this is something that is not easy to get; either the cognitive process you are working with is general enough to consider instrumental strategies (out of which deception naturally emerges as a special case), or it is not; this holds unless there is some special “filter” in place that specifically separates wanted instrumental strategies from unwanted ones. I would still like to know if/how you disagree with this, particularly as it concerns things like e.g. HCH, which (based on your previous comments) you seem to view as an example of a “competitive but safe” process.
> Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
My model of Evan is gonna jump in here (and he can correct me if I’m wrong), see if it helps….
I like the first part, but I don’t think the “simply efficiency” part is correct.
Instead, I think that actually training a model involves real-world model-training things like “running gradient descent on GPUs”. But Process X doesn’t have to involve “running gradient descent on GPUs”. Process X can be a human in the real world, or some process existing in a platonic sandbox, or whatever.
If we train a model to be myopically imitating every step of Process X, we get non-myopia in Process X’s world (e.g. the world of the human making their human plans), but we get myopia in regards to “running gradient descent on GPUs” and such.
I think Evan is using a specific sense of “deception” which is intimately related to “running gradient descent on GPUs”, so he can declare victory over (this form of) “deception”.
(Unless, I guess, instead of imitating the steps of safe non-myopic Process X, we accidentally imitate the steps of dangerous non-myopic Process Y, which is so clever that it figures out that it’s running in a simulation and tries to hack into base reality, or whatever.)
In other words, the reason to do the myopic imitation is that (non-myopic but nevertheless safe) process X is not a trained model, it’s an idea, or ideal. We want to get from there to a trained model without introducing new safety problems in the process.
(Not agreeing or disagreeing with any of this, just probing my understanding.)
Yeah, that’s right, I definitely agree with (1) and disagree with (2).
I tend to think that HCH is not dangerous, but I agree that it’s likely insufficiently capable. To solve that problem, we have to do go to a myopic objective that is more powerful. But that’s not that hard, and there’s lots of them that can incentivize good non-myopic behavior that are safe to optimize for as long as the optimizer is myopic.
AI safety via market making is one example, but it’s a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn’t act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).
Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH’s prior times the hypothesis’s likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).
It still doesn’t seem to me like you’ve sufficiently answered the objection here.
What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed at accomplishing things “in the real world”)?
It seems to me that Eliezer has presented quite compelling arguments that the above is the case, and on a first pass it doesn’t look to me like you’ve countered those arguments.
How does a “myopic optimizer” successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about? To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
Both of these seem to be examples of solutions that simply push the problem back a step, rather than seeking to eliminate it directly. My model of Eliezer would call this attempting to manipulate confusion, and caution that, although adding more gears to your perpetual motion machine might make the physics-violating component harder to pick out, it does not change the fact that somewhere within the model is a step that violates physics.
In this case, it seems as though all of your proposals are of the form “Train your model to imitate some process X (where X is non-myopic and potentially unsafe), while adding incentives in favor of myopic behavior during training.” To which my model of Eliezer replies, “Either your model will end up myopic, and not powerful enough to capture the part of X that actually does the useful work we are interested in, or it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper.”
It seems to me that to usefully refute this, you need to successfully argue against Eliezer’s background premise here—the one about power and non-myopic reasoning going hand-in-hand in a deep manner that, while perhaps circumventable via similarly deep insights, is not patchable via shallow methods like “Instead of directly using dangerous process X, we will imitate X, thereby putting an extra layer of abstraction between ourselves and the danger.” My current impression is that you have not been arguing against this background premise at all, and as such I don’t think your arguments hit at the core of what makes Eliezer doubt your proposals.
It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would.
The sense that it’s still myopic is in the sense that it’s non-deceptive, which is the only sense that we actually care about.
The safety improvement that I’m claiming is that it wouldn’t be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?
[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don’t anti-endorse them either, or else I wouldn’t be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer’s, but instead something like my extrapolation of my model of Eliezer, which may not correspond at all to what the real Eliezer thinks.]
If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?
Conversely, if the myopic agent does not learn to imitate the underlying process to sufficient resolution that unwanted behaviors like deception start carrying over, then it is very likely that the powerful consequentialist properties of the underlying process have not been carried over, either. This is because (on my extrapolation of Eliezer’s model) deceptive behavior, like all other instrumental strategies, arises from consequentialist reasoning, and is deeply tied to such reasoning in a way that is not cleanly separable—which is to say, by default, you do not manage to sever one without also severing the other.
Again, I (my model of Eliezer) does not think the “deep tie” in question is necessarily insoluble; perhaps there is some sufficiently clever method which, if used, would successfully filter out the “unwanted” instrumental behavior (“deception”, in your terminology) from the “wanted” instrumental behavior (planning, coming up with strategies, in general being an effective agent in the real world). But this distinction between “wanted” and “unwanted” is not a natural distinction; it is, in fact, a distinction highly entangled with human concepts and human values, and any “filter” that selects based on said distinction will need to be of similar complexity. (Of identical complexity, in fact, to the whole alignment problem.) “Simple” filters like the thing you are calling “myopia” definitely do not suffice to perform this function.
I’d be interested in hearing which aspect(s) of the above model you disagree with, and why.
Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you’ll get deception. The solution isn’t to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it’s just to not imitate something that would be deceptive.
It’s perhaps worth pointing out why, if we already have something to imitate that isn’t deceptive, we don’t just run that thing directly. The answer is that we can’t: all of the sorts of things that might be both competitive and safe to myopically imitate are things like HCH that are too inefficient to run directly.
This is a great thread. Let me see if I can restate the arguments here in different language:
Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob’s brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, “You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it.” Bob’s a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
My understanding of Evan’s argument at this point would be: “Okay; so we don’t have the technology to directly simulate Bob’s brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob’s brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn’t itself introduce novel long-term dependencies that leave room for deception.”
Note that myopia — i.e., the purging of long term dependencies from the training feedback signal — isn’t really conceptually central to the above scheme. Rather it is just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob’s brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it but not otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
At this point my understanding of Eliezer’s counterargument would be: “Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that.” And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the “belling the cat” bit, I believe.)
And at this point, maybe (?) Evan says, “But wait; the Bob-copy isn’t actually a consequentialist because it was trained myopically.” And if that’s what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.
Is this roughly right? Or have I missed something?
Eliezer’s counterargument is “You don’t get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends. The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you.”
To be clear, I agree with this as a response to what Edouard said—and I think it’s a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don’t think it’s a response to what I’m advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn’t be too surprised that it’s not fully clear).
In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that’s clearly not a safe model to amplify, and probably not even a safe model to train in the first place. That’s because instead of getting a model that actually cares about imitating Bob or anything like that, you probably just got some pseudo-aligned mesa-optimizer with an objective that produces behavior that happens to correlate well with Bob’s.
However, there does exist a purely theoretical construct—what would happen if you actually amplified Bob, not an imitation of Bob—that is very likely to be safe and superhuman (though probably still not fully competitive, but we’ll put that aside for now since it doesn’t seem to be the part you’re most skeptical of). Thus, if you could somehow get a model that was in fact trying to imitate amplified Bob, you might be okay—except that that’s not true, because most types of agents, when given the objective of imitating a safe thing, will end up with a bunch of convergent instrumental goals that break that safety. However, I claim that there are natural types of agents (that is, not too complex on a simplicity prior) that, when given the objective of imitating a safe thing, do so safely. That’s what I mean by my step (1) above (and of course, even if such natural agents exist, there’s still a lot you have to do to make sure you get them—that’s the rest of the steps).
But since you seem most skeptical of (1), maybe I’ll try to lay out my basic case for how I think we can get a theory of simple, safe imitators (including simple imitators with arbitrary levels of optimization power):
All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary (similarly to why an approval-directed agent wouldn’t do this sort of thing—the main problem with approval-directed agents just being that human approval is not a very good thing to optimize for).
Specifying a robust Cartesian boundary is not that hard: you just need a good multi-level world-model, which any powerful agent would need to have anyway.
There are remaining issues related to superrationality, but those can be avoided by having a decision theory that ignores them (e.g. the right sort of CDT variant).
There are also some remaining issues related to tiling, but those can be avoided if the Cartesian boundary is structured in such a way that it excludes other agents (this is exactly the trick that LCDT pulls).
I’m confused from several directions here. What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate “an agent that optimizes an objective” are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
No. I’m separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you’re going to get it. My step (1) above, which as I understand it is what we’re talking about, is just about that first piece: understanding what we’re going to be shooting for when we set up our training process (and then once we know what we’re shooting for we can think about how to set up a training process to actually land there). See “How do we become confident in the safety of a machine learning system?” for more on this way of thinking about ML systems.
It’s worth pointing out, however, that even when we’re just focusing on that first part, it’s very important that we pay attention to the total complexity that we’re paying in specifying what sort of model we want, since that’s going to determine a lot of how difficult it will be to actually construct a training process that produces such a model. Exactly what sort of complexity we should be paying attention to is a bit unclear, but I think that the best model we currently have of neural network inductive biases is something like a simplicity prior with a speed cap (see here for some empirical evidence for this).
Broadly speaking, I’d say that a Cartesian boundary is robust if the agent has essentially the same concept of what its action, observation, etc. is regardless of what additional true facts it learns about the world.
The Cartesian boundary itself does nothing to prevent an agent from trying to get more compute to simulate better, but having an objective that’s just specified in terms of actions rather than world states does. If you want a nice simple proof of this, Alex Turner wrote one up here (and discusses it a bit more here), which demonstrates that instrumental convergence disappears when you have an objective specified in terms of action-observation histories rather than world states.
Like I said above, however, there are still some remaining problems—just having an objective specified in terms of actions isn’t quite enough.
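To make the contrast between objectives over world states and objectives over actions concrete, here is a minimal toy sketch. It is not Alex Turner's construction, just an illustration under made-up assumptions: the two actions, the `compute` variable, and both scoring functions are hypothetical. A brute-force planner is run against each objective. Under the state-dependent objective the best plan starts by acquiring compute, while under the action-matching objective there is nothing to gain from doing so.

```python
from itertools import product

# Hypothetical toy setup: an agent picks one of two actions per step.
ACTIONS = ("answer", "grab_compute")

def step(compute, action):
    """Toy dynamics: grabbing compute raises the compute level by one."""
    return compute + 1 if action == "grab_compute" else compute

def state_based_return(plan):
    """Objective over world states: an answer is worth more with more compute."""
    compute, total = 0, 0
    for action in plan:
        if action == "answer":
            total += 1 + compute
        compute = step(compute, action)
    return total

def action_based_return(plan, demo):
    """Objective over actions: one point per step matching the demonstrator."""
    return sum(a == d for a, d in zip(plan, demo))

def best_plan(score, horizon=3):
    """Brute-force search over every action sequence of the given length."""
    return max(product(ACTIONS, repeat=horizon), key=score)

# Assumed demonstrator that never seeks extra resources.
DEMO = ("answer", "answer", "answer")

state_optimum = best_plan(state_based_return)
action_optimum = best_plan(lambda plan: action_based_return(plan, DEMO))

print(state_optimum)   # ('grab_compute', 'answer', 'answer')
print(action_optimum)  # ('answer', 'answer', 'answer')
```

The toy only shows that resource acquisition stops being instrumentally useful once the score ignores world state. It says nothing about the remaining problems mentioned above.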
Thanks, that helps. So actually this objection says: “No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you’ve built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours.”
Is that an accurate interpretation?
Closer, yeah. In the limit of doing insanely complicated things with Bob, you will start to break him even if he is faithfully simulated, because you will be doing things that would break the actual Bob; but I think HCH schemes fail long before they get to that point.
Gotcha. Well, that seems right—certainly in the limit case.
Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it’s imitating. In particular, if it’s imitating humans working on alignment, then it’s at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)
For full strength, this argument requires that:
It emulates the kind of alignment research which the actual humans would do, rather than some other kind of work
It correctly imitates the humans
Once we relax either of those assumptions, the argument gets riskier. A relaxation of the first assumption would be e.g. using HCH in place of humans working normally on the problem for a while (I expect this would not work nearly as well as the actual humans doing normal research, in terms of both safety and capability). The second assumption is where inner alignment problems and Evan’s work enter the picture.
Okay, I think this helps me understand your view better. Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
If this is the case, however, then it raises two more questions in my view:
(less important) What work is “myopia” doing in the training process? Is the point of running a myopic imitator also just to buy efficiency (how so?), or is “myopia” somehow producing additional safety gains on top of the (already-postulated-to-exist) safety inherent to the underlying (non-deceptive) process?
(more important) It still seems to me that you are presupposing the existence of a (relatively easy-to-get) process that is simultaneously “powerful” and “non-deceptive” (“competitive and safe”, to use the same words you used in your most recent response). Again, on my view (of Eliezer’s view), this is something that is not easy to get; either the cognitive process you are working with is general enough to consider instrumental strategies (out of which deception naturally emerges as a special case), or it is not; this holds unless there is some special “filter” in place that specifically separates wanted instrumental strategies from unwanted ones. I would still like to know if/how you disagree with this, particularly as it concerns things like e.g. HCH, which (based on your previous comments) you seem to view as an example of a “competitive but safe” process.
My model of Evan is gonna jump in here (and he can correct me if I’m wrong), see if it helps….
I like the first part, but I don’t think the “simply efficiency” part is correct.
Instead, I think that actually training a model involves real-world model-training things like “running gradient descent on GPUs”. But Process X doesn’t have to involve “running gradient descent on GPUs”. Process X can be a human in the real world, or some process existing in a platonic sandbox, or whatever.
If we train a model to myopically imitate every step of Process X, we get non-myopia in Process X’s world (e.g. the world of the human making their human plans), but we get myopia with regard to “running gradient descent on GPUs” and such.
I think Evan is using a specific sense of “deception” which is intimately related to “running gradient descent on GPUs”, so he can declare victory over (this form of) “deception”.
(Unless, I guess, instead of imitating the steps of safe non-myopic Process X, we accidentally imitate the steps of dangerous non-myopic Process Y, which is so clever that it figures out that it’s running in a simulation and tries to hack into base reality, or whatever.)
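As a rough sketch of what "myopically imitating every step of Process X" could look like as a training objective, the snippet below scores a toy imitator one step at a time. Everything here is a hypothetical illustration: the trajectory, the `table_policy` helper, and the choice of negative log-likelihood are assumptions rather than anything specified in the thread. Real training would fit a model by gradient descent, and this only shows the shape of the loss. The point is that Process X's recorded actions form a multi-step plan, while each loss term depends only on the imitator's prediction at that single step.

```python
import math

# Hypothetical trajectory from "Process X": (observation, action X took).
# X's actions form a multi-step plan, so X is non-myopic in its own world.
PROCESS_X_TRAJECTORY = [
    ("no information yet", "gather information"),
    ("information gathered", "draft plan"),
    ("plan drafted", "carry out plan"),
]

def per_step_losses(policy, trajectory):
    """Negative log-likelihood of X's action at each step, scored in isolation."""
    return [-math.log(policy(obs)[action]) for obs, action in trajectory]

def myopic_imitation_loss(policy, trajectory):
    """Mean of the per-step losses; no term depends on any later step."""
    losses = per_step_losses(policy, trajectory)
    return sum(losses) / len(losses)

def table_policy(confidence):
    """Toy imitator: puts `confidence` on X's action and the rest on 'other'."""
    lookup = dict(PROCESS_X_TRAJECTORY)

    def policy(obs):
        return {lookup[obs]: confidence, "other": 1.0 - confidence}

    return policy

loss = myopic_imitation_loss(table_policy(0.9), PROCESS_X_TRAJECTORY)
print(round(loss, 4))  # 0.1054
```

Because the loss is a plain average of independent per-step terms, the imitator gains nothing at step t by influencing what it will see at step t+1. That is the sense of myopia with respect to the training process used here.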
In other words, the reason to do the myopic imitation is that (non-myopic but nevertheless safe) process X is not a trained model, it’s an idea, or ideal. We want to get from there to a trained model without introducing new safety problems in the process.
(Not agreeing or disagreeing with any of this, just probing my understanding.)