The notion of (1) seems like the cat-belling problem here; the other steps don’t seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.
What pivotal act is this AGI supposed to be executing? Designing a medium-strong nanosystem? How would you do that via a myopic system? That means the AGI needs to design a nanosystem whose purpose spans over time and whose current execution has distant good consequences. It doesn’t matter whether you claim it’s being done by something that internally looks like myopic HCH any more than it matters that it’s being done by internal transistors that don’t have tiny models of the future inside themselves. What’s consequentialist and farseeing isn’t the transistors, or the floating-point multiplications, or the elaborate HCH or whatever, it’s the actual work and actual problem being solved by the system whereby it produces a nanosystem that has coherent effects on the physical world spanning hours and days.
The notion of (1) seems like the cat-belling problem here; the other steps don’t seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.
I’m surprised that you think (1) is the hard part—though (1) is what I’m currently working on, since I think it’s necessary to make a lot of the other parts go through, I expect it to be one of the easiest parts of the story to make work.
What pivotal act is this AGI supposed to be executing? Designing a medium-strong nanosystem?
I left this part purposefully vague, but I’m happy to accept designing a medium-strong nanosystem as the pivotal act to consider here for the sake of argument, since I think that if your advanced AI can’t at least do that, then it probably can’t do anything else pivotal either.
That means the AGI needs to design a nanosystem whose purpose spans over time and whose current execution has distant good consequences.
Agreed.
It doesn’t matter whether you claim it’s being done by something that internally looks like myopic HCH any more than it matters that it’s being done by internal transistors that don’t have tiny models of the future inside themselves.
I think this is where you misunderstand me. I suspect that you don’t really understand what I mean by myopia.
Let me see if I can explain, just using the HCH example. Though I suspect that imitating HCH is actually not powerful enough to do a pivotal act—and I suspect you agree—it’s a perfectly good example to showcase what I mean by myopia.
To start with, the optimization wouldn’t be done by HCH, or anything that would internally look like HCH in the slightest—rather, the optimization would be done by whatever powerful optimization process is inside of our model. Where myopia comes into play is in what goal we’re trying to direct that optimization towards. The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do. In such a situation, you would have a model that can use its own powerful internal optimization procedures to imitate what HCH would do as effectively as possible—able to do things like effectively manage cognitive resources and reason about how best to go about producing an action that is as close as possible to HCH.
The natural class that I think this example is pointing to is the class of optimizers that optimize for an objective that is exclusively about their action through a Cartesian boundary, rather than the consequences of their action on the world. Such optimizers can still end up producing actions with far-reaching consequences on the world if they deploy their optimization power in the service of an objective like imitating HCH that requires producing actions with particular consequences, however. In such a situation, the model would be actively doing lots of reasoning about the consequences of its actions on the world, though not for the goal of producing a particular consequence, but rather just for the goal of producing a particular action, e.g. the one that matches up to what HCH would do. Thus, optimizers of this form can do all sorts of extremely powerful, long-term, non-myopic tasks—but without ever having any incentive to act deceptively.
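To make the contrast concrete, here is a minimal sketch of the two kinds of objectives (all names here are hypothetical illustration, not anything from the post): the first scores only the action on the agent's side of the Cartesian boundary, while the second scores the world state the action leads to.

```python
# Toy illustration of the distinction drawn above (hypothetical names).
# A myopic, action-directed objective scores the action itself; a
# consequentialist objective scores the world state that results from it.

def myopic_imitation_objective(action, hch_action, distance):
    # Cares only about the action through the Cartesian boundary:
    # how close is it to what HCH would have output?
    return -distance(action, hch_action)

def consequentialist_objective(action, world_model, utility):
    # Cares about the action's consequences: roll the world forward
    # and score the resulting state.
    future_state = world_model.predict(action)
    return utility(future_state)
```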
Notably, there are a bunch of nuances here, regarding things like ensuring that the agent doesn’t end up optimizing its objective non-myopically because of acausal trade considerations, ensuring that it doesn’t want to self-modify into a different sort of agent, making sure it doesn’t just spin up other agents that act non-myopically, etc., but these problems are really not that hard to solve. As a proof of concept, LCDT definitely solves all of these problems, showcasing that an optimizing system that really “just imitates HCH” is possible. Unfortunately, LCDT is not quite as natural as I would like, since it requires paying a bunch of bits of complexity to specify a fundamental concept of an “agent,” such that I don’t think that the final solution here will actually look much like LCDT. Rather, I suspect that a more natural class of myopic agents will come from something more like analyzing the general properties of different types of optimizers over Cartesian boundaries.
Regardless, I strongly doubt that just developing a proper notion of myopia here poses a fundamental obstacle—we already have evidence that optimizers of this form are possible and can have the desired properties, and the basic concept of “an optimizer that just cares about its next action” is natural enough that I’d be quite surprised if we couldn’t fully systematize it. I do suspect that any systematization will require paying the complexity of specifying a Cartesian boundary, but I’d be quite surprised if that cost us enough complexity to make the desired class too unnatural.
What’s consequentialist and farseeing isn’t the transistors, or the floating-point multiplications, or the elaborate HCH or whatever, it’s the actual work and actual problem being solved by the system whereby it produces a nanosystem that has coherent effects on the physical world spanning hours and days.
Certainly it doesn’t matter what substrate the computation is running on. I don’t think this is really engaging with anything that I’m saying.
Certainly it doesn’t matter what substrate the computation is running on.
I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I’m guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren’t myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.
Reasoning correctly about far-reaching consequences by default (1) has mistargeted consequences, and (2) is done by summoning a dangerous reasoner.
Such optimizers can still end up producing actions with far-reaching consequences on the world if they deploy their optimization power in the service of an objective like imitating HCH that requires producing actions with particular consequences, however.
I think what you’re saying here implies that you think it is feasible to assemble myopic reasoners into a non-myopic reasoner, without compromising safety. My possibly-straw understanding is that the way this is supposed to happen in HCH is that, basically, the humans providing the feedback train the imitator(s) to implement a collective message-passing algorithm that answers any reasonable question or whatever. This sounds like a non-answer, i.e. it’s just saying “...and then the humans somehow assemble myopic reasoners into a non-myopic reasoner”. Where’s the non-myopicness? If there’s non-myopicness happening in each step of the human consulting HCH, then the imitator is imitating a non-myopic reasoner and so is non-myopic (and this is compounded by distillation steps). If there isn’t non-myopicness happening in each step, how does it come into the assembly?
Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I’m guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren’t myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.
To be clear, I agree with this also, but don’t think it’s really engaging with what I’m advocating for—I’m not proposing any sort of assemblage of reasoners; I’m not really sure where that misconception came from.
I don’t think the assemblage is the point. I think the idea here is that “myopia” is a property of problems: a non-myopic problem is (roughly) one which inherently requires doing things with long time horizons. I think Eliezer’s claim is that (1) a (good) pivotal act is probably a non-myopic problem, and (2) you can’t solve a nontrivial nonmyopic problem with a myopic solver. Part (2) is what I think TekhneMakr is gesturing at and Eliezer is endorsing.
My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it output whatever HCH would do, for instance). And then Eliezer would probably reply that the non-myopia has been wrapped up somewhere else (e.g. in HCH), and that has become the dangerous part (or, more realistically, the insufficiently capable part, and I expect Eliezer would claim that replacing it with something both sufficiently capable and aligned is about as hard as the whole alignment problem). I’m not sure what your response would be to that.
(1) a (good) pivotal act is probably a non-myopic problem, and (2) you can’t solve a nontrivial nonmyopic problem with a myopic solver. [...] My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it output whatever HCH would do, for instance).
Yeah, that’s right, I definitely agree with (1) and disagree with (2).
And then Eliezer would probably reply that the non-myopia has been wrapped up somewhere else (e.g. in HCH), and that has become the dangerous part (or, more realistically, the insufficiently capable part, and I expect Eliezer would claim that replacing it with something both sufficiently capable and aligned is about as hard as the whole alignment problem).
I tend to think that HCH is not dangerous, but I agree that it’s likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful. But that’s not that hard, and there are lots of them that incentivize good non-myopic behavior and are safe to optimize for as long as the optimizer is myopic.
AI safety via market making is one example, but it’s a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn’t act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).
Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH’s prior times the hypothesis’s likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).
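Schematically, the imitative-generalization objective described here could be written as follows (my formalization of the sentence above, not a quote from the original proposal):

```latex
% The selected hypothesis maximizes HCH's prior on it times the likelihood
% that the hypothesis assigns to the training data D.
\[
  h^{*} \;=\; \operatorname*{arg\,max}_{h} \; P_{\mathrm{HCH}}(h) \cdot P(D \mid h)
\]
```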
It still doesn’t seem to me like you’ve sufficiently answered the objection here.
I tend to think that HCH is not dangerous, but I agree that it’s likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.
What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed at accomplishing things “in the real world”)?
It seems to me that Eliezer has presented quite compelling arguments that the above is the case, and on a first pass it doesn’t look to me like you’ve countered those arguments.
But that’s not that hard, and there are lots of them that incentivize good non-myopic behavior and are safe to optimize for as long as the optimizer is myopic.
How does a “myopic optimizer” successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about? To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
AI safety via market making is one example, but it’s a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn’t act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).
Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH’s prior times the hypothesis’s likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).
Both of these seem to be examples of solutions that simply push the problem back a step, rather than seeking to eliminate it directly. My model of Eliezer would call this attempting to manipulate confusion, and caution that, although adding more gears to your perpetual motion machine might make the physics-violating component harder to pick out, it does not change the fact that somewhere within the model is a step that violates physics.
In this case, it seems as though all of your proposals are of the form “Train your model to imitate some process X (where X is non-myopic and potentially unsafe), while adding incentives in favor of myopic behavior during training.” To which my model of Eliezer replies, “Either your model will end up myopic, and not powerful enough to capture the part of X that actually does the useful work we are interested in, or it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper.”
It seems to me that to usefully refute this, you need to successfully argue against Eliezer’s background premise here—the one about power and non-myopic reasoning going hand-in-hand in a deep manner that, while perhaps circumventable via similarly deep insights, is not patchable via shallow methods like “Instead of directly using dangerous process X, we will imitate X, thereby putting an extra layer of abstraction between ourselves and the danger.” My current impression is that you have not been arguing against this background premise at all, and as such I don’t think your arguments hit at the core of what makes Eliezer doubt your proposals.
How does a “myopic optimizer” successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about?
It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would.
To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
The sense in which it’s still myopic is that it’s non-deceptive, which is the only sense that we actually care about.
it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper
The safety improvement that I’m claiming is that it wouldn’t be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?
[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don’t anti-endorse them either, or else I wouldn’t be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer’s, but instead something like my extrapolation of my model of Eliezer, which may not correspond at all to what the real Eliezer thinks.]
> To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
The sense in which it’s still myopic is that it’s non-deceptive, which is the only sense that we actually care about.
> it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper
The safety improvement that I’m claiming is that it wouldn’t be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?
If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?
Conversely, if the myopic agent does not learn to imitate the underlying process to sufficient resolution that unwanted behaviors like deception start carrying over, then it is very likely that the powerful consequentialist properties of the underlying process have not been carried over, either. This is because (on my extrapolation of Eliezer’s model) deceptive behavior, like all other instrumental strategies, arises from consequentialist reasoning, and is deeply tied to such reasoning in a way that is not cleanly separable—which is to say, by default, you do not manage to sever one without also severing the other.
Again, I (my model of Eliezer) do not think the “deep tie” in question is necessarily insoluble; perhaps there is some sufficiently clever method which, if used, would successfully filter out the “unwanted” instrumental behavior (“deception”, in your terminology) from the “wanted” instrumental behavior (planning, coming up with strategies, in general being an effective agent in the real world). But this distinction between “wanted” and “unwanted” is not a natural distinction; it is, in fact, a distinction highly entangled with human concepts and human values, and any “filter” that selects based on said distinction will need to be of similar complexity. (Of identical complexity, in fact, to the whole alignment problem.) “Simple” filters like the thing you are calling “myopia” definitely do not suffice to perform this function.
I’d be interested in hearing which aspect(s) of the above model you disagree with, and why.
If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?
Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you’ll get deception. The solution isn’t to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it’s just to not imitate something that would be deceptive.
It’s perhaps worth pointing out why, if we already have something to imitate that isn’t deceptive, we don’t just run that thing directly—and the answer is that we can’t: all of the sorts of things that might be both competitive and safe to myopically imitate are things like HCH that are too inefficient to run directly.
This is a great thread. Let me see if I can restate the arguments here in different language:
Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob’s brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, “You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it.” Bob’s a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
My understanding of Evan’s argument at this point would be: “Okay; so we don’t have the technology to directly simulate Bob’s brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob’s brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn’t itself introduce novel long-term dependencies that leave room for deception.”
Note that myopia — i.e., the purging of long term dependencies from the training feedback signal — isn’t really conceptually central to the above scheme. Rather it is just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob’s brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it but not otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
At this point my understanding of Eliezer’s counterargument would be: “Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that.” And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the “belling the cat” bit, I believe.)
And at this point, maybe (?) Evan says, “But wait; the Bob-copy isn’t actually a consequentialist because it was trained myopically.” And if that’s what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.
Is this roughly right? Or have I missed something?
Eliezer’s counterargument is “You don’t get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends. The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you.”
To be clear, I agree with this as a response to what Edouard said—and I think it’s a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don’t think it’s a response to what I’m advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn’t be too surprised that it’s not fully clear).
In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that’s clearly not a safe model to amplify, and probably not even a safe model to train in the first place. That’s because instead of getting a model that actually cares about imitating Bob or anything like that, you probably just got some pseudo-aligned mesa-optimizer with an objective that produces behavior that happens to correlate well with Bob’s.
However, there does exist a purely theoretical construct—what would happen if you actually amplified Bob, not an imitation of Bob—that is very likely to be safe and superhuman (though probably still not fully competitive, but we’ll put that aside for now since it doesn’t seem to be the part you’re most skeptical of). Thus, if you could somehow get a model that was in fact trying to imitate amplified Bob, you might be okay—except that that’s not true, because most types of agents, when given the objective of imitating a safe thing, will end up with a bunch of convergent instrumental goals that break that safety. However, I claim that there are natural types of agents (that is, not too complex on a simplicity prior) that, when given the objective of imitating a safe thing, do so safely. That’s what I mean by my step (1) above (and of course, even if such natural agents exist, there’s still a lot you have to do to make sure you get them—that’s the rest of the steps).
But since you seem most skeptical of (1), maybe I’ll try to lay out my basic case for how I think we can get a theory of simple, safe imitators (including simple imitators with arbitrary levels of optimization power):
All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary (similarly to why an approval-directed agent wouldn’t do this sort of thing—the main problem with approval-directed agents just being that human approval is not a very good thing to optimize for).
Specifying a robust Cartesian boundary is not that hard—you just need a good multi-level world-model, which any powerful agent should have to have anyway.
There are remaining issues related to superrationality, but those can be avoided by having a decision theory that ignores them (e.g. the right sort of CDT variant).
There are also some remaining issues related to tiling, but those can be avoided if the Cartesian boundary is structured in such a way that it excludes other agents (this is exactly the trick that LCDT pulls).
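Purely as a toy illustration of the trick being referenced (my rough reading of LCDT, with hypothetical helper names, not the actual formalism): when evaluating an action, the agent treats every node it models as an agent as causally unaffected by that action, and only propagates the action’s effects through non-agent nodes.

```python
# Toy sketch of (one reading of) the LCDT-style decision rule mentioned above.
# All objects here (causal_model and its methods) are hypothetical stand-ins.

def lcdt_expected_utility(action, causal_model, utility):
    outcome = {}
    for node in causal_model.topological_order():
        if causal_model.is_agent(node):
            # Links from the decision into agent nodes are cut: other agents
            # (and the agent's future self) are predicted as if this action
            # could not influence them.
            outcome[node] = causal_model.prior_prediction(node)
        else:
            # Non-agent nodes are computed normally, downstream of the action.
            outcome[node] = causal_model.compute(node, action, outcome)
    return utility(outcome)

def lcdt_choose(actions, causal_model, utility):
    # Pick the action with the highest expected utility under the cut graph.
    return max(actions, key=lambda a: lcdt_expected_utility(a, causal_model, utility))
```

The intended upshot, as I understand it, is that strategies which only pay off through influencing other agents, deception included, never look worthwhile under this evaluation.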
All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary
I’m confused from several directions here. What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate “an agent that optimizes an objective” are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
No—I’m separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you’re going to get it. My step (1) above, which is what I understand that we’re talking about, is just about that first piece: understanding what we’re going to be shooting for when we set up our training process (and then once we know what we’re shooting for we can think about how to set up a training process to actually land there). See “How do we become confident in the safety of a machine learning system?” for understanding this way of thinking about ML systems.
It’s worth pointing out, however, that even when we’re just focusing on that first part, it’s very important that we pay attention to the total complexity that we’re paying in specifying what sort of model we want, since that’s going to determine a lot of how difficult it will be to actually construct a training process that produces such a model. Exactly what sort of complexity we should be paying attention to is a bit unclear, but I think that the best model we currently have of neural network inductive biases is something like a simplicity prior with a speed cap (see here for some empirical evidence for this).
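As a purely illustrative schematic of what “a simplicity prior with a speed cap” might look like (my own formalization, not something from the linked post):

```latex
% Illustrative only: weight functions by description length K(f), but give
% zero prior weight to anything whose runtime exceeds the compute budget T_max.
\[
  P(f) \;\propto\;
  \begin{cases}
    2^{-K(f)} & \text{if } \mathrm{time}(f) \le T_{\max}, \\
    0 & \text{otherwise.}
  \end{cases}
\]
```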
What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute
Broadly speaking, I’d say that a Cartesian boundary is robust if the agent has essentially the same concept of what its action, observation, etc. is regardless of what additional true facts it learns about the world.
The Cartesian boundary itself does nothing to prevent an agent from trying to get more compute to simulate better, but having an objective that’s just specified in terms of actions rather than world states does. If you want a nice simple proof of this, Alex Turner wrote one up here (and discusses it a bit more here), which demonstrates that instrumental convergence disappears when you have an objective specified in terms of action-observation histories rather than world states.
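Schematically, the contrast at issue (my notation, not Turner’s) is between an objective defined over the agent’s own action-observation history and one defined over resulting world states:

```latex
% An objective over the agent's own action-observation history versus one
% over the final world state; the cited result concerns the former class.
\[
  u_{\mathrm{AOH}}(a_1, o_1, \ldots, a_T, o_T)
  \qquad \text{vs.} \qquad
  u_{\mathrm{world}}(s_T)
\]
```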
Like I said above, however, there are still some remaining problems—just having an objective specified in terms of actions isn’t quite enough.
Thanks, that helps. So actually this objection says: “No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you’ve built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours.”
Closer, yeah. In the limit of doing insanely complicated things with Bob you will start to break him even if he is faithfully simulated, since you will be doing things that would break the actual Bob; but I think HCH schemes fail long before they get to that point.
Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it’s imitating. In particular, if it’s imitating humans working on alignment, then it’s at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)
For full strength, this argument requires that:
It emulate the kind of alignment research which the actual humans would do, rather than some other kind of work
It correctly imitates the humans
Once we relax either of those assumptions, the argument gets riskier. A relaxation of the first assumption would be e.g. using HCH in place of humans working normally on the problem for a while (I expect this would not work nearly as well as the actual humans doing normal research, in terms of both safety and capability). The second assumption is where inner alignment problems and Evan’s work enter the picture.
The solution isn’t to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it’s just to not imitate something that would be deceptive.
Okay, I think this helps me understand your view better. Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
If this is the case, however, then it raises two more questions in my view:
(less important) What work is “myopia” doing in the training process? Is the point of running a myopic imitator also just to buy efficiency (how so?), or is “myopia” somehow producing additional safety gains on top of the (already-postulated-to-exist) safety inherent to the underlying (non-deceptive) process?
(more important) It still seems to me that you are presupposing the existence of a (relatively easy-to-get) process that is simultaneously “powerful” and “non-deceptive” (“competitive and safe”, to use the same words you used in your most recent response). Again, on my view (of Eliezer’s view), this is something that is not easy to get; either the cognitive process you are working with is general enough to consider instrumental strategies (out of which deception naturally emerges as a special case), or it is not; this holds unless there is some special “filter” in place that specifically separates wanted instrumental strategies from unwanted ones. I would still like to know if/how you disagree with this, particularly as it concerns things like e.g. HCH, which (based on your previous comments) you seem to view as an example of a “competitive but safe” process.
Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
My model of Evan is gonna jump in here (and he can correct me if I’m wrong), see if it helps….
I like the first part, but I don’t think the “simply efficiency” part is correct.
Instead, I think actually training a model involves real-world model-training things like “running gradient descent on GPUs”. But Process X doesn’t have to involve “running gradient descent on GPUs”. Process X can be a human in the real world, or some process existing in a platonic sandbox, or whatever.
If we train a model to be myopically imitating every step of Process X, we get non-myopia in Process X’s world (e.g. the world of the human making their human plans), but we get myopia in regards to “running gradient descent on GPUs” and such.
I think Evan is using a specific sense of “deception” which is intimately related to “running gradient descent on GPUs”, so he can declare victory over (this form of) “deception”.
(Unless, I guess, instead of imitating the steps of safe non-myopic Process X, we accidentally imitate the steps of dangerous non-myopic Process Y, which is so clever that it figures out that it’s running in a simulation and tries to hack into base reality, or whatever.)
In other words, the reason to do the myopic imitation is that (non-myopic but nevertheless safe) process X is not a trained model, it’s an idea, or ideal. We want to get from there to a trained model without introducing new safety problems in the process.
(Not agreeing or disagreeing with any of this, just probing my understanding.)
[a big assemblage of myopic reasoners which outputs far-reaching plans]
There’s no big assemblage, just one single myopic optimizer.
assemble myopic reasoners
I have no idea where you’re getting this idea of an assemblage from; nowhere did I say anything about that.
this is supposed to happen in HCH
Imitating HCH is just an example; you could substitute in any other myopic objective that might be aligned and competitive instead.
If there’s non-myopicness happening in each step of the human consulting HCH, then the imitator is imitating a non-myopic reasoner and so is non-myopic (and this is compounded by distillation steps).
If that’s how you want to define myopia/non-myopia then sure, you’re welcome to call an HCH imitator non-myopic. But that’s not the version of myopia that I’m working with/care about.
I have no idea where you’re getting this idea of an assemblage from; nowhere did I say anything about that.
Huh. There’s definitely some miscommunication happening...
From the post:
For example, a myopic agent could myopically simulate a strongly-believed-to-be-safe non-myopic process such as HCH, allowing imitative amplification to be done without ever breaking a myopia guarantee
In general, I think it’s just not very hard to leverage careful recursion to turn non-myopic objectives into myopic objectives such that it’s possible for a myopic agent to do well on them
You give HCH + iterative amplification as an example, which I responded to. You say that in general, recursion can allow myopic agents to do well on non-myopic objectives; this sure sounds like making a kind of assemblage in order to get non-myopicness. You link: https://www.lesswrong.com/posts/YWwzccGbcHMJMpT45/ai-safety-via-market-making , which I hadn’t seen before, but at a glance, it (1) predicts and manipulates humans, which are non-myopic reasoners, (2) involves iteration, and (3) as an additional component, uses Amp(M) (an assemblage of myopic reasoners, no?).
you could substitute in any other myopic objective that might be aligned and competitive instead.
Oops, there’s more confusion here. HCH is a myopic objective? I could emit the sentence, “the AI is only trained to predict the answer given by HCH to the question that’s right in front of it”, but I don’t think I understand a perspective in which that’s really myopic, in the sense of not doing consequentialist reasoning about far-reaching plans, given that it’s predicting (1) humans (2) in a big assemblage that (3) by hypothesis successfully answer questions about far-reaching plans (and (4) using Amp, which is a big spot where generalization (e.g. consequentialist generalization) comes in). Could you point me towards a more detailed writeup / discussion about what’s meant by HCH being a relevantly myopic objective, one that responds to the objection that, well, its output does nevertheless get right answers to questions about far-reaching consequences?
myopic objective that might be aligned and competitive instead
I’m interested in whether objectives can be aligned and competitive and myopic. That still seems like the cat-belling step.
If that’s how you want to define myopia/non-myopia then sure, you’re welcome to call an HCH imitator non-myopic. But that’s not the version of myopia that I’m working with/care about.
From point 1. of the OP:
I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive
My best current guess is that you’re saying something like, if the agent is myopic, that means it’s only trained to try to solve the problem right in front of it; so it’s not trained to hide its reasoning in order to game the system across multiple episodes? What’s the argument that this implies non-deceptiveness? (Link would be fine.) I was trying to say, if it’s predicting a far-consequences-understander, it has to do far-consequences-understanding, therefore it’s able to do far-consequences-understanding, therefore it’s (1) liable to, by default, in effect have values it pursues over far-consequences, and (2) is able to effectively pursue those values without further ado. The case for (2) is more clear, since arguendo it is able to do far-consequences-understanding. Maybe the case for (1) needs to be made.
You say that in general, recursion can allow myopic agents to do well on non-myopic objectives; this sure sounds like making a kind of assemblage in order to get non-myopicness.
The recursion there is only in the objective, not in the model itself. So there’s no assemblage anywhere other than in the thing that the model is trying to imitate.
HCH is a myopic objective?
Maybe it’ll be more clear to you if you just replace “imitate HCH” with “imitate Evan” or something like that—of course that’s less likely to result in a model that’s capable enough to do anything interesting, but it has the exact same sorts of problems in terms of getting myopia to work.
My best current guess is that you’re saying something like, if the agent is myopic, that means it’s only trained to try to solve the problem right in front of it; so it’s not trained to hide its reasoning in order to game the system across multiple episodes?
We’re just talking about step (1), so we’re not talking about training at all right now. We’re just trying to figure out what a natural class of agents would be that isn’t deceptive.
if it’s predicting a far-consequences-understander, it has to do far-consequences-understanding, therefore it’s able to do far-consequences-understanding
Agree. Any competitive myopic agent would have to be able to fully understand exactly how to do long-term non-myopic reasoning.
therefore it’s (1) liable to, by default, effectively have values it pursues over far-consequences
Agree by default but not by necessity. For step (1) we’re not trying to figure out what would happen by default if you trained a model on something, we’re just trying to understand what it might look like for an agent to be myopic in a natural way.
just replace “imitate HCH” with “imitate Evan” or something like that
So these are both training-myopic, meaning they are both being trained only to do the task right in front of them, and aren’t (directly) rewarded for behavior that sacrifices reward now for reward in future episodes. Neither seems objective-myopic, meaning both of their objective functions are computed (seemingly necessarily) using far-reaching-consequences-understanding. Neither seems behavior-myopic, meaning both of them would successfully target far-reaching consequences (by assumption of being competitive?). I think if you’re either objective-non-myopic or behavior-non-myopic, then by default you’re thought-non-myopic (meaning you in fact use far-reaching-consequences-understanding in your reasoning). I think if you’re thought-non-myopic, then by default you’re values-non-myopic, meaning you’re pursuing specific far-reaching consequences. I think if you’re values-non-myopic, then you’re almost certainly deceptive, by strong default.
We’re just talking about step (1), so we’re not talking about training at all right now. We’re just trying to figure out what a natural class of agents would be that isn’t deceptive.
For step (1) we’re not trying to figure out what would happen by default if you trained a model on something, we’re just trying to understand what it might look like for an agent to be myopic in a natural way.
In step (1) you wrote:
I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive
I think if something happens by default, that’s a kind of naturalness. Maybe I just want to strengthen the claims above to say “by strong default”. In other words, I’m saying it’s a priori very unnatural to have something that’s behavior-non-myopic but thought-myopic, or thought-non-myopic but not deceptive, and overcoming that unnaturalness is a huge hurdle. I would definitely be interested in your positive reasons for thinking this is possible.
I think if you’re values-non-myopic, then you’re almost certainly deceptive, by strong default.
I think it would help if you tried to walk through how a model with the goal of “imitating Evan” ends up acting deceptively. I claim that as long as you have a notion of myopic imitation that rules out failure modes like acausal trade (e.g. LCDT) and Evan will never act deceptively, then such a model will never act deceptively.
Your steps (2)-(4) seem to rely fairly heavily on the naturality of the class described in (1), e.g. because (2) has to recognize (1)s which requires that we can point to (1)s. If by “with the [[sole?]] goal of imitating Evan” you mean that
A. the model is actually really *only* trying to imitate Evan,
B. the model is competent to not accidentally also try to do something else (e.g. because the ways it pursues its goal are themselves malign under distributional shift), and
C. the training process you use will not tip the internal dynamics of the model over into a strategically malign state (there was never any incentive to prevent that from happening any more robustly than just barely enough to get good answers on the training set, and I think we agree that there’s a whole pile of [ability to understand and pursue far-reaching consequences] sitting in the model, making strategically malign states pretty close in model-space for natural metrics),
then yes this would plausibly not be deceptive, but it seems like a very unnatural class. I tried to argue that it’s unnatural in the long paragraph with the different kinds of myopia, where “by (strong) default” = “it would be unnatural to be otherwise”.
Note that (A) and (B) are not actually that hard—e.g. LCDT solves both problems.
Your (C), in my opinion, is where all the action is, and is in fact the hardest part of this whole story—which is what I was trying to say in the original post when I said that (2) was the hard part.
Okay, I think I’m getting a little more where you’re coming from? Not sure. Maybe I’ll read the LCDT thing soon (though I’m pretty skeptical of those claims).
(Not sure if it’s useful to say this, but as a meta note, from my perspective the words in the post aren’t pinned down enough to make it at all clear that the hard part is (2) rather than (1); you say “natural” in (1), and I don’t know what you mean by that such that (1) isn’t hard.)
Maybe I’m not emphasizing how unnatural I think (A) is. Like, it’s barely even logically consistent. I know that (A) is logically consistent, for some funny construal of “only trying”, because Evan is a perfect imitation of Evan; and more generally a good WBE could maybe be appropriately construed as not trying to do anything other than imitate Evan; and ideally an FAI could be given an instruction so that it doesn’t, say, have any appreciable impacts other than the impacts of an Evan-imitation. For anything that’s remotely natural and not “shaped” like Evan is “shaped”, I’m not sure it even makes sense to be only trying to imitate Evan; to imitate Evan you have to do a whole lot of stuff, including strategically arranging cognition, reasoning about far-reaching consequences in general, etc., which already constitutes trying to do something other than imitating Evan. When you’re doing consequentialist reasoning, that already puts you very close in algorithm-space to malign strategic thinking, so “consequentialist but not deceptive (hence not malignly consequentialist)” is very unnatural; IMO like half of the whole alignment problem is “get consequentialist reasoning that isn’t consequentialisting towards some random thing”.
I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences.
Why this seems true:
Any planning process which robustly succeeds must behave differently in the presence of different latent problems.
If I’m going to the store and one of two routes may be closed down, and I want to always arrive at the store, my plan must somehow behave differently in the presence of the two possible latent complications (which road is closed).
A pivotal act requires a complicated plan with lots of possible latent problems.
Any implementing process (like an AI) which robustly enacts a complicated plan (like destroying unaligned AGIs) must somehow behave differently in the presence of many different problems (like the designers trying to shut down the AI).
Thus, robustly pulling off a pivotal act requires some kind of “reasoning about far-reaching consequences” on the latent world state.
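One way to state this a bit more formally (my own notation, just restating the argument above): write π(ℓ) for whatever the planner ends up outputting when the latent state is ℓ. If the plan must succeed for every latent state, and two latent states admit no common successful action, then the plan must differ between them, i.e. the planning process has to be sensitive to the latent state.

```latex
% If \pi succeeds for all latent states, and no single action succeeds in
% both \ell_1 and \ell_2, then \pi must output different actions for them.
\[
  \Bigl[\forall \ell:\ \mathrm{succeeds}\bigl(\pi(\ell), \ell\bigr)\Bigr]
  \;\wedge\;
  \Bigl[\neg\exists a:\ \mathrm{succeeds}(a, \ell_1) \wedge \mathrm{succeeds}(a, \ell_2)\Bigr]
  \;\Longrightarrow\;
  \pi(\ell_1) \neq \pi(\ell_2)
\]
```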
The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do.
Why do you expect this to be any easier than directing that optimisation towards the goal of “doing what the human wants”? In particular, if you train a system on the objective “imitate HCH”, why wouldn’t it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still has to do long-term planning anyway.
(I feel like this is basically the same set of concerns/objections that I raised in this post. I also think that myopia is a fairly central example of the thing that Eliezer was objecting to with his “water” metaphor in our dialogue, and I endorse his objection in this context.)
Why do you expect this to be any easier than directing that optimisation towards the goal of “doing what the human wants”? In particular, if you train a system on the objective “imitate HCH”, why wouldn’t it just end up with the same long-term goals as HCH has?
To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”—but just training on “imitating HCH” is definitely not the plan. See (2), (3), (4), (5) for how we actually get an agent that satisfies (1).
In terms of ease of getting (1)/naturalness of (1), all we need out of (1) there is for our concept of myopia to not cost so many bits that it’s too unnatural to get (2), (3), and (4) to work, not that it’s the most natural thing for you to get if all you do is just train on imitative amplification.
That all makes sense. But I had a skim of (2), (3), (4), and (5) and it doesn’t seem like they help explain why myopia is significantly more natural than “obey humans”?
I mean, that’s because this is just a sketch, but a simple argument for why myopia is more natural than “obey humans” is that if we don’t care about competitiveness, we already know how to build myopic optimizers, whereas we don’t know how to build an optimizer to “obey humans” at any level of capabilities.
Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency. I suspect we can get much better upper bounds on the complexity than that, though.
Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency.
It’s an interesting idea, but are you confident that LCDT actually works? E.g. have you thought more about the issues I talked about here and concluded they’re not serious problems?
I still don’t see how we could get e.g. an HCH simulator without agentic components (or the simulator’s qualifying as an agent). As soon as an LCDT agent expects that it may create agentic components in its simulation, it’s going to reason horribly about them (e.g. assuming that any adjustment it makes to other parts of its simulation can’t possibly impact their existence or behaviour, relative to the prior).
I think LCDT does successfully remove the incentives you’re aiming to remove. I just expect it to be too broken to do anything useful. I can’t currently see how we could get the good parts without the brokenness.
I think you might be able to design advanced nanosystems without AI doing long term real world optimization.
Well a sufficiently large team of smart humans could probably design nanotech. The question is how much an AI could help.
Suppose unlimited compute. You program a simulation of quantum field theory. Add a GUI to see visualizations and move atoms around. Designing nanosystems is already quite a bit easier.
Now suppose you brute force search over all arrangements of 100 atoms within a 1nm box, searching for the configuration that most efficiently transfers torque.
You do similar searches for the smallest arrangement of atoms needed to make a functioning logic gate.
Then you download an existing microprocessor design, and copy it (but smaller) using your nanologic gates.
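A toy sketch of what that search step might look like (the `simulate` scorer is a hypothetical stand-in for the quantum field theory simulation, and random sampling stands in for the exhaustive enumeration described above):

```python
# Toy sketch of the brute-force configuration search described above.
# `simulate(config)` is a hypothetical physics oracle that scores how
# efficiently a configuration transfers torque.
import random

def random_configuration(n_atoms=100, box_nm=1.0, elements=("C", "H", "O", "N")):
    # One candidate arrangement: an element and an (x, y, z) position per atom.
    return [(random.choice(elements),
             random.uniform(0, box_nm),
             random.uniform(0, box_nm),
             random.uniform(0, box_nm))
            for _ in range(n_atoms)]

def search(simulate, n_candidates=10**6):
    best_config, best_score = None, float("-inf")
    for _ in range(n_candidates):
        config = random_configuration()
        score = simulate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```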
I know that if you start brute forcing over a trillion atoms, you might find a mesaoptimizer. (Although even then I would suspect that visual inspection shouldn’t result in anything brain-hacky. It would only be actually synthesizing such a thing that was dangerous. (Or maybe possibly simulating it, if the mesaoptimizer realizes it’s in a simulation and there are general simulation escape strategies.))
So look at the static output of your brute forcing. If you see anything that looks computational, delete it. Don’t brute force anything too big.
(Obviously you need human engineers here, any long term real world planning is coming from them.)
The notion of (1) seems like the cat-belling problem here; the other steps don’t seem interesting by comparison, the equivalent of talking about all the neat things to do after belling the cat.
What pivotal act is this AGI supposed to be executing? Designing a medium-strong nanosystem? How would you do that via a myopic system? That means the AGI needs to design a nanosystem whose purpose spans over time and whose current execution has distant good consequences. It doesn’t matter whether you claim it’s being done by something that internally looks like myopic HCH any more than it matters that it’s being done by internal transistors that don’t have tiny models of the future inside themselves. What’s consequentialist and farseeing isn’t the transistors, or the floating-point multiplications, or the elaborate HCH or whatever, it’s the actual work and actual problem being solved by the system whereby it produces a nanosystem that has coherent effects on the physical world spanning hours and days.
I’m surprised that you think (1) is the hard part—though (1) is what I’m currently working on, since I think it’s necessary to make a lot of the other parts go through, I expect it to be one of the easiest parts of the story to make work.
I left this part purposefully vague, but I’m happy to accept designing a medium-strong nanosystem as the pivotal act to consider here for the sake of argument, since I think that if your advanced AI can’t at least do that, then it probably can’t do anything else pivotal either.
Agreed.
I think this is where you misunderstand me. I suspect that you don’t really understand what I mean by myopia.
Let me see if I can explain, just using the HCH example. Though I suspect that imitating HCH is actually not powerful enough to do a pivotal act—and I suspect you agree—it’s a perfectly good example to showcase what I mean by myopia.
To start with, the optimization wouldn’t be done by HCH, or anything that would internally look like HCH in the slightest—rather, the optimization would be done by whatever powerful optimization process is inside of our model. Where myopia comes into play is in what goal we’re trying to direct that optimization towards. The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do. In such a situation, you would have a model that can use its own powerful internal optimization procedures to imitate what HCH would do as effectively as possible—able to do things like effectively manage cognitive resources and reason about how best to go about producing an action that is as close as possible to HCH.
The natural class that I think this example is pointing to is the class of optimizers that optimize for an objective that is exclusively about their action through a Cartesian boundary, rather than the consequences of their action on the world. Such optimizers can still end up producing actions with far-reaching consequences on the world if they deploy their optimization power in the service of an objective like imitating HCH that requires producing actions with particular consequences, however. In such a situation, the model would be actively doing lots of reasoning about the consequences of its actions on the world, but not for the goal of producing a particular consequence, but rather just for the goal of producing a particular action, e.g. the one that matches up to what HCH would do. Thus, optimizers of this form can do all sorts of extremely powerful, long-term, non-myopic tasks—but without ever having any incentive to act deceptively.
Notably, there are a bunch of nuances here, regarding things like ensuring that the agent doesn’t end up optimizing its objective non-myopically because of acausal trade considerations, ensuring that it doesn’t want to self-modify into a different sort of agent, making sure it doesn’t just spin up other agents that act non-myopically, etc., but these problems are really not that hard to solve. As a proof of concept, LCDT definitely solves all of these problems, showcasing that an optimizing system that really “just imitates HCH” is possible. Unfortunately, LCDT is not quite as natural as I would like, since it requires paying a bunch of bits of complexity to specify a fundamental concept of an “agent,” such that I don’t think that the final solution here will actually look much like LCDT. Rather, I suspect that a more natural class of myopic agents will come from something more like analyzing the general properties of different types of optimizers over Cartesian boundaries.
Regardless, I strongly doubt that just developing a proper notion of myopia here poses a fundamental obstacle—we already have evidence that optimizers of this form are possible and can have the desired properties, and the basic concept of “an optimizer that just cares about its next action” is natural enough that I’d be quite surprised if we couldn’t fully systematize it. I do suspect that any systematization will require paying the complexity of specifying a Cartesian boundary, but I’d be quite surprised if that cost us enough complexity to make the desired class too unnatural.
Certainly it doesn’t matter what substrate the computation is running on. I don’t think this is really engaging with anything that I’m saying.
I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I’m guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren’t myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.
Reasoning correctly about far-reaching consequences by default (1) has mistargeted consequences, and (2) is done by summoning a dangerous reasoner.
I think what you’re saying here implies that you think it is feasible to assemble myopic reasoners into a non-myopic reasoner, without compromising safety. My possibly-straw understanding is that the way this is supposed to happen in HCH is that, basically, the humans providing the feedback train the imitator(s) to implement a collective message-passing algorithm that answers any reasonable question or whatever. This sounds like a non-answer, i.e. it’s just saying “...and then the humans somehow assemble myopic reasoners into a non-myopic reasoner”. Where’s the non-myopicness? If there’s non-myopicness happening in each step of the human consulting HCH, then the imitator is imitating a non-myopic reasoner and so is non-myopic (and this is compounded by distillation steps). If there isn’t non-myopicness happening in each step, how does it come into the assembly?
Endorsed.
To be clear, I agree with this also, but don’t think it’s really engaging with what I’m advocating for—I’m not proposing any sort of assemblage of reasoners; I’m not really sure where that misconception came from.
I don’t think the assemblage is the point. I think the idea here is that “myopia” is a property of problems: a non-myopic problem is (roughly) one which inherently requires doing things with long time horizons. I think Eliezer’s claim is that (1) a (good) pivotal act is probably a non-myopic problem, and (2) you can’t solve a nontrivial nonmyopic problem with a myopic solver. Part (2) is what I think TekhneMakr is gesturing at and Eliezer is endorsing.
My guess is that you have some idea of how a myopic solver can solve a nonmyopic problem (by having it output whatever HCH would do, for instance). And then Eliezer would probably reply that the non-myopia has been wrapped up somewhere else (e.g. in HCH), and that has become the dangerous part (or, more realistically, the insufficiently capable part, and I expect Eliezer would claim that replacing it with something both sufficiently capable and aligned is about as hard as the whole alignment problem). I’m not sure what your response would be to that.
Yeah, that’s right, I definitely agree with (1) and disagree with (2).
I tend to think that HCH is not dangerous, but I agree that it’s likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful. But that’s not that hard, and there are lots of such objectives that can incentivize good non-myopic behavior and that are safe to optimize for as long as the optimizer is myopic.
AI safety via market making is one example, but it’s a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn’t act deceptively, though I suspect it would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).
Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH’s prior times the hypothesis’s likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).
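Written out, the objective described here is roughly (notation introduced only for illustration, with h ranging over hypotheses and D the available data):

$$h^{*} \;=\; \operatorname*{arg\,max}_{h}\; P_{\text{HCH}}(h)\cdot P(D \mid h)$$

The model’s task is still just “output this particular hypothesis” (an objective on its action), even though the chosen hypothesis itself encodes far-reaching knowledge about the world.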
It still doesn’t seem to me like you’ve sufficiently answered the objection here.
What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed at accomplishing things “in the real world”)?
It seems to me that Eliezer has presented quite compelling arguments that the above is the case, and on a first pass it doesn’t look to me like you’ve countered those arguments.
How does a “myopic optimizer” successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about? To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
Both of these seem to be examples of solutions that simply push the problem back a step, rather than seeking to eliminate it directly. My model of Eliezer would call this attempting to manipulate confusion, and caution that, although adding more gears to your perpetual motion machine might make the physics-violating component harder to pick out, it does not change the fact that somewhere within the model is a step that violates physics.
In this case, it seems as though all of your proposals are of the form “Train your model to imitate some process X (where X is non-myopic and potentially unsafe), while adding incentives in favor of myopic behavior during training.” To which my model of Eliezer replies, “Either your model will end up myopic, and not powerful enough to capture the part of X that actually does the useful work we are interested in, or it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper.”
It seems to me that to usefully refute this, you need to successfully argue against Eliezer’s background premise here—the one about power and non-myopic reasoning going hand-in-hand in a deep manner that, while perhaps circumventable via similarly deep insights, is not patchable via shallow methods like “Instead of directly using dangerous process X, we will imitate X, thereby putting an extra layer of abstraction between ourselves and the danger.” My current impression is that you have not been arguing against this background premise at all, and as such I don’t think your arguments hit at the core of what makes Eliezer doubt your proposals.
It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would.
The sense that it’s still myopic is in the sense that it’s non-deceptive, which is the only sense that we actually care about.
The safety improvement that I’m claiming is that it wouldn’t be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?
[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don’t anti-endorse them either, or else I wouldn’t be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer’s, but instead something like my extrapolation of my model of Eliezer, which may not correspond at all to what the real Eliezer thinks.]
If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?
Conversely, if the myopic agent does not learn to imitate the underlying process to sufficient resolution that unwanted behaviors like deception start carrying over, then it is very likely that the powerful consequentialist properties of the underlying process have not been carried over, either. This is because (on my extrapolation of Eliezer’s model) deceptive behavior, like all other instrumental strategies, arises from consequentialist reasoning, and is deeply tied to such reasoning in a way that is not cleanly separable—which is to say, by default, you do not manage to sever one without also severing the other.
Again, I (my model of Eliezer) do not think the “deep tie” in question is necessarily insoluble; perhaps there is some sufficiently clever method which, if used, would successfully filter out the “unwanted” instrumental behavior (“deception”, in your terminology) from the “wanted” instrumental behavior (planning, coming up with strategies, in general being an effective agent in the real world). But this distinction between “wanted” and “unwanted” is not a natural distinction; it is, in fact, a distinction highly entangled with human concepts and human values, and any “filter” that selects based on said distinction will need to be of similar complexity. (Of identical complexity, in fact, to the whole alignment problem.) “Simple” filters like the thing you are calling “myopia” definitely do not suffice to perform this function.
I’d be interested in hearing which aspect(s) of the above model you disagree with, and why.
Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you’ll get deception. The solution isn’t to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it’s just to not imitate something that would be deceptive.
It’s perhaps worth pointing out why, if we already have something to imitate that isn’t deceptive, we don’t just run that thing directly—and the answer is that we can’t: all of the sorts of things that might be both competitive and safe to myopically imitate are things like HCH that are too inefficient to run directly.
This is a great thread. Let me see if I can restate the arguments here in different language:
Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob’s brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, “You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it.” Bob’s a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
My understanding of Evan’s argument at this point would be: “Okay; so we don’t have the technology to directly simulate Bob’s brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob’s brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn’t itself introduce novel long-term dependencies that leave room for deception.”
Note that myopia — i.e., the purging of long term dependencies from the training feedback signal — isn’t really conceptually central to the above scheme. Rather it is just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob’s brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it but not otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
At this point my understanding of Eliezer’s counterargument would be: “Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that.” And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the “belling the cat” bit, I believe.)
And at this point, maybe (?) Evan says, “But wait; the Bob-copy isn’t actually a consequentialist because it was trained myopically.” And if that’s what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.
Is this roughly right? Or have I missed something?
Eliezer’s counterargument is “You don’t get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends. The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you.”
To be clear, I agree with this as a response to what Edouard said—and I think it’s a legitimate response to anyone proposing we just do straightforward imitative amplification, but I don’t think it’s a response to what I’m advocating for in this post (though to be fair, this post was just a quick sketch, so I suppose I shouldn’t be too surprised that it’s not fully clear).
In my opinion, if you try to imitate Bob and get a model that looks like it behaves similarly to Bob, but have no other guarantees about it, that’s clearly not a safe model to amplify, and probably not even a safe model to train in the first place. That’s because instead of getting a model that actually cares about imitating Bob or anything like that, you probably just got some pseudo-aligned mesa-optimizer with an objective that produces behavior that happens to correlate well with Bob’s.
However, there does exist a purely theoretical construct—what would happen if you actually amplified Bob, not an imitation of Bob—that is very likely to be safe and superhuman (though probably still not fully competitive, but we’ll put that aside for now since it doesn’t seem to be the part you’re most skeptical of). Thus, if you could somehow get a model that was in fact trying to imitate amplified Bob, you might be okay—except that that’s not true, because most types of agents, when given the objective of imitating a safe thing, will end up with a bunch of convergent instrumental goals that break that safety. However, I claim that there are natural types of agents (that is, not too complex on a simplicity prior) that, when given the objective of imitating a safe thing, do so safely. That’s what I mean by my step (1) above (and of course, even if such natural agents exist, there’s still a lot you have to do to make sure you get them—that’s the rest of the steps).
But since you seem most skeptical of (1), maybe I’ll try to lay out my basic case for how I think we can get a theory of simple, safe imitators (including simple imitators with arbitrary levels of optimization power):
All the really basic concerns—e.g. it tries to get more compute so it can simulate better—can be solved by having a robust Cartesian boundary and having an agent that optimizes an objective defined on actions through the boundary (similarly to why an approval-directed agent wouldn’t do this sort of thing—the main problem with approval-directed agents just being that human approval is not a very good thing to optimize for).
Specifying a robust Cartesian boundary is not that hard—you just need a good multi-level world-model, which any powerful agent would have to have anyway.
There are remaining issues related to superrationality, but those can be avoided by having a decision theory that ignores them (e.g. the right sort of CDT variant).
There are also some remaining issues related to tiling, but those can be avoided if the Cartesian boundary is structured in such a way that it excludes other agents (this is exactly the trick that LCDT pulls).
I’m confused from several directions here. What is a “robust” Cartesian boundary, why do you think this stops an agent from trying to get more compute, and when you postulate “an agent that optimizes an objective” are you imagining something much more like an old chess-playing system with a known objective than a modern ML system with a loss function?
No—I’m separating out two very important pieces that go into training a machine learning model: what sort of model you want to get and how you’re going to get it. My step (1) above, which is what I understand that we’re talking about, is just about that first piece: understanding what we’re going to be shooting for when we set up our training process (and then once we know what we’re shooting for we can think about how to set up a training process to actually land there). See “How do we become confident in the safety of a machine learning system?” for understanding this way of thinking about ML systems.
It’s worth pointing out, however, that even when we’re just focusing on that first part, it’s very important that we pay attention to the total complexity that we’re paying in specifying what sort of model we want, since that’s going to determine a lot of how difficult it will be to actually construct a training process that produces such a model. Exactly what sort of complexity we should be paying attention to is a bit unclear, but I think that the best model we currently have of neural network inductive biases is something like a simplicity prior with a speed cap (see here for some empirical evidence for this).
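For concreteness, one crude way to write down “a simplicity prior with a speed cap” (an assumed formalization offered only to fix ideas, not a claim about the linked evidence) is to weight models by description length while restricting to models that run within some time budget T:

$$P(f)\;\propto\; 2^{-\lvert f\rvert}\cdot \mathbf{1}\big[\,\mathrm{time}(f)\le T\,\big]$$

where |f| is the description length of f and time(f) its per-input runtime.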
Broadly speaking, I’d say that a Cartesian boundary is robust if the agent has essentially the same concept of what its action, observation, etc. is regardless of what additional true facts it learns about the world.
The Cartesian boundary itself does nothing to prevent an agent from trying to get more compute to simulate better, but having an objective that’s just specified in terms of actions rather than world states does. If you want a nice simple proof of this, Alex Turner wrote one up here (and discusses it a bit more here), which demonstrates that instrumental convergence disappears when you have an objective specified in terms of action-observation histories rather than world states.
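To make the action-versus-world-state distinction concrete, here is a minimal toy sketch (purely illustrative; `target_action`, `similarity`, and the `world_model.rollout` interface are assumed placeholders, not code from any of the linked posts):

```python
from typing import Callable

Observation = str
Action = str

def action_objective(
    target_action: Callable[[Observation], Action],  # e.g. "what HCH would do here" (assumed oracle)
    similarity: Callable[[Action, Action], float],    # similarity metric on the action space
) -> Callable[[Observation, Action], float]:
    """Objective defined on the action itself, through the Cartesian boundary.

    Nothing downstream of the action (resources acquired, future episodes,
    the overseer's beliefs) enters the score.
    """
    def score(obs: Observation, act: Action) -> float:
        return similarity(act, target_action(obs))
    return score

def state_objective(world_model, utility_over_states) -> Callable[[Observation, Action], float]:
    """Contrast case: objective defined on predicted future world states.

    This is the kind of objective the instrumental-convergence argument
    applies to, where acquiring compute or avoiding shutdown can help.
    """
    def score(obs: Observation, act: Action) -> float:
        return utility_over_states(world_model.rollout(obs, act))
    return score
```

The sketch only gestures at the type distinction; as the next paragraph notes, an action-only objective by itself is not quite enough.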
Like I said above, however, there are still some remaining problems—just having an objective specified in terms of actions isn’t quite enough.
Thanks, that helps. So actually this objection says: “No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you’ve built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours.”
Is that an accurate interpretation?
Closer, yeah. In the limit of doing insanely complicated things with Bob you will start to break him even if he is faithfully simulated; you will be doing things that would break the actual Bob. But I think HCH schemes fail long before they get to that point.
Gotcha. Well, that seems right—certainly in the limit case.
Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it’s imitating. In particular, if it’s imitating humans working on alignment, then it’s at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)
For full strength, this argument requires that:
It emulate the kind of alignment research which the actual humans would do, rather than some other kind of work
It correctly imitate the humans
Once we relax either of those assumptions, the argument gets riskier. A relaxation of the first assumption would be e.g. using HCH in place of humans working normally on the problem for a while (I expect this would not work nearly as well as the actual humans doing normal research, in terms of both safety and capability). The second assumption is where inner alignment problems and Evan’s work enter the picture.
Okay, I think this helps me understand your view better. Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
If this is the case, however, then it raises two more questions in my view:
(less important) What work is “myopia” doing in the training process? Is the point of running a myopic imitator also just to buy efficiency (how so?), or is “myopia” somehow producing additional safety gains on top of the (already-postulated-to-exist) safety inherent to the underlying (non-deceptive) process?
(more important) It still seems to me that you are presupposing the existence of a (relatively easy-to-get) process that is simultaneously “powerful” and “non-deceptive” (“competitive and safe”, to use the same words you used in your most recent response). Again, on my view (of Eliezer’s view), this is something that is not easy to get; either the cognitive process you are working with is general enough to consider instrumental strategies (out of which deception naturally emerges as a special case), or it is not; this holds unless there is some special “filter” in place that specifically separates wanted instrumental strategies from unwanted ones. I would still like to know if/how you disagree with this, particularly as it concerns things like e.g. HCH, which (based on your previous comments) you seem to view as an example of a “competitive but safe” process.
My model of Evan is gonna jump in here (and he can correct me if I’m wrong); see if it helps…
I like the first part, but I don’t think the “simply efficiency” part is correct.
Instead I think: actually training a model involves real-world model-training things like “running gradient descent on GPUs”. But Process X doesn’t have to involve “running gradient descent on GPUs”. Process X can be a human in the real world, or some process existing in a platonic sandbox, or whatever.
If we train a model to be myopically imitating every step of Process X, we get non-myopia in Process X’s world (e.g. the world of the human making their human plans), but we get myopia in regards to “running gradient descent on GPUs” and such.
I think Evan is using a specific sense of “deception” which is intimately related to “running gradient descent on GPUs”, so he can declare victory over (this form of) “deception”.
(Unless, I guess, instead of imitating the steps of safe non-myopic Process X, we accidentally imitate the steps of dangerous non-myopic Process Y, which is so clever that it figures out that it’s running in a simulation and tries to hack into base reality, or whatever.)
In other words, the reason to do the myopic imitation is that (non-myopic but nevertheless safe) process X is not a trained model, it’s an idea, or ideal. We want to get from there to a trained model without introducing new safety problems in the process.
(Not agreeing or disagreeing with any of this, just probing my understanding.)
There’s no big assemblage, just one single myopic optimizer.
I have no idea where you’re getting this idea of an assemblage from; nowhere did I say anything about that.
Imitating HCH is just an example, you could substitute in any other myopic objective that might be aligned and competitive instead.
If that’s how you want to define myopia/non-myopia then sure, you’re welcome to call an HCH imitator non-myopic. But that’s not the version of myopia that I’m working with/care about.
Huh. There’s definitely some miscommunication happening...
From the post:
You give HCH + iterative amplification as an example, which I responded to. You say that in general, recursion can allow myopic agents to do well on non-myopic objectives; this sure sounds like making a kind of assemblage in order to get non-myopicness. You link: https://www.lesswrong.com/posts/YWwzccGbcHMJMpT45/ai-safety-via-market-making , which I hadn’t seen before, but at a glance, it (1) predicts and manipulates humans, which are non-myopic reasoners, (2) involves iteration, and (3) as an additional component, uses Amp(M) (an assemblage of myopic reasoners, no?).
Oops, there’s more confusion here. HCH is a myopic objective? I could emit the sentence, “the AI is only trained to predict the answer given by HCH to the question that’s right in front of it”, but I don’t think I understand a perspective in which that’s really myopic, in the sense of not doing consequentialist reasoning about far-reaching plans, given that it’s predicting (1) humans (2) in a big assemblage that (3) by hypothesis successfully answer questions about far-reaching plans (and (4) using Amp, which is a big spot where generalization (e.g. consequentialist generalization) comes in). Could you point me towards a more detailed writeup / discussion about what’s meant by HCH being a relevantly myopic objective, one that responds to the objection that, well, its output does nevertheless get right answers to questions about far-reaching consequences?
I’m interested in whether objectives can be aligned and competitive and myopic. That still seems like the cat-belling step.
From point 1. of the OP:
My best current guess is that you’re saying something like, if the agent is myopic, that means it’s only trained to try to solve the problem right in front of it; so it’s not trained to hide its reasoning in order to game the system across multiple episodes? What’s the argument that this implies non-deceptiveness? (Link would be fine.) I was trying to say, if it’s predicting a far-consequences-understander, it has to do far-consequences-understanding, therefore it’s able to do far-consequences-understanding, therefore it’s (1) liable to, by default, in effect have values it pursues over far-consequences, and (2) is able to effectively pursue those values without further ado. The case for (2) is more clear, since arguendo it is able to do far-consequences-understanding. Maybe the case for (1) needs to be made.
The recursion there is only in the objective, not in the model itself. So there’s no assemblage anywhere other than in the thing that the model is trying to imitate.
Maybe it’ll be more clear to you if you just replace “imitate HCH” with “imitate Evan” or something like that—of course that’s less likely to result in a model that’s capable enough to do anything interesting, but it has the exact same sorts of problems in terms of getting myopia to work.
We’re just talking about step (1), so we’re not talking about training at all right now. We’re just trying to figure out what a natural class of agents would be that isn’t deceptive.
Agree. Any competitive myopic agent would have to be able to fully understand exactly how to do long-term non-myopic reasoning.
Agree by default but not by necessity. For step (1) we’re not trying to figure out what would happen by default if you trained a model on something, we’re just trying to understand what it might look like for an agent to be myopic in a natural way.
So these are both training-myopic, meaning they both are being trained only to do the task right in front of them, and aren’t (directly) rewarded for behavior that sacrifices reward now for reward in future episodes. Neither seems objective-myopic, meaning both of their objective functions are computed (seemingly necessarily) using far-reaching-consequences-understanding. Neither seems behavior-myopic, meaning both of them would successfully target far-reaching-consequences (by assumption of being competitive?). I think if you’re either objective-non-myopic or behavior-non-myopic, then by default you’re thought-non-myopic (meaning you in fact use far-reaching-consequences-understanding in your reasoning). I think if you’re thought-non-myopic, then by default you’re values-non-myopic, meaning you’re pursuing specific far-reaching-consequences. I think if you’re values-non-myopic, then you’re almost certainly deceptive, by strong default.
In step (1) you wrote:
I think if something happens by default, that’s a kind of naturalness. Maybe I just want to strengthen the claims above to say “by strong default”. In other words, I’m saying it’s a priori very unnatural to have something that’s behavior-non-myopic but thought-myopic, or thought-non-myopic but not deceptive, and overcoming that unnaturality is a huge hurdle. I would definitely be interested in your positive reasons for thinking this is possible.
I think it would help if you tried to walk through how a model with the goal of “imitating Evan” ends up acting deceptively. I claim that as long as you have a notion of myopic imitation that rules out failure modes like acausal trade (e.g. LCDT), and Evan will never act deceptively, then such a model will never act deceptively.
Your steps (2)-(4) seem to rely fairly heavily on the naturality of the class described in (1), e.g. because (2) has to recognize (1)s, which requires that we can point to (1)s. If by “with the [[sole?]] goal of imitating Evan” you mean that
A. the model is actually really *only* trying to imitate Evan,
B. the model is competent to not accidentally also try to do something else (e.g. because the ways it pursues its goal are themselves malign under distributional shift), and
C. the training process you use will not tip the internal dynamics of the model over into a strategically malign state (there was never any incentive to prevent that from happening any more robustly than just barely enough to get good answers on the training set, and I think we agree that there’s a whole pile of [ability to understand and pursue far-reaching consequences] sitting in the model, making strategically malign states pretty close in model-space for natural metrics),
then yes this would plausibly not be deceptive, but it seems like a very unnatural class. I tried to argue that it’s unnatural in the long paragraph with the different kinds of myopia, where “by (strong) default” = “it would be unnatural to be otherwise”.
Note that (A) and (B) are not actually that hard—e.g. LCDT solves both problems.
Your (C), in my opinion, is where all the action is, and is in fact the hardest part of this whole story—which is what I was trying to say in the original post when I said that (2) was the hard part.
Okay, I think I’m getting a little more where you’re coming from? Not sure. Maybe I’ll read the LCDT thing soon (though I’m pretty skeptical of those claims).
(Not sure if it’s useful to say this, but as a meta note, from my perspective the words in the post aren’t pinned down enough to make it at all clear that the hard part is (2) rather than (1); you say “natural” in (1), and I don’t know what you mean by that such that (1) isn’t hard.)
Maybe I’m not emphasizing how unnatural I think (A) is. Like, it’s barely even logically consistent. I know that (A) is logically consistent, for some funny construal of “only trying”, because Evan is a perfect imitation of Evan; and more generally a good WBE could maybe be appropriately construed as not trying to do anything other than imitate Evan; and ideally an FAI could be given an instruction so that it doesn’t, say, have any appreciable impacts other than the impacts of an Evan-imitation. For anything that’s remotely natural and not “shaped” like Evan is “shaped”, I’m not sure it even makes sense to be only trying to imitate Evan; to imitate Evan you have to do a whole lot of stuff, including strategically arranging cognition, reasoning about far-reaching consequences in general, etc., which already constitutes trying to do something other than imitating Evan. When you’re doing consequentialist reasoning, that already puts you very close in algorithm-space to malign strategic thinking, so “consequentialist but not deceptive (hence not malignly consequentialist)” is very unnatural; IMO like half of the whole alignment problem is “get consequentialist reasoning that isn’t consequentialisting towards some random thing”.
Why this seems true:
Any planning process which robustly succeeds must behave differently in the presence of different latent problems.
If I’m going to the store and one of two routes may be closed down, and I want to always arrive at the store, my plan must somehow behave differently in the presence of the two possible latent complications (the road which is closed); a toy sketch of this appears just below, after this argument.
A pivotal act requires a complicated plan with lots of possible latent problems.
Any implementing process (like an AI) which robustly enacts a complicated plan (like destroying unaligned AGIs) must somehow behave differently in the presence of many different problems (like the designers trying to shut down the AI).
Thus, robustly pulling off a pivotal act requires some kind of “reasoning about far-reaching consequences” on the latent world state.
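A purely illustrative sketch of the store example above (the function names and the closure flag are assumed, just to make the branching explicit):

```python
def robust_route_to_store(road_a_closed: bool) -> list[str]:
    """A plan that always reaches the store must behave differently
    depending on the latent problem (which road is closed)."""
    if road_a_closed:
        return ["take road B", "arrive at store"]
    return ["take road A", "arrive at store"]

def fixed_route_to_store() -> list[str]:
    """A plan that ignores the latent state fails whenever road A is closed."""
    return ["take road A", "arrive at store"]
```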
Yep, I agree with that. That’s orthogonal to myopia as I use the term, though.
(Seems someone −7’d this; would be interested in why.)
Why do you expect this to be any easier than directing that optimisation towards the goal of “doing what the human wants”? In particular, if you train a system on the objective “imitate HCH”, why wouldn’t it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still has to do long-term planning anyway.
(I feel like this is basically the same set of concerns/objections that I raised in this post. I also think that myopia is a fairly central example of the thing that Eliezer was objecting to with his “water” metaphor in our dialogue, and I endorse his objection in this context.)
To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”—but just training on “imitating HCH” is definitely not the plan. See (2), (3), (4), (5) for how we actually get an agent that satisfies (1).
In terms of ease of getting (1)/naturalness of (1), all we need out of (1) there is for our concept of myopia to not cost so many bits that it’s too unnatural to get (2), (3), and (4) to work, not that it’s the most natural thing for you to get if all you do is just train on imitative amplification.
That all makes sense. But I had a skim of (2), (3), (4), and (5) and it doesn’t seem like they help explain why myopia is significantly more natural than “obey humans”?
I mean, that’s because this is just a sketch, but a simple argument for why myopia is more natural than “obey humans” is that if we don’t care about competitiveness, we already know how to build myopic optimizers, whereas we don’t know how to build an optimizer to “obey humans” at any level of capabilities.
Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency. I suspect we can get much better upper bounds on the complexity than that, though.
It’s an interesting idea, but are you confident that LCDT actually works? E.g. have you thought more about the issues I talked about here and concluded they’re not serious problems?
I still don’t see how we could get e.g. an HCH simulator without agentic components (or the simulator’s qualifying as an agent).
As soon as an LCDT agent expects that it may create agentic components in its simulation, it’s going to reason horribly about them (e.g. assuming that any adjustment it makes to other parts of its simulation can’t possibly impact their existence or behaviour, relative to the prior).
I think LCDT does successfully remove the incentives you’re aiming to remove. I just expect it to be too broken to do anything useful. I can’t currently see how we could get the good parts without the brokenness.
What are you referring to here?
This seems like a very important crux—maybe there should be a scheduled debate on this?
I think you might be able to design advanced nanosystems without AI doing long-term real-world optimization.
Well, a sufficiently large team of smart humans could probably design nanotech. The question is how much an AI could help.
Suppose unlimited compute. You program a simulation of quantum field theory. Add a GUI to see visualizations and move atoms around. Designing nanosystems is already quite a bit easier.
Now suppose you brute force search over all arrangements of 100 atoms within a 1nm box, searching for the configuration that most efficiently transfers torque.
You do similar searches for the smallest arrangement of atoms needed to make a functioning logic gate. (A toy sketch of this kind of brute-force search appears below.)
Then you download an existing microprocessor design, and copy it (but smaller) using your nanologic gates.
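A purely illustrative sketch of the brute-force configuration search described above (the scoring function is a placeholder for the assumed quantum-field-theory simulation; none of this is a real chemistry API, and the real search space is of course astronomically larger than the capped enumeration here):

```python
import itertools

GRID = [round(i * 0.1, 1) for i in range(11)]   # discretized positions in a 1 nm box, 0.1 nm steps
ELEMENTS = ["C", "H", "O", "N"]                  # small palette of atom types

def torque_transfer_score(config) -> float:
    """Placeholder: would query the assumed QFT simulator for how efficiently
    this static arrangement of atoms transfers torque."""
    return 0.0

def candidate_configs(n_atoms: int, max_candidates: int):
    """Enumerate a capped number of (element, x, y, z) placements for n_atoms atoms."""
    sites = [(e, x, y, z) for e in ELEMENTS for x in GRID for y in GRID for z in GRID]
    return itertools.islice(itertools.combinations(sites, n_atoms), max_candidates)

def brute_force_search(n_atoms: int = 100, max_candidates: int = 10**6):
    best_score, best_config = float("-inf"), None
    for config in candidate_configs(n_atoms, max_candidates):
        score = torque_transfer_score(config)   # static evaluation of a static structure;
        if score > best_score:                  # no long-horizon planning anywhere in this loop
            best_score, best_config = score, config
    return best_config, best_score
```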
I know that if you start brute forcing over a trillion atoms, you might find a mesaoptimizer. (Although even then I would suspect that inspecting the visualization shouldn’t result in anything brain-hacky. It would only be actually synthesizing such a thing that was dangerous. (Or possibly simulating it, if the mesaoptimizer realizes it’s in a simulation and there are general simulation escape strategies.))
So look at the static output of your brute forcing. If you see anything that looks computational, delete it. Don’t brute force anything too big.
(Obviously you need human engineers here; any long-term real-world planning is coming from them.)