[a big assemblage of myopic reasoners which outputs far-reaching plans]
There’s no big assemblage, just one single myopic optimizer.
assemble myopic reasoners
I have no idea where you’re getting this idea of an assemblage from; nowhere did I say anything about that.
this is supposed to happen in HCH
Imitating HCH is just an example, you could substitute in any other myopic objective that might be aligned and competitive instead.
If there’s non-myopicness happening in each step of the human consulting HCH, then the imitator is imitating a non-myopic reasoner and so is non-myopic (and this is compounded by distillation steps).
If that’s how you want to define myopia/non-myopia then sure, you’re welcome to call an HCH imitator non-myopic. But that’s not the version of myopia that I’m working with/care about.
I have no idea where you’re getting this idea of an assemblage from; nowhere did I say anything about that.
Huh. There’s definitely some miscommunication happening...
From the post:
For example, a myopic agent could myopically simulate a strongly-believed-to-be-safe non-myopic process such as HCH, allowing imitative amplification to be done without ever breaking a myopia guarantee
In general, I think it’s just not very hard to leverage careful recursion to turn non-myopic objectives into myopic objectives such that it’s possible for a myopic agent to do well on them
You give HCH + iterative amplification as an example, which I responded to. You say that in general, recursion can allow myopic agents to do well on non-myopic objectives; this sure sounds like making a kind of assemblage in order to get non-myopicness. You link: https://www.lesswrong.com/posts/YWwzccGbcHMJMpT45/ai-safety-via-market-making , which I hadn’t seen before, but at a glance, it (1) predicts and manipulates humans, which are non-myopic reasoners, (2) involves iteration, and (3) as an additional component, uses Amp(M) (an assemblage of myopic reasoners, no?).
you could substitute in any other myopic objective that might be aligned and competitive instead.
Oops, there’s more confusion here. HCH is a myopic objective? I could emit the sentence, “the AI is only trained to predict the answer given by HCH to the question that’s right in front of it”, but I don’t think I understand a perspective in which that’s really myopic, in the sense of not doing consequentialist reasoning about far-reaching plans, given that it’s predicting (1) humans (2) in a big assemblage that (3) by hypothesis successfully answer questions about far-reaching plans (and (4) using Amp, which is a big spot where generalization (e.g. consequentialist generalization) comes in). Could you point me towards a more detailed writeup / discussion of what’s meant by HCH being a relevantly myopic objective, one that responds to the objection that, well, its output does nevertheless get right answers to questions about far-reaching consequences?
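(For concreteness, here is a minimal sketch of the reading under which “imitate HCH” is a myopic objective: the training signal is a function of the single question/answer pair in front of the model and nothing else. Everything below, including the toy hch_answer oracle, the toy imitator, and the 0/1 loss, is an illustrative stand-in of mine, not anything from the post.)

```python
# Minimal sketch, not from the post: the "myopic objective" reading of
# "only trained to predict the answer given by HCH to the question in front of it".
# The oracle, model, and loss here are toy stand-ins.

def hch_answer(question: str) -> str:
    """Stand-in for the (expensive, internally non-myopic) HCH process."""
    return "HCH's answer to: " + question

def model_answer(params: dict, question: str) -> str:
    """Stand-in for the learned imitator."""
    return params.get(question, "")

def myopic_imitation_loss(params: dict, question: str) -> float:
    # The loss depends only on this one question/answer pair; there is no term
    # rewarding the model for influencing future questions or future episodes.
    return 0.0 if model_answer(params, question) == hch_answer(question) else 1.0
```

(The disagreement here is then, roughly, whether “myopic” should describe this per-question loss, or the reasoning the imitator has to do in order to drive it down.)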
myopic objective that might be aligned and competitive instead
I’m interested in whether objectives can be aligned and competitive and myopic. That still seems like the cat-belling step.
If that’s how you want to define myopia/non-myopia then sure, you’re welcome to call an HCH imitator non-myopic. But that’s not the version of myopia that I’m working with/care about.
From point 1. of the OP:
I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive
My best current guess is that you’re saying something like, if the agent is myopic, that means it’s only trained to try to solve the problem right in front of it; so it’s not trained to hide its reasoning in order to game the system across multiple episodes? What’s the argument that this implies non-deceptiveness? (Link would be fine.) I was trying to say, if it’s predicting a far-consequences-understander, it has to do far-consequences-understanding, therefore it’s able to do far-consequences-understanding, therefore it’s (1) liable to, by default, in effect have values it pursues over far-consequences, and (2) is able to effectively pursue those values without further ado. The case for (2) is more clear, since arguendo it is able to do far-consequences-understanding. Maybe the case for (1) needs to be made.
You say that in general, recursion can allow myopic agents to do well on non-myopic objectives; this sure sounds like making a kind of assemblage in order to get non-myopicness.
The recursion there is only in the objective, not in the model itself. So there’s no assemblage anywhere other than in the thing that the model is trying to imitate.
HCH is a myopic objective?
Maybe it’ll be more clear to you if you just replace “imitate HCH” with “imitate Evan” or something like that—of course that’s less likely to result in a model that’s capable enough to do anything interesting, but it has the exact same sorts of problems in terms of getting myopia to work.
My best current guess is that you’re saying something like, if the agent is myopic, that means it’s only trained to try to solve the problem right in front of it; so it’s not trained to hide its reasoning in order to game the system across multiple episodes?
We’re just talking about step (1), so we’re not talking about training at all right now. We’re just trying to figure out what a natural class of agents would be that isn’t deceptive.
if it’s predicting a far-consequences-understander, it has to do far-consequences-understanding, therefore it’s able to do far-consequences-understanding
Agree. Any competitive myopic agent would have to be able to fully understand exactly how to do long-term non-myopic reasoning.
therefore it’s (1) liable to, by default, effectively have values it pursues over far-consequences
Agree by default but not by necessity. For step (1) we’re not trying to figure out what would happen by default if you trained a model on something, we’re just trying to understand what it might look like for an agent to be myopic in a natural way.
just replace “imitate HCH” with “imitate Evan” or something like that
So these are both training-myopic, meaning they are both being trained only to do the task right in front of them, and aren’t (directly) rewarded for behavior that sacrifices reward now for reward in future episodes. Neither seems objective-myopic, meaning both of their objective functions are computed (seemingly necessarily) using far-reaching-consequences-understanding. Neither seems behavior-myopic, meaning both of them would successfully target far-reaching consequences (by assumption of being competitive?). I think if you’re either objective-non-myopic or behavior-non-myopic, then by default you’re thought-non-myopic (meaning you in fact use far-reaching-consequences-understanding in your reasoning). I think if you’re thought-non-myopic, then by default you’re values-non-myopic, meaning you’re pursuing specific far-reaching consequences. I think if you’re values-non-myopic, then you’re almost certainly deceptive, by strong default.
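(A toy sketch of the taxonomy and the claimed default-implication chain in the paragraph above; the field names and derived flags just mirror the terms used there, and the encoding itself is mine, not from the original discussion.)

```python
# Toy encoding of the myopia taxonomy above and its "by (strong) default" chain.
from dataclasses import dataclass

@dataclass
class MyopiaProfile:
    training_myopic: bool   # trained only on the task right in front of it
    objective_myopic: bool  # objective computable without far-reaching-consequences-understanding
    behavior_myopic: bool   # outputs don't successfully target far-reaching consequences

def default_inferences(p: MyopiaProfile) -> dict:
    """The 'by (strong) default' implications claimed above, as a propagation rule."""
    thought_non_myopic = (not p.objective_myopic) or (not p.behavior_myopic)
    values_non_myopic = thought_non_myopic   # "by default"
    deceptive = values_non_myopic            # "by strong default"
    return {"thought_non_myopic": thought_non_myopic,
            "values_non_myopic": values_non_myopic,
            "deceptive": deceptive}

# The HCH-imitator / Evan-imitator as characterized above:
print(default_inferences(MyopiaProfile(training_myopic=True,
                                       objective_myopic=False,
                                       behavior_myopic=False)))
# -> {'thought_non_myopic': True, 'values_non_myopic': True, 'deceptive': True}
```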
We’re just talking about step (1), so we’re not talking about training at all right now. We’re just trying to figure out what a natural class of agents would be that isn’t deceptive.
For step (1) we’re not trying to figure out what would happen by default if you trained a model on something, we’re just trying to understand what it might look like for an agent to be myopic in a natural way.
In step (1) you wrote:
I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive
I think if something happens by default, that’s a kind of naturalness. Maybe I just want to strengthen the claims above to say “by strong default”. In other words, I’m saying it’s a priori very unnatural to have something that’s behavior-non-myopic but thought-myopic, or thought-non-myopic but not deceptive, and overcoming that unnaturalness is a huge hurdle. I would definitely be interested in your positive reasons for thinking this is possible.
I think if you’re values-non-myopic, then you’re almost certainly deceptive, by strong default.
I think it would help if you tried to walk through how a model with the goal of “imitating Evan” ends up acting deceptively. I claim that as long as you have a notion of myopic imitation that rules out failure modes like acausal trade (e.g. LCDT) and Evan will never act deceptively, then such a model will never act deceptively.
Your steps (2)-(4) seem to rely fairly heavily on the naturality of the class described in (1), e.g. because (2) has to recognize (1)s which requires that we can point to (1)s. If by “with the [[sole?]] goal of imitating Evan” you mean that
A. the model is actually really *only* trying to imitate Evan,
B. the model is competent to not accidentally also try to do something else (e.g. because the ways it pursues its goal are themselves malign under distributional shift), and
C. the training process you use will not tip the internal dynamics of the model over into a strategically malign state (there was never any incentive to prevent that from happening any more robustly than just barely enough to get good answers on the training set, and I think we agree that there’s a whole pile of [ability to understand and pursue far-reaching consequences] sitting in the model, making strategically malign states pretty close in model-space for natural metrics),
then yes this would plausibly not be deceptive, but it seems like a very unnatural class. I tried to argue that it’s unnatural in the long paragraph with the different kinds of myopia, where “by (strong) default” = “it would be unnatural to be otherwise”.
Note that (A) and (B) are not actually that hard—e.g. LCDT solves both problems.
Your (C), in my opinion, is where all the action is, and is in fact the hardest part of this whole story—which is what I was trying to say in the original post when I said that (2) was the hard part.
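(For readers unfamiliar with LCDT, which comes up twice above: a rough gloss of the core idea, as I understand it, is that the agent evaluates its decisions in a causal model where other agents, including its own future self, are treated as causally unaffected by the current decision. The sketch below is only an illustration of that gloss, using networkx, a hypothetical "is_agent" node attribute, and a stand-in utility function; it is not the formalization from the LCDT post.)

```python
# Rough illustrative gloss of the LCDT idea, not the actual formalization:
# before evaluating an action, sever the decision's causal influence on every
# node modeled as another agent, so manipulating other agents never looks
# worthwhile. The "is_agent" attribute and utility_fn are stand-ins.

import networkx as nx

def lcdt_prune(causal_graph: nx.DiGraph, decision_node) -> nx.DiGraph:
    pruned = causal_graph.copy()
    downstream = nx.descendants(causal_graph, decision_node) | {decision_node}
    for node, data in causal_graph.nodes(data=True):
        if data.get("is_agent", False) and node != decision_node:
            # Treat this agent as unaffected by anything downstream of the decision.
            for parent in list(pruned.predecessors(node)):
                if parent in downstream:
                    pruned.remove_edge(parent, node)
    return pruned

def lcdt_evaluate(action, causal_graph: nx.DiGraph, decision_node, utility_fn) -> float:
    # Utility is computed in the pruned graph, so deceiving or steering other
    # agents can never appear to pay off from the deciding agent's perspective.
    return utility_fn(lcdt_prune(causal_graph, decision_node), decision_node, action)
```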
Okay, I think I’m getting a little more where you’re coming from? Not sure. Maybe I’ll read the LCDT thing soon (though I’m pretty skeptical of those claims).
(Not sure if it’s useful to say this, but as a meta note, from my perspective the words in the post aren’t pinned down enough to make it at all clear that the hard part is (2) rather than (1); you say “natural” in (1), and I don’t know what you mean by that such that (1) isn’t hard.)
Maybe I’m not emphasizing how unnatural I think (A) is. Like, it’s barely even logically consistent. I know that (A) is logically consistent, for some funny construal of “only trying”, because Evan is a perfect imitation of Evan; and more generally a good WBE could maybe be appropriately construed as not trying to do anything other than imitate Evan; and ideally an FAI could be given an instruction so that it doesn’t, say, have any appreciable impacts other than the impacts of an Evan-imitation. For anything that’s remotely natural and not “shaped” like Evan is “shaped”, I’m not sure it even makes sense to be only trying to imitate Evan; to imitate Evan you have to do a whole lot of stuff, including strategically arranging cognition, reasoning about far-reaching consequences in general, etc., which already constitutes trying to do something other than imitating Evan. When you’re doing consequentialist reasoning, that already puts you very close in algorithm-space to malign strategic thinking, so “consequentialist but not deceptive (hence not malignly consequentialist)” is very unnatural; IMO like half of the whole alignment problem is “get consequentialist reasoning that isn’t consequentialisting towards some random thing”.