Isn’t this necessary for the shutdown safe desideratum?
I don’t remember which desideratum that is, can’t ctrl+f it, and honestly this post is pretty long, so I don’t know. At any rate, I’m not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures—see the ones that couldn’t be simultaneously satisfied until this one did.
Can you give me examples of good low impact plans we couldn’t do without offsetting?
One case where you need ‘offsetting’, as defined in this piece but not necessarily as I would define it: suppose you want to start an intelligent species to live on a single new planet. If you create the species and then do nothing, they will spread to many many planets and do a bunch of crazy stuff, but if you have a stern chat with them after you create them, they’ll realise that staying on their planet is a pretty good idea. In this case, I claim that the correct course of action is to create the species and have a stern chat, not to never create the species. In general, sometimes there are safe plans with unsafe prefixes and that’s fine.
A more funky case that’s sort of outside what you’re trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don’t act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general—how do they interplay with shifting models?)
[EDIT: a more mundane example is that driving on the highway is a situation where suddenly changing your plan to no-ops can cause literal impacts in an unsafe way, nevertheless driving competently is not a high-impact plan]
Can you expand on why [normality and the world where the AI is acting] are distinct in your view?
Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn’t been yet.
The attainable utility calculation seems to take care of this by considering the value of the best plan from that vantage point
I don’t understand: the attainable utility calculation (by which I assume you mean the definition of Qu) involves a utility function being called on a sub-history. The thing I am looking for is how to define a utility function on a subhistory when you’re only specifying the value of that function on full histories, or alternatively what info needs to be specified for that to be well defined.
Couldn’t you equally design a species that won’t spread to begin with?
A more funky case that’s sort of outside what you’re trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don’t act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general—how do they interplay with shifting models?)
I think the crux here is that I think that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting. In a nutshell, my view is that low impact should be with respect to what the agent is doing, and not something enforced on the environment. How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment?
Do note that intent verification doesn’t seem to screen off what you might call “natural” ex ante offsetting, so I don’t really see what we’re missing out on still.
Edit: The driving example is a classic point brought up, totally valid. As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification.
I think it’s true that there are situations in which we would want an offset to happen, but it seems to me like we can just avoid problematic situations which require that to begin with. If the agent makes a mistake, we can shut it off and then we do the offsetting. I mentioned model accuracy in the open questions; I think the jury is definitely still out on that.
Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn’t been yet.
Oh, so it’s an issue with a potential shift. But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment?
how to define a utility function on a subhistory when you’re only specifying the value of that function on full histories
Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.
This comment is very scattered, I’ve tried to group it into two sections for reading convenience.
Desiderata of impact regularisation techniques
Couldn’t you equally design a species that won’t spread to begin with?
Well, maybe you could, maybe you couldn’t. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn’t.
I think that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting.
I disagree with this, in that I don’t see how it connects to the real world reason that we would like low impact AI. It does seem to be the crux.
How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment?
I don’t know, and it doesn’t seem obvious to me that any sensible impact measure is possible. In fact, during the composition of this comment, I’ve become more pessimistic about the prospects for one. I think that this might be related to the crux above?
Do note that intent verification doesn’t seem to screen off what you might call “natural” ex ante offsetting, so I don’t really see what we’re missing out on still.
I don’t really understand what you mean here, could you spend two more sentences on it?
As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification.
This is really interesting, and suggests to me that in general this agent might act by creating a successor that carries out a globally-low-impact plan, and then performing the null action thereafter. Note that this successor agent wouldn’t be as interruptible as the original agent, which I guess is somewhat unfortunate.
Technical discussion of AUP
But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment?
It would not, but it’s brittle to accidents that cause them to diverge. These accidents include both ones caused by the agent, e.g. during the learning process, and ones not caused by the agent, e.g. a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn’t allowed to stop it because that would be too high impact.
Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.
This causes pretty weird behaviour. Imagine an agent’s goal is to do a dance for the first action of their life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise. When AUP thinks about how this goal’s ability to be satisfied changes over time at the end of the first timestep, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the history that is fed into the utility function when computing the relevant Q-value.
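To make the worry concrete, here is a toy sketch (my own construction with made-up action names, nothing from the post) of a utility defined on full histories being reused on the tail subhistory in an attainable-utility-style calculation:

```python
from itertools import product

ACTIONS = ["dance", "noop"]

def u_dance(history):
    # Defined on full histories: 1 iff the first action of the agent's life is a dance.
    # `history` is a list of (action, observation) pairs.
    return 1.0 if history and history[0][0] == "dance" else 0.0

def attainable(u, horizon):
    # Crude stand-in for a Q-value: the best utility over any continuation of the
    # given length, ignoring observations and dynamics entirely.
    return max(u([(a, None) for a in plan]) for plan in product(ACTIONS, repeat=horizon))

# At the end of the first timestep, the "history" handed to u_dance is the tail
# subhistory, so its attainable value is 1 as long as the agent can still dance
# on its *next* action, even though the goal was about the first action ever.
print(attainable(u_dance, horizon=2))  # -> 1.0
```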
Well, maybe you could, maybe you couldn’t. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn’t.
So it seems that on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first method that comes to mind. That is, people say “the measure doesn’t let us do X in this way!”, and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.
The point of the impact measure isn’t to choose the exact plan that we would use, but rather to disallow overly-impactful plans and allow us to complete a range of goals in some low-impact way. I don’t think we should care about which way that is, as long as it isn’t dangerous.
But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects.
[note: this supposes that there aren’t undesirable pseudo-ways of reaching the goal before we reach the outcome in mind. This seems plausible due to the structuring of the measure, but shouldn’t be taken for granted.]
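(To sketch what I mean by the N-incrementing argument, here is rough pseudocode with invented names, not the actual implementation: the impact budget is raised until the goal becomes reachable at all, so the first admissible plan carries roughly only the impact the goal requires.)

```python
# Schematic sketch of the N-incrementing argument (invented names, not the
# actual AUP algorithm): raise the impact budget until the goal is reachable,
# so the first admissible plan has roughly the minimal necessary impact.

def first_admissible_plan(candidate_plans, achieves_goal, impact,
                          budget_step=1.0, max_budget=1000.0):
    budget = budget_step
    while budget <= max_budget:
        admissible = [p for p in candidate_plans
                      if achieves_goal(p) and impact(p) <= budget]
        if admissible:
            # First budget at which the goal is reachable at all: any plan here
            # exceeds the minimum necessary impact by at most budget_step.
            return min(admissible, key=impact)
        budget += budget_step
    return None  # goal not reachable within the allowed impact range
```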
Analogously, I am saying that we can seemingly get all the low-impact results we need without offsetting using AUP. You point out specific plans which would be allowed if we could offset in a reasonable way. I say that that problem seems really hard, but it looks like my method lets us get effectively the same thing done without needing to figure that out.
I don’t know, and it doesn’t seem obvious to me that any sensible impact measure is possible.
I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact according to our exact intuitions would be better than one that’s conservative), instead of realizing AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, acausally cooperating, acting to survive, etc). I’m cautiously hopeful that these properties are going to open really important doors.
“Do note that intent verification doesn’t seem to screen off what you might call “natural” ex ante offsetting, so I don’t really see what we’re missing out on still.”
I don’t really understand what you mean here, could you spend two more sentences on it?
It allows plans like the chauffeur example, while seemingly disallowing weird cheats.
Technical discussion of AUP
These accidents include both ones caused by the agent, e.g. during the learning process
Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk.
This seems to more generally just be a problem with not knowing what you don’t know, and the method is compatible with whatever solutions we do come up with. Furthermore, instead of needing to know whether effects are bad, the agent only needs to know whether they are big (I just realized this now!). This is already an improvement on the state-of-the-art for safe learning, as I understand it. That is, AUP becomes far less likely to do things as soon as it realizes that their consequences are big—instead of waiting for us to tell it that the consequences are bad.
a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn’t allowed to stop it because that would be too high impact.
Because I claim this is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.
This causes pretty weird behaviour. Imagine an agent’s goal is to do a dance for the first action of their life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise. When AUP thinks about how this goal’s ability to be satisfied changes over time at the end of the first timestep, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the history that is fed into the utility function when computing the relevant Q-value.
Why is this weird behavior? If it has a dance action, it should always be able to execute that action. If we’re actually using this, the agent retains the dance action, and the goal then turns into a pure measure of power (u_1: can it remain activated for the remainder of the attainable horizon, in order to ensure it retains the utility rating of 1?), which I have argued tracks what we want.
So it seems to me like on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first way that comes to mind. That is, people say “the measure doesn’t let us do X in this way!”, and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.
So there’s a narrow answer and a broad answer here. The narrow answer is that if you tell me that AUP won’t allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X’ that are pretty similar to X along the relevant dimension that made me bring up X. This is a substantial, but not impossible, bar to meet.
The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed, and then check if it is or is not allowed. This lets me check if AUP is identical to my internal sense of whether things obviously should or should not be allowed. If it is, then great, and if it’s not, then I might worry that it will run into substantial trouble in complicated scenarios that I can’t really picture. It’s a nice method of analysis because it requires few assumptions about what things are possible in what environments (compared to “look at a bunch of environments and see if the plans AUP comes up with should be allowed”) and minimal philosophising (compared to “meditate on the equations and see if they’re analytically identical to how I feel impact should be defined”).
[EDIT: added content to this section]
Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.
Firstly, saving humanity from natural disasters doesn’t at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it’s plausibly in a different natural reference class than causing natural disasters. Secondly, your description of a use case for a low-impact agent is interesting and one that I hadn’t thought of before, but I still would hope that they could be used in a wider range of settings (basically, whenever I’m worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).
if you tell me that AUP won’t allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X’ that are pretty similar to X along the relevant dimension that made me bring up X.
I think there is an argument for this whenever we have “it won’t X because of the anti-survival incentive and personal risk”: “then it builds a narrow subagent to do X”.
The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed,
As I said in my other comment, I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?
Firstly, saving humanity from natural disasters doesn’t at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it’s plausibly in a different natural reference class than causing natural disasters.
Why would that be so? That doesn’t seem value agnostic. I do think that the approval incentives help us implicitly draw this boundary, as I mentioned in the other comment.
I still would hope that they could be used in a wider range of settings (basically, whenever I’m worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).
I agree. I’m not saying that the method won’t work for these, to clarify.
I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?
Two points:
Firstly, the first section of this comment by Rohin models my opinions quite well, which is why some sort of asymmetry bothers me. Another angle on this is that I think it’s going to be non-trivial to relax an impact measure to allow enough low-impact plans without also allowing a bunch of high-impact plans.
Secondly, here and in other places I get the sense that you want comments to be about the best successor theory to AUP as outlined here. I think that what this best successor theory is like is an important question when figuring out whether you have a good line of research going or not. That being said, I have no idea what the best successor theory is like. All I know is what’s in this post, and I’m much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that’s what I’m primarily doing.
Firstly, saving humanity from natural disasters… seems like it’s plausibly in a different natural reference class than causing natural disasters.
Why would that be so? That doesn’t seem value agnostic.
It seems value agnostic to me because it can be generated from the urge ‘keep the world basically like how it used to be’.
I have no idea what the best successor theory is like. All I know is what’s in this post, and I’m much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that’s what I’m primarily doing.
But in this same comment, you also say
I think it’s going to be non-trivial to relax an impact measure
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
I’m making my predictions based off of my experience working with the method. The reason that many of the flaws are on the list is not because I don’t think I could find a way around them, but rather because I’m one person with a limited amount of time. It will probably turn out that some of them are non-trivial, but pre-judging them doesn’t seem very appropriate.
I indeed want people to share their ideas for improving the measure. I also welcome questioning specific problems or pointing out new ones I hadn’t noticed. However, arguing whether certain problems subjectively seem hard or maybe insurmountable isn’t necessarily helpful at this point in time. As you said in another comment,
I’m not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures—see the ones that couldn’t be simultaneously satisfied until this one did.
It seems value agnostic to me because it can be generated from the urge ‘keep the world basically like how it used to be’.
True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what “kinds of things” can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.
Primarily does not mean exclusively, and lack of confidence in implications between desiderata doesn’t imply lack of confidence in opinions about how to modify impact measures, which itself doesn’t imply lack of opinions about how to modify impact measures.
People keep saying things like [‘it’s non-trivial to relax impact measures’], and it might be true. But on what data are we basing this?
This is according to my intuitions about what theories do what things, which have had as input a bunch of learning mathematics, reading about algorithms in AI, and thinking about impact measures. This isn’t a rigorous argument, or even necessarily an extremely reliable method of ascertaining truth (I’m probably quite sub-optimal in converting experience into intuitions), but it’s still my impulse.
True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what “kinds of things” can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.
My sense is that we agree that this looks hard but shouldn’t be dismissed as impossible.
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
What? I’ve never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I’m quite certain it can’t be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well).
I’m clearly not saying you can never predict things before trying them, I’m saying that I haven’t seen evidence that this particular problem is more or less challenging than dozens of similar-feeling issues I handled while constructing AUP.
That is, people say “the measure doesn’t let us do X in this way!”, and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that.
Going back to this, what is the way you propose the species-creating goal be done? Say, imposing the constraint that the species has got to be basically just human (because we like humans) and you don’t get to program their DNA in advance? My guess at your answer is “create a sub-agent that reliably just does the stern talking-to in the way the original agent would”, but I’m not certain.
My real answer: we probably shouldn’t? Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought. (See the cheese post, can’t find it)
and you don’t get to program their DNA in advance?
Uh, why not?
Make humans that will predictably end up deciding not to colonize the galaxy or build superintelligences.
Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought.
I guess I’m more comfortable with procreation than you are :)
I imposed the “you don’t get to program their DNA in advance” constraint since it seems plausible to me that if you want to create a new colony of actual humans, you don’t have sufficient degrees of freedom to make them actually human-like but also docile enough.
You could imagine a similar task of “build a rather powerful AI system that is transparent and able to be monitored”, where perhaps ongoing supervision is required, but that’s not an onerous burden.
But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects.
This is only convincing to the extent that I buy into AUP’s notion of impact. My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact) and is not analytically identical to the core thing that I care about (human ability to achieve goals that humans plausibly care about), but may well turn out to be fine if I considered it for a long time.
I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact according to our exact intuitions would be better than one that’s conservative), instead of realizing AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, acausally cooperating, acting to survive, etc). I’m cautiously hopeful that these properties are going to open really important doors.
I agree that the nice properties of AUP are pretty nice and demonstrate a significant advance in the state of the art for impact regularisation, and did indeed put that in my first bullet point of what I thought of AUP, although I guess I didn’t have much to say about it.
Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk.
This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways (see sibling comment) and (b) even with a good model, presumably if it’s run for a long time there might be at least one error, and I’m inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time. However, I think the stronger objection here is the ‘natural disaster’ category (which might include an actuator in the AUP agent going haywire or any number of things).
Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.
Note that AUP would not even notify humans that such a natural disaster was happening if it thought that humans would solve the natural disaster iff they were notified. In general, AFAICT, if you have a natural-disaster-warning AUP agent, then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster (I think even intent verification doesn’t prevent this, if you imagine that causing a natural disaster is an unforeseen maximum of the agent’s utility function). This seems like a failure mode that impact regularisation techniques ought to prevent. I also have a different reaction to this section in the sibling comment.
My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact)
I think it should be quite possible for us to de-sketchify the impact measure in the ways you pointed out. Up to now, I focused more on ensuring that there aren’t errors of the other type: where high impact plans sneak through as low impact. I’m currently not aware of any, although that isn’t to say they don’t exist.
Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic. I don’t think it unlikely that there exist better, cleaner formulations of what I provided. Perhaps they somehow don’t have the bothersome false positives you’ve pointed out. After all, compared to many folks in the community, I’m fairly mathematically inexperienced, and have only been working on this for a relatively short amount of time.
This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways
What is “this” here (for a)?
I’m inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time
But AUP’s plans are shutdown-safe? I think I misunderstand.
then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster
I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters (unless it also wanted to save us from these, but it seems like this would only happen for higher impact levels and would be discouraged by approval incentives).
In general, I expect AUP to also work for disaster prevention, as long as its own survival isn’t affected. One complication is that we would have to allow it to remain on, even if it didn’t save us from disasters, but shut it off if it caused any. I think that’s pretty reasonable, as we expect our low impact agents to not do anything sometimes.
Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic.
To be frank, although I do like the fact that there’s a nice concrete candidate definition of impact, I am not excited by it by more than a factor of two over other candidate impact definitions, and would not say that it encapsulates what I think impact is.
… (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways
What is “this” here (for a)?
“This” is “upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline”, and it’s what I mean by “ungracefully failing if the protocol stops being followed at any one point in time”.
then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster
I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters
Regarding whether AUP agents would prevent natural disasters: AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent’s ability to achieve a wide variety of goals.
Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn’t taking as an assumption that you were making.
Regarding the lack of incentive to cause disasters: in my head, the point of impact regularisation techniques is to stop agents from doing something crazy in cases where doing something crazy is an unforeseen convenient way for the agent to achieve its objective. As such, I consider it fair game to consider cases where there is an unforeseen incentive to do crazy things, if the argument generalises over a wide variety of craziness, which I think this one does sort of OK.
“This” is “upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline”, and it’s what I mean by “ungracefully failing if the protocol stops being followed at any one point in time”.
Huh? So if the safety measure stops working for some reason, it’s no longer safe? But if it does make a mistake, it’s more inclined to allow us to shut it down. Compare this to an offsetting approach, where it can keep doing and undoing things to an arbitrarily-large degree.
AUP agent does a big thing and bites the penalty. If that big thing was bad, we shut it down. Why would you instead prefer that it keep doing things to make up for it, when its model wasn’t even good enough to predict we wouldn’t like it?
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”. While these are problems, they aren’t for low impact to resolve, but the approach also happens to help anyways.
AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent’s ability to achieve a wide variety of goals.
This is true. It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
[Note that we could Bayes-update off of its canary in general, if we trust its model to an extent. This also deserves exploration as a binary “extinction?” oracle, with the sequential deployment of agents allowing mitigation of specific model flaws.]
Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn’t taking as an assumption that you were making.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny (although this is possible; I have a rough intuition that the strength of approval will override the fairly high likelihood of getting away with it). Making yourself purposefully opaque seems convergent.
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”.
I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human will do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it’s unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you’re more likely to act incorrectly (both in the sense of “higher probability of incorrect actions” and “more probability of more extremely incorrect answers”), and so you go to more unusual states, etc. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story that I’ve heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it’s bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution.
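Here is a toy numerical illustration of that instability story (the dynamics and numbers are invented purely for illustration): a cloned policy that only knows what to do near the demonstrated states drifts once it leaves them, while a reward-based policy keeps correcting.

```python
# Toy illustration (invented dynamics/numbers) of error compounding in
# behavioural cloning versus a reward-based policy on a 1-D "stay near 0" task.
import random

random.seed(0)

def bc_policy(x):
    # The clone only knows the demonstrator's behaviour near demonstrated
    # states (|x| < 0.5); off-distribution its action is essentially random.
    return -0.5 * x if abs(x) < 0.5 else random.uniform(-1.0, 1.0)

def reward_policy(x):
    # Q-values derived from reward = -|x| still point homeward far from the data.
    return -0.5 * x

def worst_deviation(policy, steps=200, noise=0.3):
    x, worst = 0.0, 0.0
    for _ in range(steps):
        x = x + policy(x) + random.uniform(-noise, noise)
        worst = max(worst, abs(x))
    return worst

print("behavioural cloning:", max(worst_deviation(bc_policy) for _ in range(20)))
print("reward-based:       ", max(worst_deviation(reward_policy) for _ in range(20)))
# Typically the cloned policy's worst deviation is several times larger: once it
# leaves the demonstrated region it has no tendency to return, while the
# reward-based policy never strays beyond the noise band.
```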
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them. This is in contrast to methods that measure divergence to the starting state, or what the world would be like given that the agent had only performed no-ops after the starting state, resulting in a tendency to mitigate these ‘errors’. At any rate, even if no other method mitigated these ‘errors’, I would still want them to.
It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
I wasn’t necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents.
We also aren’t assuming the machinery is so opaque that it has extremely negligible chance of being caught, even under scrutiny.
My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important, but not something that I would assume without mentioning it.
[*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability.
Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining ability to act, which could also be negative.
Edit:
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigates them.
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. If its model is probably still incorrect, then even if it had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I presently disagree that agent mitigation is the desirable behavior after model errors.
Perhaps we could have it recalculate past impacts?
Yeah, I have a sense that having the penalty be over the actual history and action versus the plan of no-ops since birth will resolve this issue.
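To spell out the contrast I have in mind schematically (invented names and a made-up Q signature, not the actual AUP equations):

```python
# Schematic contrast (invented names, not the actual AUP equations) between
# penalising deviation from "no-ops from this point on" versus from
# "no-ops ever since activation". `history` is a list of (action, observation) pairs.

def penalty_stepwise(Q, utilities, history, action, noop):
    # Baseline branches at the current moment: whatever already happened is
    # baked into `history`, so past deviations from normality are preserved.
    return sum(abs(Q(u, history, action) - Q(u, history, noop)) for u in utilities)

def penalty_since_activation(Q, utilities, history, action, noop):
    # Baseline is the counterfactual in which the agent had only ever no-op'd:
    # deviations accumulated earlier in `history` still register, so comparing
    # against this baseline can restore pressure back toward the normality of
    # the activation moment.
    inaction_history = [(noop, None)] * len(history)
    return sum(abs(Q(u, history, action) - Q(u, inaction_history, noop)) for u in utilities)
```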
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess.
I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don’t understand what’s happened. In this scenario, I think you should see the preservation of ‘errors’ in the sense of the agent’s future under no-ops differing from ‘normality’.
If ‘errors’ happen due to a mismatch between the model and reality, I agree that the agent shouldn’t try to fix them with the bits of the model that are broken. However, I just don’t think that that describes many of the things that cause ‘errors’: those can be foreseen natural events (e.g. San Andreas earthquake if you’re good at predicting earthquake), unlikely but possible natural events (e.g. San Andreas earthquake if you’re not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.
I don’t remember which desideratum that is, can’t ctrl+f it, and honestly this post is pretty long, so I don’t know. At any rate, I’m not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures—see the ones that couldn’t be simultaneously satisfied until this one did.
One case where you need ‘offsetting’, as defined in this piece but not necessarily as I would define it: suppose you want to start an intelligent species to live on a single new planet. If you create the species and then do nothing, they will spread to many many planets and do a bunch of crazy stuff, but if you have a stern chat with them after you create them, they’ll realise that staying on their planet is a pretty good idea. In this case, I claim that the correct course of action is to create the species and have a stern chat, not to never create the species. In general, sometimes there are safe plans with unsafe prefixes and that’s fine.
A more funky case that’s sort of outside what you’re trying to solve is when your model improves over time, so that something that you thought would have low impact will actually have high impact in the future if you don’t act now to prevent it. (this actually provokes an interesting desideratum for impact measures in general—how do they interplay with shifting models?)
[EDIT: a more mundane example is that driving on the highway is a situation where suddenly changing your plan to no-ops can cause literal impacts in an unsafe way, nevertheless driving competently is not a high-impact plan]
Normality is an abstraction over things like the actual present moment when I type this comment. The world where the AI is acting has the potential to be quite a different one, especially if the AI accidentally did something unsafe that could be fixed but hasn’t been yet.
I don’t understand: the attainable utility calculation (by which I assume you mean the definition of Qu) involves a utility function being called on a sub-history. The thing I am looking for is how to define a utility function on a subhistory when you’re only specifying the value of that function on full histories, or alternatively what info needs to be specified for that to be well defined.
Couldn’t you equally design a species that won’t spread to begin with?
I think the crux here is that I think that a low impact agent should make plans which are low impact both in parts and in whole, acting with respect to the present moment to the best of its knowledge, avoiding value judgments about what should be offset by not offsetting. In a nutshell, my view is that low impact should be with respect to what the agent is doing, and not something enforced on the environment. How does a safe pro-offsetting impact measure decide what to offset (including pre-activation effects) without requiring value judgment?
Do note that intent verification doesn’t seem to screen off what you might call “natural” ex ante offsetting, so I don’t really see what we’re missing out on still.
Edit: The driving example is a classic point brought up, totally valid. As I mentioned elsewhere, a chauffeur-u_A could construct a self-driving car whose activation would require only a single action, and this should pass (the weaker form of) intent verification.
I think it’s in the true there are situations in which we would want an offset to happen, but it seems to me like we can just avoid problematic situations which require that to begin with. If the agent makes a mistake, we can shut it off and then we do the offsetting. I mentioned model accuracy in open questions, I think the jury is definitely still out on that.
Oh, so it’s an issue with a potential shift. But why would AUP allow the agent to stray (more than its budget) away from the normality of its activation moment?
Subhistories beginning with an action and ending with an observation are also histories, so their value is already specified.
This comment is very scattered, I’ve tried to group it into two sections for reading convenience.
Desiderata of impact regularisation techniques
Well, maybe you could, maybe you couldn’t. I think that to work well, an impact regularising scheme should be able to handle worlds where you couldn’t.
I disagree with this, in that I don’t see how it connects to the real world reason that we would like low impact AI. It does seem to be the crux.
I don’t know, and it doesn’t seem obvious to me that any sensible impact measure is possible. In fact, during the composition of this comment, I’ve become more pessimistic about the prospects for one. I think that this might be related to the crux above?
I don’t really understand what you mean here, could you spend two more sentences on it?
This is really interesting, and suggests to me that in general this agent might act by creating a successor that carries out a globally-low-impact plan, and then performing the null action thereafter. Note that this successor agent wouldn’t be as interruptible as the original agent, which I guess is somewhat unfortunate.
Technical discussion of AUP
It would not, but it’s brittle to accidents that cause them to diverge. These accidents both include ones caused by the agent e.g. during the learning process; and ones not caused by the agent e.g. a natural disaster suddenly occurs that is on course to wipe out humans, and the AUP agent isn’t allowed to stop it because that would be too high impact.
This causes pretty weird behaviour. Imagine an agent’s goal is to do a dance for the first action of their life, and then do nothing. Then, for any history, the utility function is 1 if that history starts with a dance and 0 otherwise. When AUP thinks about how this goal’s ability to be satisfied changes over time at the end of the first timestep, it will imagine that all that matters is whether the agent can dance on the second timestep, since that action is the first action in the history that is fed into the utility function when computing the relevant Q-value.
Desiderata of impact regularisation techniques
So it seems that on one hand we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to be insisting that it follow the first method that comes to mind. That is, people say “the measure doesn’t let us do X in this way!”, and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.
The point of the impact measure isn’t to choose the exact plan that we would use, but rather to disallow overly-impactful plans and allow us to complete a range of goals in some low-impact way. I don’t think we should care about which way that is, as long as it isn’t dangerous.
But perhaps I’m being unreasonable, and there are some hypothetical worlds and goals for which this argument doesn’t work. Here’s why I think the method is generally sufficient: suppose that the objective cannot be completed at all without doing some high-impact plan. Then by N-incrementing, the first plan that reaches the goal will be the minimal plan that has this necessary impact, without the extra baggage of unnecessary, undesirable effects.
[note: this supposes that there aren’t undesirable pseudo-ways of reaching the goal before we reach the outcome in mind. This seems plausible due to the structuring of the measure, but shouldn’t be taken for granted.]
Analogously, I am saying that we can seemingly get all the low-impact results we need without offsetting using AUP. You point out specific plans which would be allowed if we could offset in a reasonable way. I say that that problem seems really hard, but it looks like my method lets us get effectively the same thing done without needing to figure that out.
I’m mostly confused because there’s substantial focus on the fact AUP penalizes specific plans (although I definitely agree that some hypothetical measure which does assign impact according to our exact intuitions would be better than one that’s conservative), instead of realizing AUP can seemingly do whatever we need in some way (for which I think I give a pretty decent argument above), and also has nice properties to work with in general (like seemingly not taking off, acausally cooperating, acting to survive, etc). I’m cautiously hopeful that these properties are going to open really important doors.
It allows plans like the chauffeur example, while seemingly disallowing weird cheats.
Technical discussion of AUP
Yes, but I think this can be fixed by just not allowing dumb agents near really high impact opportunities. By the time that they would be able to purposefully construct a plan that is high impact to better pursue their goals, they already (by supposition) have enough model richness to plot the consequences, so I don’t see how this is a non-trivial risk.
This seems to more generally just be a problem with not knowing what you don’t know, and the method is compatible with whatever solutions we do come up with. Furthermore, instead of needing to know whether effects are bad, the agent only needs to know whether they are big (I just realized this now!). This is already an improvement on the state-of-the-art for safe learning, as I understand it. That is, AUP becomes far less likely to do things as soon as it realizes that their consequences are big—instead of waiting for us to tell it that the consequences are bad.
Because I claim this is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.
Why is this weird behavior? If it has a dance action, it should always be able to execute this action? It retains the dance action, if we’re actually using this, and then turns into a pure measure of power (u_1 - can it remain activated for the remainder of the attainable horizon, in order to ensure it retains the 1 utility rating?), which I have argued tracks what we want.
Desiderata of impact regularisation techniques
So there’s a narrow answer and a broad answer here. The narrow answer is that if you tell me that AUP won’t allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X’ that are pretty similar to X along the relevant dimension that made me bring up X. This is a substantial, but not impossible, bar to meet.
The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed, and then check if it is or is not allowed. This lets me check if AUP is identical to my internal sense of whether things obviously should or should not be allowed. If it is, then great, and if it’s not, then I might worry that it will run into substantial trouble in complicated scenarios that I can’t really picture. It’s a nice method of analysis because it requires few assumptions about what things are possible in what environments (compared to “look at a bunch of environments and see if the plans AUP comes up with should be allowed”) and minimal philosophising (compared to “meditate on the equations and see if they’re analytically identical to how I feel impact should be defined”).
[EDIT: added content to this section]
Firstly, saving humanity from natural disasters doesn’t at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it’s plausibly in a different natural reference class than causing natural disasters. Secondly, your description of a use case for a low-impact agent is interesting and one that I hadn’t thought of before, but I still would hope that they could be used in a wider range of settings (basically, whenever I’m worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).
I think there is an argument for this whenever we have “it won’t X because anti-survival incentive incentive and personal risk”: “then it builds a narrow subagent to do X”.
As I said in my other comment, I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?
Why would that be so? That doesn’t seem value agnostic. I do think that the approval incentives help us implicitly draw this boundary, as I mentioned in the other comment.
I agree. I’m not saying that the method won’t work for these, to clarify.
Two points:
Firstly, the first section of this comment by Rohin models my opinions quite well, which is why some sort of asymmetry bothers me. Another angle on this is that I think it’s going to be non-trivial to relax an impact measure to allow enough low-impact plans without also allowing a bunch of high-impact plans.
Secondly, here and in other places I get the sense that you want comments to be about the best successor theory to AUP as outlined here. I think that what this best successor theory is like is an important one when figuring out whether you have a good line of research going or not. That being said, I have no idea what the best successor theory is like. All I know is what’s in this post, and I’m much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that’s what I’m primarily doing.
It seems value agnostic to me because it can be generated from the urge ‘keep the world basically like how it used to be’.
But in this same comment, you also say
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
I’m making my predictions based off of my experience working with the method. The reason that many of the flaws are on the list is not because I don’t think I could find a way around them, but rather because I’m one person with a limited amount of time. It will probably turn out that some of them are non-trivial, but pre-judging them doesn’t seem very appropriate.
I indeed want people to share their ideas for improving the measure. I also welcome questioning specific problems or pointing out new ones I hadn’t noticed. However, arguing whether certain problems subjectively seem hard or maybe insurmountable isn’t necessarily helpful at this point in time. As you said in another comment,
.
True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what “kinds of things” can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.
Primarily does not mean exclusively, and lack of confidence in implications between desiderata doesn’t imply lack of confidence in opinions about how to modify impact measures, which itself doesn’t imply lack of opinions about how to modify impact measures.
This is according to my intuitions about what theories do what things, which have had as input a bunch of learning mathematics, reading about algorithms in AI, and thinking about impact measures. This isn’t a rigorous argument, or even necessarily an extremely reliable method of ascertaining truth (I’m probably quite sub-optimal in converting experience into intuitions), but it’s still my impulse.
My sense is that we agree that this looks hard but shouldn’t be dismissed as impossible.
What? I’ve never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I’m quite certain it can’t be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well).
I’m clearly not saying you can never predict things before trying them, I’m saying that I haven’t seen evidence that this particular problem is more or less challenging than dozens of similar-feeling issues I handled while constructing AUP.
Going back to this, how do you propose the species-creation goal be accomplished? Say, under the constraint that the species has to be basically just human (because we like humans) and you don’t get to program their DNA in advance? My guess at your answer is “create a sub-agent that reliably just does the stern talking-to in the way the original agent would”, but I’m not certain.
My real answer: we probably shouldn’t? Creating sentient life that has even slightly different morals seems like a very morally precarious thing to do without significant thought. (See the cheese post, can’t find it)
Uh, why not?
Make humans that will predictably end up deciding not to colonize the galaxy or build superintelligences.
I guess I’m more comfortable with procreation than you are :)
I imposed the “you don’t get to program their DNA in advance” constraint since it seems plausible to me that if you want to create a new colony of actual humans, you don’t have sufficient degrees of freedom to make them actually human-like but also docile enough.
You could imagine a similar task of “build a rather powerful AI system that is transparent and able to be monitored”, where perhaps ongoing supervision is required, but that’s not an onerous burden.
Technical discussion of AUP
This is only convincing to the extent that I buy into AUP’s notion of impact. My general impression is that it seems vaguely sketchy (due to things that I consider low-impact being calculated as high-impact) and is not analytically identical to the core thing that I care about (human ability to achieve goals that humans plausibly care about), but may well turn out to be fine if I considered it for a long time.
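For concreteness, here is roughly the quantity I have in mind when I say “AUP’s notion of impact”: a minimal sketch (my own toy rendering for discussion, not the exact definition from the post, and eliding how each Q_u is computed) in which impact is the total shift in attainable utility, across a set of auxiliary utility functions, between taking an action and taking the no-op.

```python
# Minimal sketch of an AUP-style penalty (toy rendering, not the post's exact definition):
# impact is the summed absolute change in attainable utility, over a set of auxiliary
# utility functions, between taking an action and taking the no-op.
from typing import Callable, Sequence

QFunction = Callable[[object, object], float]  # Q_u(state, action): attainable utility for u


def aup_penalty(
    state: object,
    action: object,
    noop: object,
    attainable_qs: Sequence[QFunction],  # one Q_u per auxiliary utility function u
    scale: float = 1.0,                  # e.g. the penalty of some mild reference action
) -> float:
    """Sum of absolute changes in attainable utility caused by `action` relative to `noop`."""
    raw = sum(abs(q(state, action) - q(state, noop)) for q in attainable_qs)
    return raw / scale
```

The worry above is then about whether actions that intuitively feel low-impact can nonetheless move many of the Q_u values, and so get calculated as high-impact.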
I agree that AUP’s nice properties are indeed nice and demonstrate a significant advance in the state of the art for impact regularisation, and I did put that in my first bullet point of what I thought of AUP, although I guess I didn’t have much to say about it.
This is a good point against worrying about an AUP agent that once acted against the AUP objective, but I have some residual concern both in the form of (a) this feels like wrong behaviour and maybe points to wrongness that manifests in harmful ways (see sibling comment) and (b) even with a good model, presumably if it’s run for a long time there might be at least one error, and I’m inherently worried by a protocol that fails ungracefully if it stops being followed at any one point in time. However, I think the stronger objection here is the ‘natural disaster’ category (which might include an actuator in the AUP agent going haywire or any number of things).
Note that AUP would not even notify humans that such a natural disaster was happening if it thought that humans would solve the natural disaster iff they were notified. In general, AFAICT, if you have a natural-disaster warning AUP agent, then it’s allowed to warn humans of a natural disaster iff it’s allowed to cause a natural disaster (I think even impact verification doesn’t prevent this, if you imagine that causing a natural disaster is an unforeseen maximum of the agent’s utility function). This seems like a failure mode that impact regularisation techniques ought to prevent. I also have a different reaction to this section in the sibling comment.
I think it should be quite possible for us to de-sketchify the impact measure in the ways you pointed out. Up to now, I focused more on ensuring that there aren’t errors of the other type: where high impact plans sneak through as low impact. I’m currently not aware of any, although that isn’t to say they don’t exist.
Also, the fact that we can now talk about precisely what we think impact is with respect to goals makes me more optimistic. I don’t think it unlikely that there exist better, cleaner formulations of what I provided. Perhaps they somehow don’t have the bothersome false positives you’ve pointed out. After all, compared to many folks in the community, I’m fairly mathematically inexperienced, and have only been working on this for a relatively short amount of time.
What is “this” here (for a)?
But AUP’s plans are shutdown-safe? I think I misunderstand.
I actually think that AUP agents would prevent natural disasters which wouldn’t disable the agent itself. Also, your claim is not true, due to approval incentives and the fact that an agent incentivized to save us from disasters wouldn’t get any extra utility by causing disasters (unless it also wanted to save us from these, but it seems like this would only happen for higher impact levels and would be discouraged by approval incentives).
In general, I expect AUP to also work for disaster prevention, as long as its own survival isn’t affected. One complication is that we would have to allow it to remain on, even if it didn’t save us from disasters, but shut it off if it caused any. I think that’s pretty reasonable, as we expect our low impact agents to not do anything sometimes.
To be frank, although I do like the fact that there’s a nice concrete candidate definition of impact, I am not excited by it by more than a factor of two over other candidate impact definitions, and would not say that it encapsulates what I think impact is.
“This” is “upon hypothetically performing some high-impact action, try not to change attainable utilities from that baseline”, and it’s what I mean by “ungracefully failing if the protocol stops being followed at any one point in time”.
Regarding whether AUP agents would prevent natural disasters: AFAICT if humans have any control over the agent, or any ways of making it harder for the agent to achieve a wide variety of goals, then preventing their demise (and presumably the demise of their control over the AUP agent) would be high-AUP-impact, since it would impede the agent’s ability to achieve a wide variety of goals.
Regarding approval incentive: my understanding is that in AUP this only acts to incentivise actual approval (as opposed to hypothetical maximally informed approval). One could cause a natural disaster without humans being aware of it unless there was quite good interpretability, which I wasn’t taking as an assumption that you were making.
Regarding the lack of incentive to cause disasters: in my head, the point of impact regularisation techniques is to stop agents from doing something crazy in cases where doing something crazy is an unforeseen convenient way for the agent to achieve its objective. As such, I consider it fair game to consider cases where there is an unforeseen incentive to do crazy things, if the argument generalises over a wide variety of craziness, which I think this one does sort of OK.
Huh? So if the safety measure stops working for some reason, it’s no longer safe? But if it does make a mistake, it’s more inclined to allow us to shut it down. Compare this to an offsetting approach, where it can keep doing and undoing things to an arbitrarily large degree.
AUP agent does a big thing and bites the penalty. If that big thing was bad, we shut it down. Why would you instead prefer that it keep doing things to make up for it, when its model wasn’t even good enough to predict we wouldn’t like it?
This feels like an odd standard, where you say “but maybe it randomly fails and then doesn’t work”, or “it can’t anticipate things it doesn’t know about”. While these are problems, they aren’t problems for low impact to resolve, but the approach also happens to help anyway.
This is true. It depends what the scale is—I had “remote local disaster” in mind, while you maybe had x-risk.
[Note that we could Bayes-update off of its canary in general, if we trust its model to an extent. This also deserves exploration as a binary “extinction?” oracle, with the sequential deployment of agents allowing mitigation of specific model flaws.]
We also aren’t assuming the machinery is so opaque that it has an extremely negligible chance of being caught, even under scrutiny (although this is possible; I have a rough intuition that the strength of approval will override the fairly high likelihood of getting away with it). Making yourself purposefully opaque seems convergent.
I want to point to the difference between behavioural cloning and reward methods for the problem of learning locomotion for robots. Behavioural cloning is where you learn what a human would do in any situation and act that way, while reward methods take a reward function (either learned or specified) that encourages locomotion and learn to maximise that reward function. An issue with behavioural cloning is that it’s unstable: if you get what the human would do slightly wrong, then you move to a state the human is less likely to be in, so your model gets worse, so you’re more likely to act incorrectly (both in the sense of “higher probability of incorrect actions” and “more probability of more extremely incorrect actions”), and so you go to more unusual states, etc. In contrast, reward methods promise to be more stable, since the Q-values generated by the reward function tend to be more valid even in unusual states. This is the story that I’ve heard for why behavioural cloning techniques are less prominent[*] than reward methods. In general, it’s bad if your machine learning technique amplifies rather than mitigates errors, either during training or during execution.
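To spell out the instability story, here is a toy simulation (my own construction, not actual behavioural cloning or any method from this discussion) of how a small per-step error rate snowballs once the policy leaves the expert’s state distribution, compared with a policy whose error rate stays constant off-distribution, standing in for the reward-based methods above:

```python
# Toy simulation of compounding error: an "expert" always steps right along a corridor;
# a cloned policy occasionally errs, and once it is off the expert's state distribution
# its per-step error rate rises, so small mistakes snowball. A policy whose error rate
# stays constant off-distribution (a stand-in for a reward/Q-value method) degrades
# far more gracefully. All numbers are made up.
import random

N = 20             # corridor length; the expert always reaches position N
EPISODES = 10_000
BASE_ERR = 0.02    # per-step error rate on expert-visited states
OFF_ERR = 0.30     # per-step error rate once off the expert's distribution


def rollout(off_distribution_err: float) -> int:
    pos, off = 0, False
    for _ in range(N):
        err = off_distribution_err if off else BASE_ERR
        if random.random() < err:
            pos = max(0, pos - 1)  # mistake: step left instead of right
            off = True             # now in a state-time pair the expert never visits
        else:
            pos += 1
    return pos


cloned = sum(rollout(OFF_ERR) for _ in range(EPISODES)) / EPISODES
robust = sum(rollout(BASE_ERR) for _ in range(EPISODES)) / EPISODES
print(f"expert: {N}, cloned policy: {cloned:.1f}, constant-error policy: {robust:.1f}")
```

The exact numbers don’t matter; the point is only that one method compounds its mistakes while the other does not.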
My claim here is not quite that AUP amplifies ‘errors’ (in this case, differences between how the world will turn out and normality), but that it preserves them rather than mitigating them. This is in contrast to methods that measure divergence from the starting state, or from what the world would be like if the agent had only performed no-ops after the starting state, which tend to mitigate these ‘errors’. At any rate, even if no other method mitigated these ‘errors’, I would still want them to be mitigated.
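To make the “preserve vs. mitigate” point concrete, here is a toy comparison with made-up numbers (my own construction; a single scalar stands in for whatever divergence the measure computes, and AUP actually compares attainable utilities rather than world-states, but the point about which baseline is used carries over):

```python
# Toy contrast between a stepwise baseline (compare acting now against no-op'ing now)
# and a starting-state baseline (compare against how the world was at the start).
# A past mishap has pushed the world from 0.0 ("normality") to 5.0, and the agent can
# either do nothing or take a repair action that moves the world back to 1.0.

def divergence(x: float, y: float) -> float:
    return abs(x - y)


starting_state = 0.0
world_if_noop_now = 5.0   # doing nothing leaves the existing deviation in place
world_if_repair = 1.0     # the repair action mostly undoes it

# Stepwise baseline: the existing deviation appears on both sides of the comparison,
# so repairing is penalised (4.0) while doing nothing is free (0.0): the deviation
# is preserved.
print(divergence(world_if_repair, world_if_noop_now),
      divergence(world_if_noop_now, world_if_noop_now))

# Starting-state baseline: the existing deviation keeps incurring penalty, so the
# repair action (1.0) looks better than doing nothing (5.0): the deviation is
# mitigated.
print(divergence(world_if_repair, starting_state),
      divergence(world_if_noop_now, starting_state))
```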
I wasn’t necessarily imagining x-risk, but maybe something like an earthquake along the San Andreas fault, disrupting the San Franciscan engineers that would be supervising the agents.
My impression is that most machine learning systems are extremely opaque to currently available analysis tools in the relevant fashion. I think that work to alleviate this opacity is extremely important, but not something that I would assume without mentioning it.
[*] Work is in fact done on behavioural cloning today, but with attempts to increase its stability.
Perhaps we could have it recalculate past impacts? It seems like that could maybe lead to it regaining the ability to act, which could also be negative.
Edit:
But if its model was wrong and it does something that it now infers was bad (because we are now moving to shut it down), its model is still probably incorrect. So it seems like what we want it to do is just nothing, letting us clean up the mess. If its model is probably still incorrect, then even if it had a direction in which it thought it should mitigate, why should we expect this second attempt to be correct? I presently disagree that agent mitigation is the desirable behavior after model errors.
Yeah, I have a sense that having the penalty compare the actual history and action against the plan of no-ops since birth will resolve this issue.
I agree that if it infers that it did something bad because humans are now moving to shut it down, it should probably just do nothing and let us fix things up. However, it might be a while until the humans move to shut it down, if they don’t understand what’s happened. In this scenario, I think you should see the preservation of ‘errors’ in the sense of the agent’s future under no-ops differing from ‘normality’.
If ‘errors’ happen due to a mismatch between the model and reality, I agree that the agent shouldn’t try to fix them with the bits of the model that are broken. However, I just don’t think that that describes many of the things that cause ‘errors’: those can be foreseen natural events (e.g. a San Andreas earthquake if you’re good at predicting earthquakes), unlikely but possible natural events (e.g. a San Andreas earthquake if you’re not good at predicting earthquakes), or unlikely consequences of actions. In these situations, agent mitigation still seems like the right approach to me.