So it seems to me like, on the one hand, we are assuming that the agent can come up with really clever ways of getting around the impact measure. But when it comes to using the impact measure, we seem to insist that the agent follow the first plan that comes to mind. That is, people say “the measure doesn’t let us do X in this way!”, and they’re right. I then point out a way in which X can be done, but people don’t seem to be satisfied with that. This confuses me.
So there’s a narrow answer and a broad answer here. The narrow answer is that if you tell me that AUP won’t allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X’ that are pretty similar to X along the relevant dimension that made me bring up X. This is a substantial, but not impossible, bar to meet.
The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed, and then check if it is or is not allowed. This lets me check if AUP is identical to my internal sense of whether things obviously should or should not be allowed. If it is, then great, and if it’s not, then I might worry that it will run into substantial trouble in complicated scenarios that I can’t really picture. It’s a nice method of analysis because it requires few assumptions about what things are possible in what environments (compared to “look at a bunch of environments and see if the plans AUP comes up with should be allowed”) and minimal philosophising (compared to “meditate on the equations and see if they’re analytically identical to how I feel impact should be defined”).
[EDIT: added content to this section]
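To make that broad-answer check concrete, here’s a minimal sketch of what it could look like if written down; `measure_allows` and the example plans are hypothetical stand-ins for illustration, not AUP itself or any plan from the post.

```python
# Minimal sketch of the "broad answer" check: compare an impact measure's
# verdicts on hand-picked plans against a prior judgment of whether each plan
# obviously should or should not be allowed. `measure_allows` and the example
# plans are hypothetical stand-ins, not AUP or anything from the post.

def evaluate_measure(measure_allows, labelled_plans):
    """Return the two disagreement types discussed in this thread."""
    allowed_should_nots = []  # high-impact plans the measure lets through
    blocked_shoulds = []      # low-impact plans the measure forbids
    for plan, should_be_allowed in labelled_plans:
        allowed = measure_allows(plan)
        if allowed and not should_be_allowed:
            allowed_should_nots.append(plan)
        elif not allowed and should_be_allowed:
            blocked_shoulds.append(plan)
    return allowed_should_nots, blocked_shoulds


if __name__ == "__main__":
    # Toy stand-in measure: forbid anything tagged as drastic.
    toy_measure = lambda plan: "drastic" not in plan
    plans = [
        ("make one paperclip", True),
        ("drastic: disassemble the factory for raw materials", False),
        ("save humanity from a natural disaster", True),  # the contested case below
    ]
    should_nots, shoulds = evaluate_measure(toy_measure, plans)
    print("allowed should-nots:", should_nots)
    print("blocked shoulds:", shoulds)
```

The point of the sketch is just that the two failure modes separate cleanly: a measure can be doing well on the should-nots while still blocking too many shoulds, which is the asymmetry that comes up below.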
Because I claim [that saving humanity from natural disasters] is high impact, and not the job of a low impact agent. I think a more sensible use of a low-impact agent would be as a technical oracle, which could help us design an agent which would do this. Making this not useless is not trivial, but that’s for a later post. I think it might be possible, and more appropriate than using it for something as large as protection from natural disasters.
Firstly, saving humanity from natural disasters doesn’t at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it’s plausibly in a different natural reference class than causing natural disasters. Secondly, your description of a use case for a low-impact agent is interesting and one that I hadn’t thought of before, but I still would hope that they could be used in a wider range of settings (basically, whenever I’m worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).
if you tell me that AUP won’t allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X’ that are pretty similar to X along the relevant dimension that made me bring up X.
I think there is an argument for this whenever we have “it won’t do X because of the anti-survival incentive and personal risk”: “then it builds a narrow subagent to do X”.
The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed,
As I said in my other comment, I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?
Firstly, saving humanity from natural disasters doesn’t at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it’s plausibly in a different natural reference class than causing natural disasters.
Why would that be so? That doesn’t seem value agnostic. I do think that the approval incentives help us implicitly draw this boundary, as I mentioned in the other comment.
I still would hope that they could be used in a wider range of settings (basically, whenever I’m worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).
I agree. I’m not saying that the method won’t work for these, to clarify.
I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?
Two points:
Firstly, the first section of this comment by Rohin models my opinions quite well, which is why some sort of asymmetry bothers me. Another angle on this is that I think it’s going to be non-trivial to relax an impact measure to allow enough low-impact plans without also allowing a bunch of high-impact plans.
Secondly, here and in other places I get the sense that you want comments to be about the best successor theory to AUP as outlined here. I think that what this best successor theory is like is an important question when figuring out whether you have a good line of research going or not. That being said, I have no idea what the best successor theory is like. All I know is what’s in this post, and I’m much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that’s what I’m primarily doing.
Firstly, saving humanity from natural disasters… seems like it’s plausibly in a different natural reference class than causing natural disasters.
Why would that be so? That doesn’t seem value agnostic.
It seems value agnostic to me because it can be generated from the urge ‘keep the world basically like how it used to be’.
I have no idea what the best successor theory is like. All I know is what’s in this post, and I’m much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that’s what I’m primarily doing.
But in this same comment, you also say
I think it’s going to be non-trivial to relax an impact measure
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
I’m making my predictions based on my experience working with the method. Many of the flaws are on the list not because I think I couldn’t find a way around them, but because I’m one person with a limited amount of time. It will probably turn out that some of them are non-trivial, but pre-judging them doesn’t seem very appropriate.
I indeed want people to share their ideas for improving the measure. I also welcome questioning specific problems or pointing out new ones I hadn’t noticed. However, arguing whether certain problems subjectively seem hard or maybe insurmountable isn’t necessarily helpful at this point in time. As you said in another comment,
I’m not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures—see the ones that couldn’t be simultaneously satisfied until this one did.
It seems value agnostic to me because it can be generated from the urge ‘keep the world basically like how it used to be’.
True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what “kinds of things” can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.
Primarily does not mean exclusively, and lack of confidence in implications between desiderata doesn’t imply lack of confidence in opinions about how to modify impact measures, which itself doesn’t imply lack of opinions about how to modify impact measures.
People keep saying things like [‘it’s non-trivial to relax impact measures’], and it might be true. But on what data are we basing this?
This is according to my intuitions about which theories do which things, intuitions shaped by learning a bunch of mathematics, reading about algorithms in AI, and thinking about impact measures. This isn’t a rigorous argument, or even necessarily an extremely reliable method of ascertaining truth (I’m probably quite sub-optimal at converting experience into intuitions), but it’s still my impulse.
True, but avoiding lock-in seems value laden for any approach doing that, reducing back to the full problem: what “kinds of things” can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.
My sense is that we agree that this looks hard but shouldn’t be dismissed as impossible.
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
What? I’ve never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I’m quite certain it can’t be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well).
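For what it’s worth, the certainty here comes from the standard adversary argument rather than from having tried it; here’s a minimal sketch, under the usual assumption that the algorithm only gets to test cells for equality with the query key:

```latex
% Standard adversary argument (background sketch, assuming equality probes only).
Suppose a deterministic algorithm probes at most $k < n$ cells of an unsorted
array of $n$ distinct keys. An adversary answers ``no match'' to every probe.
Some cell is then never probed, and placing the query key there is consistent
with every answer given, so the algorithm cannot tell ``present'' from
``absent''. Hence $\Omega(n)$ probes are needed in the worst case, and
$O(\log n)$ unordered search is impossible.
```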
I’m clearly not saying you can never predict things before trying them; I’m saying that I haven’t seen evidence that this particular problem is more or less challenging than dozens of similar-feeling issues I handled while constructing AUP.