if you tell me that AUP won’t allow plan X but will allow plan Y, then I have to be convinced that Y will be possible whenever X was, and that this is also true for X’ that are pretty similar to X along the relevant dimension that made me bring up X.
I think there is an argument for this whenever we have “it won’t do X because of the anti-survival incentive and personal risk”: “then it builds a narrow subagent to do X”.
The broad answer is that if I want to figure out if AUP is a good impact regularisation technique, then one of the easiest ways I can do that is to check a plan that seems like it obviously should or should not be allowed.
As I said in my other comment, I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?
Firstly, saving humanity from natural disasters doesn’t at all seem like the thing I was worried about when I decided that I needed impact regularisation, and seems like it’s plausibly in a different natural reference class than causing natural disasters.
Why would that be so? That doesn’t seem value agnostic. I do think that the approval incentives help us implicitly draw this boundary, as I mentioned in the other comment.
I still would hope that they could be used in a wider range of settings (basically, whenever I’m worried that a utility function has an unforeseen maximum that incentivises extreme behaviour).
I agree. I’m not saying that the method won’t work for these, to clarify.
I think we have reasonable evidence that it’s hitting the should-nots, which is arguably more important for this kind of measure. The question is, how can we let it allow more shoulds?
Two points:
Firstly, the first section of this comment by Rohin models my opinions quite well, which is why some sort of asymmetry bothers me. Another angle on this is that I think it’s going to be non-trivial to relax an impact measure to allow enough low-impact plans without also allowing a bunch of high-impact plans.
Secondly, here and in other places I get the sense that you want comments to be about the best successor theory to AUP as outlined here. I think the question of what this best successor theory looks like is an important one when figuring out whether you have a good line of research going. That being said, I have no idea what the best successor theory is like. All I know is what’s in this post, and I’m much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that’s what I’m primarily doing.
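To make the relaxation worry concrete, here is a toy sketch (all plan names, rewards, and impact numbers are hypothetical illustrations, not taken from the post) of an agent that scores each plan as reward minus a scaled impact penalty. When a forbidden plan happens to have a better reward-to-impact ratio than a desirable one, no single lowering of the penalty weight admits the “should” without also admitting the “should-not”:

```python
# Toy sketch: "relaxing" an impact measure by lowering the penalty weight lam.
# All plans and numbers are hypothetical.
plans = {
    "make coffee":     {"reward": 1.0,  "impact": 0.2},  # low-impact "should"
    "cure disease":    {"reward": 5.0,  "impact": 3.0},  # high-value "should"
    "seize resources": {"reward": 10.0, "impact": 5.0},  # high-impact "should-not"
}

def allowed(plan, lam):
    # A plan beats doing nothing (score 0) iff reward - lam * impact > 0.
    return plan["reward"] - lam * plan["impact"] > 0.0

for lam in (3.0, 1.6):
    print(lam, sorted(name for name, p in plans.items() if allowed(p, lam)))
# With lam = 3.0 only "make coffee" clears the bar; by the time lam is low
# enough to admit "cure disease" (lam < 5/3), "seize resources" (admitted for
# any lam < 2) is already through, since its reward-to-impact ratio is higher.
```

The sketch only shows that a single scalar relaxation can be insufficient; a real relaxation could of course change the penalty’s shape rather than just its weight.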
Firstly, saving humanity from natural disasters… seems like it’s plausibly in a different natural reference class than causing natural disasters.
Why would that be so? That doesn’t seem value agnostic.
It seems value agnostic to me because it can be generated from the urge ‘keep the world basically like how it used to be’.
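As a minimal sketch of why that urge is value agnostic (the world features and states below are hypothetical): a penalty that simply counts deviations from a baseline snapshot of the world makes no reference to anyone’s values, and it scores letting a disaster unfold as a bigger change than averting it:

```python
# Minimal sketch (toy states, all hypothetical): the urge "keep the world
# basically like how it used to be" as distance from a baseline snapshot.

def impact_penalty(state, baseline):
    # Value-agnostic: counts features that differ from the baseline,
    # regardless of whether the change is good or bad for anyone.
    return sum(state[k] != baseline[k] for k in baseline)

baseline = {"city_intact": True,  "asteroid_inbound": True}
deflect  = {"city_intact": True,  "asteroid_inbound": False}  # disaster averted
strike   = {"city_intact": False, "asteroid_inbound": False}  # disaster happens

print(impact_penalty(deflect, baseline))  # 1
print(impact_penalty(strike, baseline))   # 2
```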
I have no idea what the best successor theory is like. All I know is what’s in this post, and I’m much better at figuring out what will happen with the thing in the post than figuring out what will happen with the best successors, so that’s what I’m primarily doing.
But in this same comment, you also say
I think it’s going to be non-trivial to relax an impact measure
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
I’m making my predictions based on my experience working with the method. The reason many of the flaws are on the list is not that I don’t think I could find a way around them, but that I’m one person with a limited amount of time. It will probably turn out that some of them are non-trivial, but pre-judging them doesn’t seem very appropriate.
I indeed want people to share their ideas for improving the measure. I also welcome questioning specific problems or pointing out new ones I hadn’t noticed. However, arguing whether certain problems subjectively seem hard or maybe insurmountable isn’t necessarily helpful at this point in time. As you said in another comment,
I’m not very confident in any alleged implications between impact desiderata that are supposed to generalise over all possible impact measures—see the ones that supposedly couldn’t be simultaneously satisfied until this measure satisfied them.
It seems value agnostic to me because it can be generated from the urge ‘keep the world basically like how it used to be’.
True, but avoiding lock-in seems value-laden for any approach that does so, which reduces back to the full problem: what “kinds of things” can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.
“Primarily” does not mean “exclusively”, and lack of confidence in implications between desiderata doesn’t imply lack of confidence in opinions about how to modify impact measures, which itself doesn’t imply lack of opinions about how to modify impact measures.
People keep saying things like [‘it’s non-trivial to relax impact measures’], and it might be true. But on what data are we basing this?
This is according to my intuitions about which theories do which things, intuitions shaped by learning mathematics, reading about algorithms in AI, and thinking about impact measures. This isn’t a rigorous argument, or even necessarily a very reliable way of ascertaining truth (I’m probably quite sub-optimal at converting experience into intuitions), but it’s still my impulse.
True, but avoiding lock-in seems value-laden for any approach that does so, which reduces back to the full problem: what “kinds of things” can change? Even if we knew that, who can change things? But this is the clinginess / scapegoating tradeoff again.
My sense is that we agree that this looks hard but shouldn’t be dismissed as impossible.
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
What? I’ve never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I’m quite certain it can’t be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well).
I’m clearly not saying you can never predict things before trying them, I’m saying that I haven’t seen evidence that this particular problem is more or less challenging than dozens of similar-feeling issues I handled while constructing AUP.
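The O(log n) analogy above can be checked concretely. As a small sketch: bisect-based search is O(log n) but silently assumes sorted input, so on an unordered list it can miss an element that a linear scan finds—which is why the sorted-input precondition, like a sound claim about relaxing an impact measure, can be known without trying every workaround:

```python
import bisect

def binary_search_contains(xs, target):
    # O(log n), but only correct when xs is sorted.
    i = bisect.bisect_left(xs, target)
    return i < len(xs) and xs[i] == target

unordered = [7, 2, 9, 4, 1]
print(binary_search_contains(unordered, 1))          # False: missed, list unsorted
print(1 in unordered)                                # True: linear scan, O(n)
print(binary_search_contains(sorted(unordered), 1))  # True once input is sorted
```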