I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s a great thing in that we just have to make it a little more lenient, but have accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard.
For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.
I guess I just don’t understand your argument here for why this won’t work. If it’s catching too many false positives, that’s a great thing in that we just have to make it a little more lenient, but have accomplished the seemingly more difficult task of stopping malignant behavior. If it isn’t catching too many, as I suspect but am not totally convinced is the case, we’re good to go in this regard.
Edit to add: the following is just to illustrate what I don’t understand about your argument (needless to say I don’t suggest the two things are comparable in any way).
All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than x. You can set x=1 and catch too many false positives while stopping malignant behavior. Decreasing x will make the filter more lenient, but at no point will it be useful.
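For concreteness, that random filter might be sketched like this (a toy illustration only; the function name and interface are invented for this example):

```python
import random

def random_filter(action, x):
    """Accept `action` iff a uniform draw beats the threshold x.

    random.random() samples from [0.0, 1.0), so with x = 1 every
    action is rejected: no malignant behavior gets through, but
    nothing useful does either. Lowering x admits more actions,
    yet acceptance never depends on whether the action is good.
    """
    return random.random() > x
```

The point of the illustration: tuning x only changes how often the filter accepts, never what it accepts.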
If you argue that the Intent Verification filter can be used to prevent the bad tricks we discussed, you need to show that you can use it to filter out the bad actions while still allowing good ones (and not only in time steps in which some action can yield sufficiently high utility increase). My comment above is an argument for it not being the case.
For example, if we do end up having to just ride the optimal plan until it becomes too high-impact, perhaps we can simply keep replaying the favorable first part of the plan (where it tries to please us by actually doing what we want), over and over.
Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal, we can’t ride it even a bit. Also, I’m not sure I understand how “replaying” would be implemented in a useful way.
All this can be said of a filter that accepts an action iff a random number in the range [0,1] is greater than x. You can set x=1 and catch too many false positives while stopping malignant behavior. Decreasing x will make the filter more lenient, but at no point will it be useful.
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
while still allowing good ones (and not only in time steps in which some action can yield sufficiently high utility increase). My comment above is an argument for it not being the case.
No, your argument is that there are certain false positives, which I don’t contest. I even listed this kind of thing as an open question, and am interested in further discussion of how we can go about ensuring IV is properly-tuned.
You’re basically saying, “There are false positives, so the core insight that allows IV to work to the extent it does is wrong, and unlikely to be fixable.” I disagree with this conclusion.
If you want to discuss how we could resolve or improve this issue, I’m interested. Otherwise, I don’t think continuing this conversation will be very productive.
Assuming the optimal plan starts by pursuing some (unsafe) convergent instrumental goal, we can’t ride it even a bit. Also, I’m not sure I understand how “replaying” would be implemented in a useful way.
Well, I certainly empathize with the gut reaction, but that isn’t quite right.
Notice that the exact same actions had always been available before we restricted the available actions to the optimal plan or to nothing. I think it’s possible that we could step along the first n steps of the best plan, stopping early enough that we get just the good behavior, before any instrumental behavior is actually completed. It’s also possible that this isn’t true. This is all speculation at this point, which is why my tone in that section was also very speculative.
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
I sincerely apologize; I sometimes completely fail to communicate my intention. I gave the example of the random filter only to convey what I don’t understand about your argument (needless to say, I don’t suggest the two things are comparable in any way). I should have written that explicitly (edited). Sorry!
If you want to discuss how we could resolve or improve this issue, I’m interested.
Of course! I’ll think about this topic some more. I suggest we take this offline—the nesting level here has quite an impact on my browser :)
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
Fwiw, I would make the same argument that ofer did (though I haven’t read the rest of the thread in detail). For me, that argument is an existence proof of the following claim: if you know nothing about an impact measure, it is possible that the impact measure disallows all malignant behavior, and yet all of the difficulty is in figuring out how to make it lenient enough.
Now, obviously we know something about AUP, but it’s not obvious to me that we can make AUP lenient enough to do useful things without also allowing malignant behavior.
My present position is that it can seemingly do every task in at least one way, and we should expand the number of ways to line up with our intuitions just to be sure.