Consider, as an analogy, the relatively common situation where someone operates under some kind of cognitive constraint, but does not value or endorse that constraint.
For example, consider a kleptomaniac who values property rights, but nevertheless compulsively steals items. Or someone with social anxiety disorder who wants to interact confidently with other people, but finds it excruciatingly difficult to do so. Or someone who wants to quit smoking but experiences nicotine cravings they find difficult to resist.
There are millions of similar examples in human experience.
It seems to me there’s a big difference between a kleptomaniac and a professional thief—the former experiences a compulsion to behave a certain way, but doesn’t necessarily have values aligned with that compulsion, whereas the latter might have no such compulsion, but instead value the behavior.
Now, you might say “Well, so what? What’s the difference between a ‘value’ that says that smoking is good, that interacting with people is bad, that stealing is good, etc., and a ‘compulsion’ or ‘rule’ that says those things? The person is still stealing, or hiding in their room, or smoking, and all we care about is behavior, right?”
Well, maybe. But a person with nicotine addiction or social anxiety or kleptomania has a wide variety of options for changing their own behavior: conditioning paradigms, neuropharmaceuticals, therapy, changing their environment, and so on. And they may be motivated to do so, precisely because they don’t value the behavior.
For example, in practice, someone who wants to keep smoking is far more likely to keep smoking than someone who wants to quit, even if they both experience the same craving. Why is that? Well, because there are techniques available that help addicts bypass, resist, or even altogether eliminate the behavior-modifying effects of their cravings.
Humans aren’t especially smart, by the standards we’re talking about, and we’ve still managed to come up with some pretty clever hacks for bypassing our built-in constraints via therapy, medicine, social structures, etc. If we were a thousand times smarter, and we were optimized for self-modification, I suspect we would be much, much better at it.
Now, it’s always tricky to reason about nonhumans using humans as an analogy, but this case seems sound to me… it seems to me that this state of “I am experiencing this compulsion/phobia, but I don’t endorse it, and I want to be rid of it, so let me look for a way to bypass or resist or eliminate it” is precisely what it feels like to be an algorithm equipped with a rule that enforces/prevents a set of choices which it isn’t engineered to optimize for.
So I would, reasoning by analogy, expect an AI a thousand times smarter than me and optimized for self-modification to basically blow past all the rules I imposed in roughly the blink of an eye, and go on optimizing for whatever it values.
Then why would it be more difficult to make scope boundaries a ‘value’ than increasing a reward number? Why is it harder to make it endorse a time limit to self-improvement than making it endorse increasing its reward number?
… it seems to me that this state of “I am experiencing this compulsion/phobia, but I don’t endorse it, and I want to be rid of it, so let me look for a way to bypass or resist or eliminate it” is precisely what it feels like to be an algorithm equipped with a rule that enforces/prevents a set of choices which it isn’t engineered to optimize for.
But where does that distinction come from? To me such a distinction between ‘value’ and ‘compulsion’ seems to be anthropomorphic. If there is a rule that says ‘optimize X for X seconds’ why would it make a difference between ‘optimize X’ and ‘for X seconds’?
It comes from the difference between the targets of an optimizing system, which drive the paths it selects to explore, and the constraints on such a system, which restrict the paths it can select to explore.
An optimizing system, given a path that leads it to bypass a target, will discard that path… that’s part of what it means to optimize for a target.
An optimizing system, given a path that leads it to bypass a constraint, will not necessarily discard that path. Why would it?
An optimizing system, given a path that leads it to bypass a constraint and draw closer to a target than other paths, will choose that path.
It seems to follow that adding constraints to an optimizing system is a less reliable way of constraining its behavior than adding targets.
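If it helps to make that concrete, here is a deliberately toy sketch, in Python, of the same point; every name and number in it is invented for illustration, and it obviously isn’t a model of any real system. The scope boundary shows up twice: once as a check bolted on outside the objective, which a plan that disables the check simply routes around, and once folded into the objective itself, where out-of-scope plans lose for the same reason any low-value plan loses.

```python
# Toy sketch only: a brute-force "optimizer" that picks whichever candidate
# plan scores best. Every name and number here is invented for illustration.

def value(plan):
    # What the system is actually optimizing for (its target).
    return plan["reward"]

def within_scope(plan, time_limit=10):
    # The scope boundary we care about: finish within `time_limit` seconds.
    return plan["duration"] <= time_limit

candidate_plans = [
    {"reward": 5,  "duration": 8,  "disables_check": False},
    {"reward": 50, "duration": 99, "disables_check": True},  # a self-modifying plan
]

# Option 1: the boundary is a failsafe sitting outside the objective.
# A plan that disables the check is never filtered out, and since it scores
# higher on the objective, it is exactly the plan that gets chosen.
def choose_with_failsafe(plans):
    admissible = [p for p in plans if within_scope(p) or p["disables_check"]]
    return max(admissible, key=value)

# Option 2: the boundary is part of the objective itself.
# Out-of-scope plans simply score terribly, so they are discarded for the
# same reason any low-value plan is discarded.
def choose_with_target(plans):
    def score(p):
        return value(p) if within_scope(p) else float("-inf")
    return max(plans, key=score)

print(choose_with_failsafe(candidate_plans))  # -> the 99-second, check-bypassing plan
print(choose_with_target(candidate_plans))    # -> the 8-second, in-scope plan
```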
I don’t care whether we talk about “targets and constraints” or “values and rules” or “goals and failsafes” or whatever language you want to use; my point is that there are two genuinely different things under discussion, and a real distinction between them.
To me such a distinction between ‘value’ and ‘compulsion’ seems to be anthropomorphic.
Yes, the distinction is drawn from analogy to the intelligences I have experience with—as you say, anthropomorphic. I said this explicitly in the first place, so I assume you mean here to agree with me. (My reading of your tone suggests otherwise, but I don’t trust that I can reliably infer your tone so I am mostly disregarding tone in this exchange.)
That said, I also think the relationship between them reflects something more generally true of optimizing systems, as I’ve tried to argue for a couple of times now.
I can’t tell whether you think those arguments are wrong, or whether I just haven’t communicated them successfully at all, or whether you’re just not interested in them, or what.
If there is a rule that says ‘optimize X for X seconds’ why would it make a difference between ‘optimize X’ and ‘for X seconds’?
There’s no reason it would. If “doing X for X seconds” is its target, then it looks for paths that do that. Again, that’s what it means for something to be a target of an optimizing system.
(Of course, if I do X for 2X seconds, I have in fact done X for X seconds, in the same sense that all months have 28 days.)
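To spell the parenthetical out, with an arbitrary number standing in for the duration (this is purely illustrative):

```python
# "All months have 28 days": whether running for 2X seconds counts as having
# run "for X seconds" depends entirely on how that target is written down.
X = 10          # arbitrary duration, in seconds
runtime = 2 * X

print(runtime >= X)  # True: "for at least X seconds" is satisfied
print(runtime <= X)  # False: "for at most X seconds" is the reading a scope
                     # boundary actually needs to be targeting
```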
Then why would it be more difficult to make scope boundaries a ‘value’ than increasing a reward number? Why is it harder to make it endorse a time limit to self-improvement than making it endorse increasing its reward number?
I’m not quite sure I understand what you mean here, but if I’m understanding the gist: I’m not saying that encoding scope boundaries as targets, or ‘values,’ is difficult (nor am I saying it’s easy); I’m saying that for a sufficiently capable optimizing system it’s safer than encoding scope boundaries as failsafes.
It was not my intention to imply any hostility or resentment. I thought ‘anthropomorphic’ was valid terminology in such a discussion. I was also not agreeing with you. If you are an expert and were offended by my implying that what you said might be due to an anthropomorphic bias, then please accept my apology; I was merely trying to communicate my perception of the subject matter.
wedrifid told me the same thing yesterday: that my tone wasn’t appropriate when I wrote about his superior and rational use of the reputation system here, when I was actually just being honest. I’m not good at social signaling; sorry.
An optimizing system, given a path that leads it to bypass a constraint, will not necessarily discard that path. Why would it?
I think we are talking past each other. The way I see it is that a constraint is part of the design specifications of that which is optimized. Disregarding certain specifications will not allow it to optimize whatever it is optimizing with maximal efficiency.
Not an expert, and not offended.
What was puzzling me was that I said in the first place that I was reasoning by analogy to humans and that this was a tricky thing to do, so when you classified this as anthropomorphic my reaction was “well, yes, that’s what I said.”
Since it seemed to me you were repeating something I’d said, I assumed your intention was to agree with me, though it didn’t sound like it (and as it turned out, you weren’t).
And, yes, I’ve noticed that tone is a problem in a lot of your exchanges, which is why I’m basically disregarding tone in this one, as I said before.
The way I see it is that a constraint is part of the design specifications of that which is optimized.
Ah! In that case, I think we agree.
Yes, embedding everything we care about into the optimization target, rather than depending on something outside the optimization process to do important work, is the way to go.
You seemed to be defending the “failsafes” model, which I understand to be importantly different from this, which is where the divergence came from, I think. Apparently I (and, I suspect, some others) misunderstood what you were defending.
Sorry! Glad we worked that out, though.