I’m really confused why “short” would include sense (1) rather than only sense (2). If “corrigibility is about short-term preferences on reflection,” then this seems to be a claim that a corrigible AI should understand us as preferring to eat candy and junk food, because on reflection we do like how it tastes; we just choose not to eat it because of longer-term concerns. So a corrigible system ignores the longer-term concerns and interprets us as wanting candy and junk food.
Perhaps you intend sense (1) where “short” means ~100 years rather than ~10 minutes, so that the system doesn’t interpret us as wanting candy and junk food. But this creates a similar problem when we think on horizons longer than 100 years; the system wouldn’t take those thoughts seriously.
It seems much more sensible to me for “short” in the context of this discussion to mean (2) only. But perhaps I misunderstood something.
One of us just misunderstood (1); I don’t think there is any difference.
I mean preferences about what happens over the near future, but the way I rank “what happens in the near future” will likely be based on its consequences (further in the future, in other possible worlds, and so on). So I took (1) to be basically equivalent to (2).
“Terminal preferences over the near future” is not a thing I often think about and I didn’t realize it was a candidate interpretation (normally when I write about short-term preferences I’m writing about things like control, knowledge, and resource acquisition).