That’s a good point. I think the distinction is that these people are modifying their own instrumental values, but leaving their terminal values (the big meaning of life blob of computation) unchanged. I’d go so far as to say that people frequently do this trick by mistake, when they convince themselves that they have various terminal values. This certainly explains things like happy death spirals.
On the other hand, this would be very difficult (impossible?) to test.
EDIT: I’ve given this a bit more thought, and I wonder what it would feel like from the inside to be a machine learning algorithm that could make small, limited self-modifications to its own utility function, including its optimization criteria. This seems like a “simple” enough hack that evolution could have generated it. It also seems to mirror real human psychology surprisingly well.
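To make the thought experiment a little more concrete, here is a minimal sketch of what I have in mind (the `Agent` class, the weight representation, and the “endorsed direction” meta-criterion are all my own invented illustration, not anything from an existing system): an agent whose utility is a weighted sum of features, and which is only ever allowed tiny, individually harmless edits to those weights.

```python
import numpy as np

class Agent:
    """Toy agent whose utility is a weighted sum of outcome features.

    It may make *small* self-modifications to its own weights, steered by
    whatever direction it currently endorses. The point of the sketch is
    that many tiny, locally reasonable edits can drift the utility
    function far from where it started.
    """

    def __init__(self, weights, step=0.01):
        self.weights = np.asarray(weights, dtype=float)  # current "values"
        self.step = step                                 # max size of one edit

    def utility(self, outcome_features):
        return float(self.weights @ outcome_features)

    def self_modify(self, endorsed_direction):
        """Nudge the weights slightly toward the currently endorsed values.

        `endorsed_direction` stands in for "the values I wish I had right
        now"; each edit is clipped so that no single step looks drastic.
        """
        delta = np.clip(endorsed_direction - self.weights, -self.step, self.step)
        self.weights = self.weights + delta


# Hypothetical example: an agent that starts out with some out-group empathy,
# but repeatedly endorses a direction that suppresses it.
agent = Agent(weights=[1.0, 0.5])          # [in-group loyalty, out-group empathy]
suppress_empathy = np.array([1.0, -1.0])   # the direction it currently endorses

for _ in range(200):
    agent.self_modify(suppress_empathy)

print(agent.weights)  # the empathy weight has drifted well below zero
```

None of the individual steps feel like a big change from the inside, which is roughly why I think this failure mode is easy to stumble into.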
I’m imagining trying to answer the question “what would I like to change my utility function to?”, while simultaneously not fully understanding the dangers of messing around like that. It seems like this could easily generate people like religious extremists, even if earlier versions of those people would never have deliberately tried to become that twisted. If the other side seems completely wrong and evil, then I can picture disliking parts of myself that resemble the other side, as well as any empathy I may have for them. I can imagine how suppressing those parts of myself would lead to extremism.
I wonder what the official Yudkowsky position on this is. More importantly, I wonder what happens if you get this question wrong while trying to build a Friendly AI. It seems like there might be issues if you assume a static Coherent Extrapolated Volition when it is actually dynamically changing, or vice versa.