Yep, ontological crises are a good example of another way that goals can be unstable. I’m not sure I understood how 2 is different from 1.
I’m also not sure that rebinding to the new ontology is the right approach (although I don’t have any specific good approach). When I try to think about this kind of problem I get stuck on not understanding the details of how an ontology/worldmodel can or should work. So I’m pretty enthusiastic about work that clarifies my understanding here (where infrabayes, natural latents and finite factored sets all seem like the sort of thing that might lead to a clearer picture).
I’m not sure I understood how 2 is different from 1.
(1) is the problem that utility rebinding might just not happen properly by default. An extreme example is how AIXI-atomic fails here. Intuitively I’d guess that once the AI is sufficiently smart and self-reflective, it might just naturally see the correspondence between the old and the new ontology and rebind values accordingly. But before that point it might undergo significant value drift. (E.g. if it valued warmth and then learns that there are actually just moving particles, it might drop that value shard because it thinks there’s no such (ontologically basic) thing as warmth.)
(2) is the problem that the initial ontology of the AI is insufficient to fully capture human values, so even if you specify human values as well as possible in that ontology, the AI would still lack the underlying intuitions humans would use to rebind their values, and it might rebind differently. That is, while I think many normal abstractions we use like “tree” are quite universal natural abstractions where the rebinding is unambiguous, many value-laden concepts like “happiness” are much less natural abstractions for non-human minds, and it’s actually quite hard to formally pin down what we value here. (This problem is human-value-specific and perhaps less relevant if you aim the AI at a pivotal act.)
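To make the warmth example a bit more concrete, here is a toy Python sketch. It is only an illustration under made-up assumptions, not a model of AIXI-atomic or of any real agent: the state dictionaries, the `speeds` key, and both "bridge" functions are hypothetical names I'm introducing here. It shows (1) a value silently dropping to zero when the old concept no longer exists in the new ontology, and (2) two candidate rebindings that agree on some states but disagree on others, so the old ontology alone doesn't settle which one is right.

```python
from statistics import mean


def old_utility(state: dict) -> float:
    """Utility written in the old ontology, where 'warmth' is a basic quantity."""
    return state["warmth"]


def naive_utility(state: dict) -> float:
    """No rebinding: a particle-level state has no 'warmth' key,
    so the value shard silently contributes nothing."""
    return state.get("warmth", 0.0)


def bridge_mean_energy(state: dict) -> dict:
    """Hypothetical bridge: warmth := mean squared particle speed."""
    return {"warmth": mean(v * v for v in state["speeds"])}


def bridge_max_energy(state: dict) -> dict:
    """Rival hypothetical bridge: warmth := largest squared particle speed."""
    return {"warmth": max(v * v for v in state["speeds"])}


def rebound_utility(state: dict, bridge) -> float:
    """Translate the new-ontology state back into the old one, then score it."""
    return old_utility(bridge(state))


# Two states described only in the new (particle) ontology.
uniform = {"speeds": [2.0, 2.0, 2.0]}
skewed = {"speeds": [0.0, 0.0, 3.0]}

if __name__ == "__main__":
    for name, state in [("uniform", uniform), ("skewed", skewed)]:
        print(
            name,
            naive_utility(state),
            rebound_utility(state, bridge_mean_energy),
            rebound_utility(state, bridge_max_energy),
        )
```

On the uniform state the two bridges give the same answer, so data of that kind can't distinguish them; on the skewed state they come apart. For something like "tree" I'd expect the candidate bridges to mostly coincide, whereas for "happiness" I'd expect many more cases like the skewed one.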
When I try to think about this kind of problem I get stuck on not understanding the details of how an ontology/worldmodel can or should work.
Not sure if this helps, but I heard that Vivek’s group came up with the same diamond maximizer proposal as I did, so if you remember that, you can use it as a simple toy frame to think about rebinding. But sure, we need a much better frame for thinking about the AI’s world model.
I see, thanks! I agree these are both really important problems.