This is an amazing report!
Your taxonomy in section 4 was new and interesting to me. I would also mention the utility rebinding problem: goals can drift because the AI’s ontology changes (e.g. because it develops a deeper understanding of some domain). I guess there are actually two problems here:
Formalizing the utility rebinding mechanism so that concepts get rebound to the corresponding natural abstractions of the new deeper ontology.
For value-laden concepts, the AI likely lacks the underlying human intuitions needed to figure out how the utility ought to be rebound. E.g. given a concept like “conscious happiness”, once the AI finds which cognitive processes in our brains are associated with it, it may be ambiguous whether to rebind the concept to the existence of thoughts like ‘I notice the thought “I notice the thought <expected utility increase>”’ running through a mind/brain, or whether to rebind it in a way that also includes a cluster of sensations (e.g. tension in our faces from laughter) present in our minds/brains, or other options. (Sorry, this may be a bad example; it relies on some fuzzy thoughts of mine on qualia which might actually be wrong.)
Yep, ontological crises are a good example of another way that goals can be unstable.
I’m not sure I understand how (2) is different from (1), though.
I’m also not sure that rebinding to the new ontology is the right approach (though I don’t have a specific better approach to offer). When I try to think about this kind of problem, I get stuck on not understanding the details of how an ontology/world model can or should work. So I’m pretty enthusiastic about work that clarifies my understanding here (infra-Bayesianism, natural latents, and finite factored sets all seem like the sort of thing that might lead to a clearer picture).
(1) is the problem that utility rebinding might just not happen properly by default. An extreme example is how AIXI-atomic fails here. Intuitively, I’d guess that once the AI is sufficiently smart and self-reflective, it might just naturally see the correspondence between the old and the new ontology and rebind its values accordingly. But before that point it might undergo significant value drift. (E.g. if it valued warmth and then learns that there are actually just moving particles, it might drop that value shard because it concludes there’s no such (ontologically basic) thing as warmth.)
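To make problem (1) concrete, here’s a minimal toy sketch of the warmth example (all names and the kinetic-energy stand-in are hypothetical, not anyone’s actual proposal): a utility function keyed to an ontologically basic “warmth” silently evaluates to zero after the ontology shifts to particles, unless the concept is explicitly rebound.

```python
# Old ontology: world states are dicts of ontologically basic concepts.
def utility_old(state):
    # Values "warmth" directly, as if it were a primitive feature of the world.
    return state.get("warmth", 0.0)

# New ontology: "warmth" no longer appears; there are only particle speeds.
new_state = {"particle_speeds": [1.2, 0.8, 1.5]}

# Naive agent: looks up "warmth" in the new state, finds nothing, and the
# value shard silently drops to zero -- the AIXI-atomic-style failure.
naive_value = utility_old(new_state)  # 0.0

# Rebinding agent: maps the old concept to a corresponding abstraction in
# the new ontology (here, mean squared speed as a crude temperature proxy).
def rebind_warmth(state):
    speeds = state["particle_speeds"]
    return sum(v ** 2 for v in speeds) / len(speeds)

rebound_value = rebind_warmth(new_state)  # positive: the value survived
```

The point of the toy is just that nothing in the naive agent’s machinery forces the rebinding step to happen; it has to be supplied somehow.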
(2) is the problem that the AI’s initial ontology is insufficient to fully capture human values, so even if you specify human values as well as possible in that ontology, the AI would still lack the underlying intuitions humans would use to rebind their values, and might rebind differently. That is, while many ordinary abstractions we use, like “tree”, are quite universal natural abstractions for which the rebinding is unambiguous, many value-laden concepts like “happiness” are much less natural abstractions for non-human minds, and it’s actually quite hard to formally pin down what we value here. (This problem is specific to human values and perhaps less relevant if you aim the AI at a pivotal act.)
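Problem (2) can be sketched in the same toy style (hypothetical names and features throughout): two candidate rebindings of a value-laden concept that agree on every familiar case but diverge on an exotic mind, so the initial ontology underdetermines which one captures what we value.

```python
# Minds described in the new, deeper ontology (toy boolean features).
familiar_mind = {"higher_order_thought": True, "laughter_sensations": True}
exotic_mind = {"higher_order_thought": True, "laughter_sensations": False}

def rebind_a(mind):
    # Candidate A: "happiness" = the higher-order thought pattern alone.
    return mind["higher_order_thought"]

def rebind_b(mind):
    # Candidate B: "happiness" also requires the cluster of bodily sensations.
    return mind["higher_order_thought"] and mind["laughter_sensations"]

# Both rebindings agree on every familiar, human-like case...
agree_on_familiar = rebind_a(familiar_mind) == rebind_b(familiar_mind)

# ...but disagree on the exotic mind, and nothing in the old ontology
# says which extension we actually value.
disagree_on_exotic = rebind_a(exotic_mind) != rebind_b(exotic_mind)
```

This is why specifying values “as well as possible” in the initial ontology doesn’t pin down the rebinding: the disambiguating information lives in human intuitions, not in the spec.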
Not sure if this helps, but I heard that Vivek’s group came up with the same diamond-maximizer proposal as I did, so if you remember it, you can use it as a simple toy frame for thinking about rebinding. In any case, we need a much better frame for thinking about the AI’s world model.
I see, thanks! I agree these are both really important problems.