i want a better conceptual understanding of what “fundamental values” means, and how to disentangle that from beliefs (ex.: in an LLM). like, is there a meaningful way we can say that a “cat classifier” is valuing classifying cats even though it sometimes fails?
I guess for a cat classifier, disentanglement is not possible, because it wants to classify things as cats if and only if it believes they are cats. Since values and beliefs are perfectly correlated here, there is no test we could perform which would distinguish what it wants from what it believes.
Though we could assume we don’t know what the classifier wants. If it doesn’t classify a cat image as “yes”, it could be because it is (say) actually a dog classifier, and it correctly believes the image contains something other than a dog. Or it could be because it is indeed a cat classifier, but it mistakenly believes the image doesn’t show a cat.
One way to find out would be to give the classifier an image of the same subject, but in higher resolution or from another angle, and check whether it changes its classification to “yes”. If it is a cat classifier, it is likely it won’t make the mistake again, so it probably changes its classification to “yes”. If it is a dog classifier, it will likely stay with “no”.
This assumes that mistakes are random and somewhat unlikely, so will probably disappear when the evidence is better or of a different sort. Beliefs react to such changes in evidence, while values don’t.
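This evidence-change test can be sketched in code. Everything here is illustrative: `classify` stands in for a hypothetical black-box classifier returning "yes"/"no", and the image names are made up.

```python
def probe_values(classify, image, better_images):
    """Try to tell a mistaken belief apart from a different value.

    classify: hypothetical black-box classifier, returns "yes" or "no".
    image: an image the classifier answered "no" on, which we think shows a cat.
    better_images: the same subject in higher resolution or from other angles.
    """
    if classify(image) == "yes":
        return "already says yes; nothing to disentangle"
    # Improve the evidence and see whether the answer changes.
    revised = [classify(img) for img in better_images]
    if any(ans == "yes" for ans in revised):
        # The answer tracked the evidence: likely a mistaken belief.
        return "belief error: probably a cat classifier that misjudged the image"
    # The answer was insensitive to better evidence: likely a different target.
    return "stable no: probably values something other than cats"
```

The key assumption is exactly the one stated above: beliefs respond to changes in evidence, so a belief error should flip under better evidence, while a stable “no” suggests the system simply isn’t aiming at cats.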
Thanks for engaging with my post. I keep thinking about that question.
I’m not quite sure what you mean by “values and beliefs are perfectly correlated here”, but I’m guessing you mean they are “entangled”.
> there is no test we could perform which would distinguish what it wants from what it believes.
Ah yeah, that seems true for all systems (at least if you can only look at their behaviors and not their mind); ref.: Occam’s razor is insufficient to infer the preferences of irrational agents. Summary: in principle, every possible value-system has some belief-system that can lead to any given set of actions.
So, in principle, the cat classifier, viewed from the outside, could actually be a human mind wanting to live a flourishing human life, but with a decision-making process so wrong that the human does nothing but say “cat” when they see a cat, thinking this will lead them to achieve all their deepest desires.
I think the paper says noisy errors would cancel each other out (?), but correlated errors wouldn’t go away. One way to handle them would be coming up with “minimal normative assumptions”.
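A toy illustration of why random errors wash out while correlated ones don’t (all numbers are made up for illustration): suppose an agent truly prefers option A over B, and we try to recover that preference from noisy observed choices.

```python
import random

random.seed(0)

TRUE_VALUES = {"A": 1.0, "B": 0.8}  # the agent really prefers A

def observed_choice(noise, bias_against_a=0.0):
    # The agent picks whichever option it (mis)perceives as more valuable.
    # `noise` is random perception error; `bias_against_a` is a correlated error.
    perceived = {
        "A": TRUE_VALUES["A"] + random.gauss(0, noise) - bias_against_a,
        "B": TRUE_VALUES["B"] + random.gauss(0, noise),
    }
    return max(perceived, key=perceived.get)

def estimate_preference(n, noise, bias_against_a=0.0):
    # Fraction of observed choices that favor A.
    picks = [observed_choice(noise, bias_against_a) for _ in range(n)]
    return picks.count("A") / n

# With purely random errors, the majority choice reveals the true preference.
p_noisy = estimate_preference(10_000, noise=0.3)
# With a correlated error, no amount of data recovers it: the estimate
# stays wrong, even though the true values are identical in both cases.
p_biased = estimate_preference(10_000, noise=0.3, bias_against_a=0.5)
```

Here `p_noisy` comes out well above 0.5 while `p_biased` stays well below it, which is the sense in which correlated errors “don’t go away” with more data.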
I guess that’s as relevant to the “value downloading” problem as it is to the “value (up)loading” one. (I just coined the term “value downloading” to refer to the problem of determining human values, as opposed to the problem of programming values into an AI.)
The solution-space for determining the values of an agent, at a high level, seems to be (I’m sure that’s too simplistic, and maybe even a bit confused, but just thinking out loud):
- Look in their brain directly to understand their values (and maybe that also requires solving the symbol-grounding problem)
- Determine their planner (ie. “decision-making process”) (ex.: using some interpretability methods), and determine their values from the policy and the planner
- Make minimal normative assumptions about their reasoning errors and approximations to determine their planner from their behavior (/policy)
- Augment them to make their planners flawless (I think your example fits here: improving the planner by improving the image resolution; I love that thought 💡)
- Ask the agent questions directly about their fundamental values that don’t require any planning (?)
Approaches like “iterated amplification” correspond to some combination of the above.
But going back to my original question, I think a similar way to put it is that I wonder how complex the concept of “preferences”/“wanting” is. Is it a (messy) concept that’s highly dependent on our evolutionary history (ie. not what we want, which definitely is evolution-dependent, but the concept of wanting itself), or is it a concept that all alien civilizations use in exactly the same way as us? It seems like a fundamental concept, but can we define it in a fully reductionist (and concise) way? What’s the simplest example of something that “wants” things? What’s the simplest planner a wanting-thing can have? Is it no planner at all?
A policy seems well defined: it’s basically an input-output map. We’re intuitively thinking of a policy as a planner plus an optimization target, so if either of the latter two can be defined robustly, then it seems like we should be able to define the other as well. Although maybe for a given policy there are many possible (planner, optimization target) pairs that produce it; perhaps Occam’s razor would be helpful here.
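The non-identifiability worry can be made concrete with a minimal sketch: two opposite value systems, paired with opposite planners, can induce exactly the same input-output map, so the policy alone can’t distinguish them. All names here are invented for illustration.

```python
# A policy is just an input-output map: observation -> action.
ACTIONS = ["left", "right"]

def rational_planner(target, obs):
    # Picks the action the target scores highest.
    return max(ACTIONS, key=lambda a: target(obs, a))

def anti_rational_planner(target, obs):
    # Systematically picks the action the target scores lowest.
    return min(ACTIONS, key=lambda a: target(obs, a))

def likes_right(obs, action):
    return 1.0 if action == "right" else 0.0

def likes_left(obs, action):
    return 1.0 if action == "left" else 0.0

# Opposite values composed with opposite planners give the identical
# policy on every observation: both always go "right".
observations = range(10)
policy_1 = [rational_planner(likes_right, o) for o in observations]
policy_2 = [anti_rational_planner(likes_left, o) for o in observations]
```

This is the cat-classifier-as-confused-human case in miniature: from behavior alone, a rational agent that likes “right” is indistinguishable from a maximally irrational agent that likes “left”. Occam’s razor would prefer the simpler (planner, target) decomposition, but as the paper above argues, simplicity alone doesn’t settle it.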
Relatedly, I also just read Reward is not the optimization target, which is relevant and overlaps a lot with ideas I wanted to write about (ie. neural-net-executors, not reward-maximizers, as a reference to Adaptation-Executers, not Fitness-Maximizers). A reward function R will only select a policy π that wants R if wanting R is the best way to achieve R in the environment the policy is being developed in. (I’m speaking loosely: technically not if it’s the “best” way, but just if it’s the way the weight-update function works.)
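The “reward acts through the update rule, not as an internal goal” point can be sketched with a toy reinforcement loop. This is a deliberately simplified stand-in (made-up actions, a bare propensity table), not a model of any real training setup.

```python
import random

random.seed(1)

# Reward only enters through the update rule: it gates how much the
# action actually taken gets reinforced. The "policy" is just a table
# of action propensities; it contains no representation of reward.
propensity = {"meow": 1.0, "bark": 1.0}

def reward(action):
    return 1.0 if action == "meow" else 0.0

def sample_action():
    # Sample an action proportionally to its current propensity.
    total = sum(propensity.values())
    r = random.uniform(0, total)
    for action, weight in propensity.items():
        r -= weight
        if r <= 0:
            return action
    return action

for _ in range(500):
    a = sample_action()
    propensity[a] += reward(a)  # the weight-update, not a goal lookup

# After training, the policy just has strong "meow" circuitry; "bark"
# was never reinforced. Nothing in the table "wants" reward.
```

The trained table ends up executing meow-behavior reliably, but it would keep meowing even if the reward function were deleted, which is the executor-not-maximizer distinction in miniature.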
Anyway, that’s a thread that seems valuable to pull more. If you have any other thoughts or pointers, I’d be interested 🙂