Sorry if I’m misunderstanding something here, but doesn’t this violate the ol’ Rationalist principle that “ignorance is in the mind”[1]? I see your argument about one specific scenario above, but I’m largely unconvinced by it and wary of how it seems to fail to generalize. Insofar as we aren’t bringing certain computational stipulations onto the field[2], I don’t actually understand what it would mean for a theory of mind to be “unlearnable”.
To some extent, I could[3] simulate you and record all pairings of input-world-state to output-reaction, giving me a perfect understanding of you. This is the trivial objection, but what makes it wrong?
When I looked, it seemed we’d done a good job of filing the important articles under “Mind Projection Fallacy”, so those articles are what I’m jokingly referring to.
By which I mean objections I have heard along the lines of: there are more possible world-states than humans can differentiate, so by the pigeonhole principle at least some states must not be orderable by humans, so you cannot build a preference map of humans that is “complete”.
I mean, in theory, and also it isn’t important what we mean by “simulate” here, so please, no one get distracted if you disagree with the concept of simulation.
Perfect predictive models of me (even, to a lesser extent, perfect predictive models of my internals) are insufficient to figure out what my preferences are; see https://arxiv.org/abs/1712.05812
Is that true if I change my simulation to just simulate all the particles in your brain?
If you are going to answer “In that case, you could learn it”, then is the “strong version of the unlearnability hypothesis” completely false? If it’s not false, why?
I still think you and I might be calling different concepts “unlearnable”. I do not see how any existing physical thing can be unlearnable since it seems you could learn it by just looking at all its parts[1].
Assuming you are able to do that, of course. I don’t mean to be unfair by focusing on the “strong version” of your theory; it’s just the one that’s more assailable while I don’t have good footing with the idea (so it’s the one I can learn from without making judgement calls).
Is that true if I change my simulation to just simulate all the particles in your brain?
Yes.
The preferences of a system are an interpretation of that system, not a fact about that system. Not all interpretations are equal (most are stupid), but there is no single easy interpretation that gives preferences from brain states. And these interpretations cannot themselves be derived from observations.
I don’t understand this. As far as I can tell, I know what my preferences are, so that information should in some way be encoded in a perfect simulation of my brain. Saying there is no way at all to infer my preferences from all the information in my brain seems to contradict the fact that I can do it right now, even if my telling them to you isn’t sufficient for you to infer them.
Once an algorithm is specified, there is no extra information left to specify how it feels from the inside. I don’t see how there can be any more information necessary, on top of a perfect model of me, to specify my feeling of having certain preferences.
The theoretical argument can be found at https://arxiv.org/abs/1712.05812: basically, “goals plus (ir)rationality” contains strictly more information than “full behaviour or policy”.
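A minimal sketch of that point, with a made-up two-state agent rather than anything taken from the paper: two opposite (planner, reward) interpretations produce exactly the same behaviour, so observing behaviour alone cannot tell them apart.

```python
# A toy, hypothetical example (not from the paper): a "rational" planner paired
# with reward R and an "anti-rational" planner paired with reward -R choose
# exactly the same actions in every state, so observing behaviour alone cannot
# distinguish the two (planner, reward) interpretations.

STATES = ["hungry", "full"]
ACTIONS = ["eat", "wait"]

# Hypothetical reward function: eating is good when hungry, waiting when full.
reward = {("hungry", "eat"): 1.0, ("hungry", "wait"): 0.0,
          ("full", "eat"): 0.0, ("full", "wait"): 1.0}
negated_reward = {k: -v for k, v in reward.items()}

def rational_planner(r):
    """Pick the action that maximises r in each state."""
    return {s: max(ACTIONS, key=lambda a: r[(s, a)]) for s in STATES}

def anti_rational_planner(r):
    """Pick the action that minimises r in each state."""
    return {s: min(ACTIONS, key=lambda a: r[(s, a)]) for s in STATES}

# "Wants food and plans well" vs "hates food and plans terribly":
policy_1 = rational_planner(reward)
policy_2 = anti_rational_planner(negated_reward)

assert policy_1 == policy_2   # identical behaviour...
print(policy_1)               # {'hungry': 'eat', 'full': 'wait'}
# ...despite the two interpretations imputing opposite preferences.
```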
Humans have a theory of mind that allows us to infer the preferences and rationality of others (and ourselves) with a large amount of agreement from human to human. In computer science terms, we can take agent behaviour and add “labels” about the agent’s goals (“this human is ‘happy’”; “they have ‘failed’ to achieve their goal”, and so on).
But accessing this theory of mind is not trivial; we either have to define it explicitly, or point to where in the human mind it resides (or, most likely, a mixture of the two). One way or another, we need to give the AI enough labelled information that it can correctly infer this theory of mind: unlabelled information (i.e. pure observations) is not enough.
If we have access to the internals of the human brain, the task is easier, because we can point to various parts of it and say things like “this is a pleasure centre”, “this part is involved in retrieval of information”, and so on. We still need labelled information, but we can (probably) get away with less.
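To make the role of labels concrete, here is a continuation of the toy sketch above (again, every name and label is hypothetical): a single label from a human theory of mind breaks the tie that pure observation could not.

```python
# Continuing the toy example above (all labels hypothetical): one theory-of-mind
# label, supplied by a human rather than read off from behaviour, rules out the
# anti-rational interpretation that observation alone could not exclude.

reward = {("hungry", "eat"): 1.0, ("hungry", "wait"): 0.0,
          ("full", "eat"): 0.0, ("full", "wait"): 1.0}
negated_reward = {k: -v for k, v in reward.items()}

# Label from a human theory of mind: eating while hungry counts as success.
labels = {("hungry", "eat"): "succeeded"}

def consistent_with_labels(r):
    """A candidate reward function must agree with the labels: anything
    labelled 'succeeded' has to be good under that reward."""
    return all(r[(s, a)] > 0
               for (s, a), verdict in labels.items()
               if verdict == "succeeded")

print(consistent_with_labels(reward))          # True:  the "rational" reading survives
print(consistent_with_labels(negated_reward))  # False: the "anti-rational" reading is ruled out
```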
I think I understand now. My best guess is that if your proof were applied to my example, the conclusion would be that my example only pushes the problem back. To specify human values via a method like the one I was suggesting, you would still need to specify the part of the algorithm that “feels like” it has values, which is a similar type of problem.
I think I hadn’t grokked that your proof says something about the space of all abstract value/knowledge systems, whereas my thinking was solely about humans. As I understand it, an algorithm that picks out human values from a simulation of the human brain will correspondingly do worse on other types of mind.
If preferences cannot be “learned” from a full physical model of a brain, doesn’t that mean a human theory of mind is unlearnable as well?
I don’t see a good reason to privilege having the original copy of a brain here. If you’re willing to grant me that a brain can be copied, there is nothing the brain-holder should be unable to learn that the brain-user knows.
See comment here: https://www.lesswrong.com/posts/kMJxwCZ4mc9w4ezbs/how-an-alien-theory-of-mind-might-be-unlearnable?commentId=iPitpgNxwJH2e98CK