I liked the painting metaphor, and the diagram of brain-like AGI motivation!
Got a couple of questions below.
It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that nameless pattern.
I agree that if you haven’t seen something, then it’s not exactly part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your “like” region? (It was somewhere close to things you liked.)
For example, say someone enjoys the affection of cat species A and B. Say they haven’t experienced a cat of species C, which is similar in some ways to species A and B. Then they would probably get a positive reward from meeting cat C (affection), even though their world-model didn’t include it beforehand. Therefore, they should tell us afterwards that, in their previous world-model, cat C should have been in the “like cat” region.
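To make that intuition concrete, here’s a toy sketch in Python (my own illustration, not anything from the post; the feature vectors and similarity-weighted scoring are made up for the example). A value estimate learned only from cats A and B, plus one disliked thing, assigns a positive score to never-before-seen cat C purely because its features resemble A and B:

```python
import numpy as np

# Toy "experience": feature vectors for things the agent has actually
# encountered, together with the reward each one produced.
experienced = {
    "cat_A":  (np.array([1.0, 0.9, 0.1]), +1.0),   # fluffy, affectionate
    "cat_B":  (np.array([0.9, 1.0, 0.2]), +0.8),
    "vacuum": (np.array([0.0, 0.1, 1.0]), -0.5),   # loud, scary
}

def estimated_value(features: np.ndarray) -> float:
    """Similarity-weighted average of past rewards: a crude stand-in for a
    learned value function that generalizes to unseen stimuli."""
    sims, rewards = [], []
    for feat, reward in experienced.values():
        sim = feat @ features / (np.linalg.norm(feat) * np.linalg.norm(features))
        sims.append(max(sim, 0.0))
        rewards.append(reward)
    sims = np.array(sims)
    return float(sims @ np.array(rewards) / (sims.sum() + 1e-9))

# Cat C was never represented anywhere (it's not in `experienced`), yet it
# scores positively because its features resemble cats A and B.
cat_C = np.array([0.95, 0.95, 0.15])
print(estimated_value(cat_C))  # > 0: "would be liked", though never in the world-model
```

So by “should have been in the ‘like cat’ region” I just mean: the already-learned value function would have scored cat C positively, had it ever been queried on it.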
Similarly, you can conceptualize a single future state of the world in many different ways, e.g. by attending to different aspects of it, and it will thereby become more or less appealing. This can lead to circular preferences; I put an example in this footnote[1].
Could it be that intelligent machines have circular preferences? I understand that this is the case for humans, but I’m curious how nuanced the answer is for machines.
Imperfect data/architecture/training algorithms could lead to weird types of thinking when the system is deployed out-of-distribution (OOD). Do you think it would be helpful to try to measure the coherency of the system’s actions/thoughts? E.g., make datasets that inspect the agent’s theory of mind (I think Beth Barnes suggested something like this). I am unsure about what these metrics would imply for AGI safety.
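To gesture at what I mean by “measure the coherency”, here is a toy metric in Python (my own sketch, not what Beth Barnes actually proposed): present the system with pairwise choices and count how often triples of options come back intransitive. The `query_agent` hook in the usage comment is hypothetical; it stands for however we’d actually elicit a choice.

```python
from itertools import combinations

def incoherence_score(options, prefers) -> float:
    """Fraction of option triples on which the agent's pairwise choices are
    intransitive (it picks A over B, B over C, and yet C over A).

    `prefers(x, y)` should return True iff the agent picks x over y.
    """
    triples = list(combinations(options, 3))
    cyclic = 0
    for a, b, c in triples:
        ab, bc, ca = prefers(a, b), prefers(b, c), prefers(c, a)
        # The three choices form a cycle exactly when they all point the
        # same way around the triangle (all True or all False).
        if ab == bc == ca:
            cyclic += 1
    return cyclic / len(triples) if triples else 0.0

# Hypothetical usage:
#   incoherence_score(["outcome_1", "outcome_2", "outcome_3"], query_agent)
# where query_agent(x, y) wraps however we ask the system to choose between x and y.
```

Intransitive triples are just the simplest signature of circular preferences, which is why I picked them for the sketch.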
Namely: It seems to me that there is not a distinction between instrumental and final preferences baked deeply into brain algorithms. If you think a thought, and your Steering Subsystem endorses it as a high-value thought, I think the computation looks the same if it’s a high-value thought for instrumental reasons, versus a high-value thought for final reasons.
The answer to this should depend on the size of the space that the optimization algorithm searches over.
It could be the case that the space of possible outcomes for final preferences is smaller than that of instrumental ones, and thus we could afford a different optimization algorithm (or variant thereof).
Also, if instrumental and final preferences are indeed mixed together, shouldn’t we have been able to encode, e.g., strategic behavior (a final preference) in RL agents by now?
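For what it’s worth, here’s how I picture the “mixed together” claim in standard RL terms (my framing of it, not necessarily the post’s): in the usual one-step backup Q(s, a) = r(s, a) + γ·V(s′), “final” value enters through the immediate reward and “instrumental” value enters through the discounted value of where the plan leads, but the planner only ever sees their sum:

```python
GAMMA = 0.9  # discount factor

def plan_value(immediate_reward: float, next_state_value: float) -> float:
    """One-step backup: the single scalar a model-based planner assigns to a plan.
    'Final' value enters via the immediate reward; 'instrumental' value enters
    via the discounted value of the state the plan leads to."""
    return immediate_reward + GAMMA * next_state_value

# A plan that is good for its own sake...
print(plan_value(immediate_reward=1.0, next_state_value=0.0))    # 1.0
# ...and a plan that is only good because of where it leads...
print(plan_value(immediate_reward=0.0, next_state_value=1.11))   # ~1.0
# ...get (almost) the same score, so nothing downstream of this number
# can tell the two cases apart.
```

My question above is whether the brain-like setup really collapses the two into one scalar like this, or whether a smaller space of final preferences would let us treat them with a different (perhaps cheaper) optimization procedure.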
I agree that if you haven’t seen something, then it’s not exactly part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your “like” region? (It was somewhere close to things you liked.)
For example, say someone enjoys the affection of cat species A and B. Say they haven’t experienced a cat of species C, which is similar in some ways to species A and B. Then they would probably get a positive reward from meeting cat C (affection), even though their world-model didn’t include it beforehand. Therefore, they should tell us afterwards that, in their previous world-model, cat C should have been in the “like cat” region.
Suppose at time t=1 they are completely oblivious to the possible existence or idea of cat C, and at time t=2 they meet cat C and are very happy about it.
We agree that they like cat C at time t=2.
What about at time t=1? I would say “they neither like nor dislike cat C”. I would also say “they would like cat C, if only the thought of cat C occurred to them”.
I think you want to say that they actually already like cat C at t=1. But I don’t think that’s in accordance with common usage of the term “like”. For example, go ask someone on the street: “A year before you first met your current boyfriend (or first saw him, or first became aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.
Could it be that intelligent machines have circular preferences? I understand that this is the case for humans, but I’m curious how nuanced the answer is for machines.
Yeah, I for one certainly expect intelligent machines to have circular preferences.
That said, when smart humans notice that they have circular preferences, they tend to adjust their preferences to straighten them out. I assume that AGIs will have the same tendency, and thus that they will have fewer and fewer circular preferences as they learn and think more. (Or perhaps, they’ll have circular preferences that are harder and harder to notice.)
Here’s why I think humans tend to straighten out circular preferences: You can (and naturally do) have a preference “Insofar as my other preferences are self-contradictory, I should try to reduce that aspect of them”, because this is roughly a Pareto-improving thing to do. All of my preferences about future states can be better-actualized simultaneously when I adopt the habit of “noticing when two of my preferences are working at cross-purposes, and when I recognize that happening, preventing them from doing so”. So you gradually build up a bunch of new habits that look for various types of situations that pattern-match to “I’m working at cross-purposes to myself”, and then execute a Pareto improvement—since these habits are by default positively reinforced. It’s loosely analogous to how markets become more self-consistent when a bunch of people are scouting out for arbitrage opportunities, I think.
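As a cartoon of that “notice and straighten out” habit (just a toy sketch for illustration, nothing to do with how actual brain algorithms would implement it): represent pairwise preferences as weighted edges, look for a cycle, and let the weakest preference in the cycle give way. That’s the rough Pareto flavor of the repair, since only the preference you care about least is dropped.

```python
# Toy preferences: (preferred, dispreferred, strength). Note the A > B > C > A cycle.
prefs = [("A", "B", 0.9), ("B", "C", 0.7), ("C", "A", 0.2)]

def find_cycle(prefs):
    """Return the edges of a 3-cycle among the pairwise preferences, if any."""
    edges = {(p, q): w for p, q, w in prefs}
    for (p, q), w1 in edges.items():
        for (q2, r), w2 in edges.items():
            if q2 == q and (r, p) in edges:
                return [(p, q, w1), (q, r, w2), (r, p, edges[(r, p)])]
    return None

def straighten(prefs):
    """Repeatedly drop the weakest edge of any 3-cycle found."""
    prefs = list(prefs)
    while (cycle := find_cycle(prefs)) is not None:
        weakest = min(cycle, key=lambda e: e[2])  # the preference cared about least
        prefs.remove(weakest)
    return prefs

print(straighten(prefs))  # the weak C > A preference gives way; the cycle is gone
```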
Do you think it would be helpful to try to measure the coherency of the system’s actions/thoughts? E.g., make datasets that inspect the agent’s theory of mind (I think Beth Barnes suggested something like this).
I don’t immediately see why “coherency” would be important to measure for safety purposes, but I dunno, maybe. Measuring theory of mind seems potentially safety-relevant insofar as maybe we want to try to make AGIs that are bad at theory of mind, so that they don’t know how to deceive humans even if they were motivated to. However, I don’t know how you would do that, while still enabling the AGI to do the things we need it to do. Anyway, no strong opinion either way.
if instrumental and final preferences are indeed mixed together, shouldn’t we have been able to encode, e.g., strategic behavior (a final preference) in RL agents by now?
It’s true that model-based RL algorithms exist today on GitHub & arXiv. But I think there’s a big space of all possible model-based RL algorithms, and I think that there are still important differences between the model-based RL algorithms currently on GitHub & arXiv, versus the model-based RL algorithm in the brain. I won’t spell out my thoughts on that, for Differential Technological Development reasons. No one really knows all the details anyway.
That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.
“A year before you first met your current boyfriend (or first saw him, or first became aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.
Okay, now I get the point of “neither like nor dislike” in your original statement.
I was originally thinking of something like the following: “A year before you met your current boyfriend, would you have thought he was cute, if he was your type?” But “your type” requires having seen them to get a reference point for whether they belong in that class or not. So that was a circular statement of my own that needed straightening out; you had a good point here.
That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.
I would say the strategic behavior AlphaZero exhibits is weak (still incredible, especially the kind of weird h4 luft lines that the latest supercomputers show). I was thinking of a stronger version dealing with multi-agent environments, continuous state/action spaces, and/or multi-objective reward functions. That said, it seems to me that a different problem has to be solved to get there.