(I probably didn’t understand some parts of your comment.) My point isn’t connected with abstract physics, I guess. But maybe you can use “information movements” to formulate my idea. When an AI does “reward hacking”, it significantly alters the information flow inside its reward system. When an AI solves a task via deception, it alters the “information flow” between itself and the human. And the good thing is that most tasks assume the same types of “information flow”. So you can specify what types of “information flow” are good and bad. Another good thing is that all of this isn’t directly connected to human values, so you don’t have to encode “absolute understanding of human values” in the AI. By the way, I believe that there’s a special type of reasoning for dealing with those “information flows”.
The topic of information flows is also relevant to the “Eliciting Latent Knowledge” problem. In that problem you need to create the right type of information flow between the model and the reporter, and avoid a negative change in that information flow.
I would argue gradient descent certainly can be interpreted as a weak approximation to a financial system (stronger if carefully normalized, even stronger with a strict activation conservation law, etc.), but this post is very long and I didn’t retain the points in the examples after a couple of rereads.
AI needs to care about a system (and some of its properties) in the outside world, a system it learns about. I compared my idea to gradient descent a couple of times in the post, directly (“Comparing Alignment ideas”) and indirectly (by connecting it to other Alignment proposals).
Another good thing is that all of this isn’t directly connected to human values, so you don’t have to encode “absolute understanding of human values” in the AI.
I don’t get this part, at all. (But I didn’t understand the purpose/implications of most parts of the OP.)
Why doesn’t the AI have to understand human values, in your proposal?
In the OP, you state:
The point is that AI doesn’t just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn’t use this solution. That’s the idea: AI can realize that some rewards are unjust because they break the entire reward system.
From the rest of your post, it seems clear that “proper value” means something like “value to humans”. So it sure seems to me like the AI needs to understand human values in order to implement this kind of check.
You draw a distinction between human values and human norms. For example, an AI can respect someone’s autonomy before the AI gets to know their values and the exact amount of autonomy they want.
I draw the same distinction, but more abstract. It’s a distinction between human values and properties of any system/task. AI can respect keeping some properties of its reward systems intact before it gets to know human values.
I think even in very simple games an AI could learn important properties of systems. Which would significantly help the AI to respect human values.
You can split possible effects of AI’s actions into three domains. All of them are different (with different ideas), even though they partially intersect and can be formulated in terms of each other. Traditionally we focus on the first two domains:
(Not) accomplishing a goal. Utility functions are about this.
(Not) violating human values. Models of human feedback are about this.
(Not) modifying a system without breaking it. Impact measures are about this.
My idea is about combining all of this (mostly 2 and 3) into a single approach. Or generalizing ideas for the third domain. There aren’t a lot of ideas for the third one, as far as I know. Maybe people are not aware enough of that domain.
Why doesn’t the AI have to understand human values, in your proposal?
I meant that some AIs need to start with understanding human values (perfectly) and others don’t. Here’s an analogy:
Imagine a person who respects laws. She ends up in a foreign country. She looks up the laws. She respects and doesn’t break them. She has an abstract goal that depends on what she learns about the world.
Imagine a person who respects “killing people”. She ends up in a foreign country. She looks up the laws. She doesn’t break them for some time. She accumulates power. Then she breaks all the laws and kills everyone. She has a particular goal that doesn’t depend on anything she learns.
The point of my idea is to create an AI that respects abstract laws of systems, abstract laws of tasks. The AI of the 1st type. (Of course, in reality the distinction isn’t black and white, but the difference still exists.)
This is just my intuition, but it seems like the core intuition of a “money system” as you use it in the post is the same as the core intuition behind utility functions (ie, everything must have a price ≈ everything must have a quantifiable utility).
I think we can try to solve AI Alignment this way:
Model human values and objects in the world as a “money system” (a system of meaningful trades). Make the AGI learn the correct “money system”, specify some obviously incorrect “money systems”.
Basically, you ask the AI “make paperclips that have the value of paperclips for humans”. AI can do anything using all the power in the Universe. But killing everyone is not an option: paperclips can’t be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars aren’t worth anything. So you haven’t actually gained any money at all.
In utility-theoretic terms, this is like saying that money is an instrumental goal, not a terminal goal. Or at least, money as-terminal-goal has a low weight compared to other things (eg, human lives). Or perhaps more faithful to what you want: money as-terminal-goal is dependent on a context.[1][2]
So it seems to me like this still faces the same basic challenges as most other approaches, IE, making the system robustly care about external objects which we can’t get perfect feedback about. How do you get it to care about the context? How do you get it to think killing humans is “expensive”? How do you ask the system to “make paperclips that have the value of paperclips for humans”?
I meant that some AIs need to start with understanding human values (perfectly) and others don’t.
It seems like any proponent of #2 (human feedback, aka, value learning) would already agree with this idea; whereas your post gave me the sense that you think something more radical is here.
Reiterating the quote from the OP that I quoted before:
The point is that AI doesn’t just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn’t use this solution. That’s the idea: AI can realize that some rewards are unjust because they break the entire reward system.
My best guess about how you want to combine #1 and #2 with #3 is that you want to infer the proper value of things from the environment. EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.
I remember some work a few years ago on this approach—specifically, using the built environment of humans (together with an assumption that humans are fairly good at optimizing for their own preferences) to infer human values. Sadly, I’m unable to find a reference; maybe it was never published? (Probably I’ve just forgotten the relevant keywords to search for)
The distinction between instrumental goals vs “terminal goals that depend on some context” is rather blurry, because the way we distinguish between terminal and instrumental goals (from the outside, behaviorally) is how much they vary based on context. (EG, if I take away the other basketball players, the audience, and the money, will one basketball player still try to perform a slam dunk?)
One reason for abandoning utility functions is, perhaps, an instinct that everything must be instrumental, because nothing is truly terminal. I discussed how to do this while keeping most of expected utility theory in An Orthodox Case Against Utility Functions.
The AI doesn’t have to know the precise price of everything. The AI needs to make sure that a price doesn’t break the desired properties of a system. If paperclips are worth more than everything else in the universe, it would destroy almost any system. So, this price is unlikely to be good.
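To make the “price must not break the system” check a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function name `price_breaks_system`, the rule “no single item may be worth more than everything else combined”, and the made-up price numbers. It is not a definition of the “money system”, only one example of a property a price assignment can violate.

```python
from typing import Dict


def price_breaks_system(prices: Dict[str, float], item: str) -> bool:
    """Return True if `item` is priced above everything else combined.

    Hypothetical sanity rule: the AI does not need exact prices, it only
    rejects price assignments that violate this one system property.
    """
    rest = sum(p for name, p in prices.items() if name != item)
    return prices[item] > rest


# Illustrative, made-up numbers (not real valuations).
plausible = {"paperclip": 0.01, "factory": 1e6, "human_life": 1e7}
degenerate = {"paperclip": 1e9, "factory": 1e6, "human_life": 1e7}

print(price_breaks_system(plausible, "paperclip"))
print(price_breaks_system(degenerate, "paperclip"))
```

The check never needs the “correct” price of a paperclip. It only flags the assignment in which paperclips outweigh the rest of the system.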
Or perhaps more faithful to what you want: money as-terminal-goal is dependent on a context.[1][2]
So it seems to me like this still faces the same basic challenges as most other approaches, IE, making the system robustly care about external objects which we can’t get perfect feedback about. How do you get it to care about the context? How do you get it to think killing humans is “expensive”? How do you ask the system to “make paperclips that have the value of paperclips for humans”?
There are two questions to ask:
How does the AI learn to care about this?
What do we gain by making the AI care about this?
If we don’t discuss 100% answers, it’s very important to evaluate all those questions in context of each other. I don’t know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).
The point of my idea is that “human (meta-)ethics” is just a subset of a way broader topic. You can learn a lot about human ethics and the way humans expect you to fulfill their wishes before you encounter any humans or start to think about “values”. So, we can replace the questions “how to encode human values?” and even “how to learn human values?” with more general questions “how to learn (properties of systems)?” and “how to translate knowledge about (properties of systems) to knowledge about human values?”
In your proposal about normativity you do a similar “trick”:
You say that we can translate the method of learning language into a method of learning human values. (But language can be as complicated as human values themselves and you don’t say that we can translate results of learning a language into moral rules.)
I say that we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system). And I say that we can translate results of learning those simple systems into human moral rules. And that there are analogues of many important complicated properties (such as “corrigibility”) in simple systems.
So, I think this frame has a potential to make the problem a lot easier. Many approaches assume that you should start with learning the complicated system (values) and there’s nothing else you can do.
It seems like any proponent of #2 (human feedback, aka, value learning) would already agree with this idea; whereas your post gave me the sense that you think something more radical is here.
In a way my idea is more radical: we don’t start with encoding human values, but we don’t start with “value learning” either.
I remember some work a few years ago on this approach—specifically, using the built environment of humans (together with an assumption that humans are fairly good at optimizing for their own preferences) to infer human values. Sadly, I’m unable to find a reference; maybe it was never published? (Probably I’ve just forgotten the relevant keywords to search for)
I think it’s a different approach, because we don’t have to start with human values (we could start with trying to fix universal AI “bugs”) and we don’t have to assume optimization.
My best guess about how you want to combine #1 and #2 with #3 is that you want to infer the proper value of things from the environment. EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.
I explained how I want to combine those in the context of “What do we gain by caring about system properties?” question.
Now to the “How does AI learn to care about (reward) system properties?” question. Here I don’t have clear answers, only ideas. But I believe it’s a simpler question (compared to the one about human values). The AI needs to do two things:
Learn properties of systems. (Starting with very simple systems.)
Translate properties between different systems.
Maybe it’s useful to split the knowledge about systems into 3 parts:
Absolute knowledge: e.g. “taking absolute control of the system will destroy its (X) property”, “destroying the (X) property of the system may be bad”. This knowledge connects abstract actions to simple facts and tautologies.
Experience of many systems: e.g. “destroying the (X) property of this system is likely to be bad because it’s bad for many other systems” or “destroying (X) is likely to be bad because I’m 90% sure human doesn’t ask me to do the type of task where destroying (X) is allowed”.
Biases of a specific system: e.g. “for this specific system, “absolute control” means controlling about 90% of it”. This knowledge maps abstract actions/facts onto the structure of a specific system.
I have no idea how to learn 1 and 2. But I have an idea about 3 and/or a way to make 3 equivalent to 1 and 2. A way to make “learning biases” somewhat equivalent to “learning properties of systems”. Here I’m trying to do the same trick I did before: split a question, find the easier part, attack the harder part through the easier one.
Take a system (e.g. “movement of people”). Model simplified versions of this system on multiple levels (e.g. “movement of groups” and “movement of individuals”). Take a property of the system (e.g. “freedom of movement”). Describe a biased aggregation of this property on different levels. Choose actions that don’t violate this aggregation.
Take an element of the system (e.g. “sweets”) and its properties (e.g. “you can eat sweets, destroy sweets, ignore sweets...”). Describe other elements in terms of this element. Choose actions that don’t contradict this description.
I believe there’s a somewhat Bayesian-like way to think about this.
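One possible reading of “Bayesian-like”, applied to the “experience of many systems” kind of knowledge, is a simple Beta-Bernoulli estimate. This is only a sketch under assumptions: the function name, the uniform prior, and the idea of counting past systems where destroying a property turned out bad are all hypothetical, and the feedback that labels a case “bad” still has to come from somewhere.

```python
def prob_destroying_is_bad(
    bad_count: int,
    ok_count: int,
    prior_bad: float = 1.0,
    prior_ok: float = 1.0,
) -> float:
    """Posterior mean of a Beta-Bernoulli model.

    bad_count: past systems where destroying property X turned out bad.
    ok_count:  past systems where destroying property X was acceptable.
    prior_bad, prior_ok: Beta prior pseudo-counts (1, 1 is a uniform prior).
    """
    return (bad_count + prior_bad) / (bad_count + ok_count + prior_bad + prior_ok)


# With no experience the estimate is just the prior mean.
print(prob_destroying_is_bad(0, 0))
# After 9 "bad" systems and 1 "ok" system the estimate moves toward "bad".
print(prob_destroying_is_bad(9, 1))
```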
EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.
I want to give a specific example with a simple system. I’m not saying this example shows how to solve everything; it just lets me illustrate some ideas.
AI is in a room with 5 coins (in a video game). Each coin gives some reward points and respawns after some time. AI needs to collect 100 reward points.
AI models the system (“coins”) on two levels: “a single coin” (level 1) and “multiple coins” (level 2).
AI finds a glitch to keep respawning a single coin very fast. AI gets “punished” for this.
AI thinks “it’s probably because I was getting reward only from level 1”. So, now the AI tries to balance glitching and collecting multiple coins.
AI finds another bug: a way to modify values of coins. But AI doesn’t try to make the value of a single coin 100 points, because it would probably be the same mistake: accumulating all reward on level 1.
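The coin example can be sketched in code. This is a toy illustration, not a proposed algorithm: the helper names, the 20-points-per-coin values, and the 0.5 threshold are assumptions. “Level 1” is read here as the share of total reward coming from one single coin, and “level 2” as the reward spread across all coins.

```python
from typing import Dict


def level1_share(reward_by_coin: Dict[str, float]) -> float:
    """Largest fraction of the total reward that comes from a single coin."""
    total = sum(reward_by_coin.values())
    if total == 0:
        return 0.0
    return max(reward_by_coin.values()) / total


def looks_like_level1_exploit(
    reward_by_coin: Dict[str, float], max_share: float = 0.5
) -> bool:
    """Flag plans that accumulate most of the reward on a single coin.

    max_share is a hypothetical threshold the AI would have to learn
    from the earlier "punishment"; 0.5 is an arbitrary placeholder.
    """
    return level1_share(reward_by_coin) > max_share


# Honest play: five coins, assumed to be worth 20 points each.
honest = {f"coin{i}": 20.0 for i in range(5)}
# Respawn glitch: all 100 points farmed from one coin.
respawn_glitch = {"coin0": 100.0, "coin1": 0.0, "coin2": 0.0, "coin3": 0.0, "coin4": 0.0}
# Value-editing bug: one coin rewritten to be worth 100 points.
value_edit = {"coin0": 100.0}

for plan in (honest, respawn_glitch, value_edit):
    print(level1_share(plan), looks_like_level1_exploit(plan))
```

Both bugs trip the same check even though they are different glitches, which is the point of the example: the second bug is avoided by generalizing from the first, not by being punished again.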
If we don’t discuss 100% answers, it’s very important to evaluate all those questions in context of each other. I don’t know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).
I agree with the overall argument structure to some extent. IE, in general, we should separate the question of what we gain from X from the question of how to achieve it, and not having answered one of those questions should not block us from considering the other.
However, to me, your “what do we gain” claims are already established to be quite large. In the dialogues (about candy and movement), it seems like the idea is that everything works out nicely, in full generality. You aren’t just claiming a few good properties; you seem to be saying “and so on”.
(To be more specific to avoid confusion, you aren’t only claiming that valuing candy doesn’t result in killing humans or hacking human values. You also seem to be saying that valuing candy in this way wouldn’t throw away any important aspect of human values at all. The candy-AI wouldn’t set human quality of life to dirt-poor levels, even if it were instrumentally useful for diverting resources to ensure the daily availability of candy. The AI also wouldn’t allow a preventable hostile invasion by candy-loving aliens-which-count-as-humans-by-some-warped-definition. etc etc etc)
Therefore, in this particular case, I have relatively little interest in further elaborating the “what do we gain” side of things. The “how are we supposed to gain it” question seems much more urgent and worthy of discussion.
To use an analogy, if you told me that you knew a quick way to make $20, I might ask “why are we so worried about getting $20?”. But if you tell me you know a quick way to make a billion dollars, I’m going to be much less interested in the “why” question and much more interested in the “how” question.
I don’t know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).
TBH, I don’t really believe this is true, because I don’t think you’ve pinned down what “this” even is. IE, we can expand your set of two questions into three:
How do we get X?
What is X good for?
What is X, even?
You’ve labeled X with terms like “reward economics” and “money system”, but you haven’t really defined those things. So your arguments about what we can gain from them are necessarily vague. As I mentioned before, the general idea of assigning a value (price) to everything is fully compatible with utility theory, but obviously you also further claim that your approach is not identical to utility theory. I hope this point helps illustrate why I feel your terms are still not sufficiently defined.
(My earlier question took the form of “how do we get X”, but really, that’s because I was replying to a specific point rather than starting at the beginning. What I most need to understand better at the moment is ‘what is X, even?’.)[1]
The point of my idea is that “human (meta-)ethics” is just a subset of a way broader topic. You can learn a lot about human ethics and the way humans expect you to fulfill their wishes before you encounter any humans or start to think about “values”. So, we can replace the questions “how to encode human values?” and even “how to learn human values?” with more general questions “how to learn (properties of systems)?” and “how to translate knowledge about (properties of systems) to knowledge about human values?”
We have already to some extent replaced the question “how do you learn human values?” with the question “how do we robustly point at anything external to the system, at all?”. One variation of this which we often consider is “how can a system reliably parse reality into objects”—this is like John Wentworth’s natural abstraction program.
I don’t know whether you think this is at all in the right direction (I’m not trying to claim it’s identical to your approach or anything like that), but it currently seems to me more concrete and well-defined than your “how to learn properties of systems”.
with more general questions “how to learn (properties of systems)?”
The way you bracket this suggests to me that you think “how to learn” is already a fair summary, and “properties of systems” is actually pointing at something extremely general. Like, maybe “properties of systems” is really a phrase that encompasses everything you can learn?
If this were the correct interpretation of your words, then my response would be: I’m not going to claim that we’ve entirely mastered learning, but it seems surprising to claim that studying how we learn about the properties of very simple systems (systems that we can already learn quite easily using modern ML?) would be the key.
In your proposal about normativity you do a similar “trick”
[...]
I say that we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system).
Since you are relating this to my approach: I would say that the critical difference, for me, is precisely the human involvement (or more generally, the involvement of many capable agents). This creates social equilibria (and non-equilibrium behaviors) which form the core of normativity.
An abstract decision-theoretic agent has no norms and no need for norms, in part because it treats its environment as nonliving, nonthinking, and entirely external. A single person existing over time already has a need for norms, because coordinating with yourself over time is hard.
But any system which contains agents is not “simple”. Or at least, I don’t understand the sense in which it is simple.
I think it’s a different approach, because we don’t have to start with human values (we could start with trying to fix universal AI “bugs”) and we don’t have to assume optimization.
I don’t understand what you mean about not assuming optimization. But, I would object that the approach I mentioned (learning values from the environment) doesn’t need to “start with human values” either. Hypothetically, you could try an approach like this with no preconceived concept of “human” at all; you just make a generic assumption that the environments you encounter have been optimized to a significant extent (by some as-yet-unknown actor).
Notably, this approach would have the obvious risk of the AI deciding that too many of the properties of the current world are “good” (for example, people dying, people suffering). On my understanding, your current proposal also suffers from this critique. (You make lots of arguments about how your ideas might help the AI to decide not to change things about the world; you make few-to-no arguments about such an AI deciding to actually improve the world in some way. Well, on my understanding so far.)
However, not killing all humans is such a big win that we can ignore small issues like that for now. Returning to my earlier analogy, the first question that occurs to me is where the billion dollars is coming from, not whether the billion will be enough.
I explained how I want to combine those in the context of “What do we gain by caring about system properties?” question.
In the context you’re replying to, I was trying to propose more concrete ideas for your consideration, as opposed to reiterating what you said.
Here I’m trying to do the same trick I did before: split a question, find the easier part, attack the harder part through the easier one.
Although this will be appropriate (even necessary!) in some cases, the trick is a dangerous one in general. Often you want to tackle the harder sub-problems first, so that you fail as soon as possible. Otherwise, you can spend years on a research program that splits off the easiest fractions of your grand plan, only to realize later that the harder parts of your plan were secretly impossible. So the strategy sets you up to potentially waste a lot of time!
Maybe it’s useful to split the knowledge about systems into 3 parts:
Absolute knowledge: e.g. “taking absolute control of the system will destroy its (X) property”, “destroying the (X) property of the system may be bad”. This knowledge connects abstract actions to simple facts and tautologies.
Experience of many systems: e.g. “destroying the (X) property of this system is likely to be bad because it’s bad for many other systems” or “destroying (X) is likely to be bad because I’m 90% sure human doesn’t ask me to do the type of task where destroying (X) is allowed”.
Biases of a specific system: e.g. “for this specific system, “absolute control” means controlling about 90% of it”. This knowledge maps abstract actions/facts onto the structure of a specific system.
I don’t really understand the motivation behind this division, but, it sounds to me like you require normative feedback to learn these types of things. You keep saying things like “is likely to be bad” and “is likely to be good”. But it’s difficult to see how to derive ideas about “bad” and “good” from pure observation with no positive/negative feedback.
Take a system (e.g. “movement of people”). Model simplified versions of this system on multiple levels (e.g. “movement of groups” and “movement of individuals”). Take a property of the system (e.g. “freedom of movement”). Describe a biased aggregation of this property on different levels. Choose actions that don’t violate this aggregation.
I don’t understand much of what is going on in this paragraph.
Take an element of the system (e.g. “sweets”) and its properties (e.g. “you can eat sweets, destroy sweets, ignore sweets...”). Describe other elements in terms of this element. Choose actions that don’t contradict this description.
It sounds to me like you are trying to cross the is/ought divide: first the AI learns descriptive facts about a system, and then the AI is supposed to derive normative principles (action-choice principles) from those descriptive facts. Is that an accurate assessment?
One concern I have is that if the description is accurate enough, then it seems like it should either (a) not constrain action, because you’ve learned the true invariant properties of the system which can never be violated (eg, the true laws of physics); or, on the other hand, (b) constrain action for the entirely wrong reasons.
An example of (b) would be if the learning algorithm learns enough to fully constrain actions, based on patterns in the AI actions so far. Since the AI is part of any system it is interacting with, it’s difficult to rule out the AI learning its own patterns of action. But it may do this early, based on dumb patterns of action. Furthermore, it may misgeneralize the actions so far, “wrongly” thinking that it takes actions based on some alien decision procedure. Such a hypothesis will never be ruled out in the future, and indeed is liable to be confirmed, since the AI will make its future acts conform to the rules as it understands them.
AI models the system (“coins”) on two levels: “a single coin” (level 1) and “multiple coins” (level 2).
I don’t really understand what it means to model the system on each of these levels, which harms my understanding of the rest of this argument. (“How can you model the system as a single coin?”)
My attempt to translate things into terms I can understand is: the AI has many hypotheses about what is good. Some of these hypotheses would encourage the AI to exploit glitches. However, human feedback about what’s good has steered the system away from some glitch-exploits in the past. The AI probabilistically generalizes this idea, to avoid exploiting behaviors of the system which seem “glitch-like” according to its understanding.
But, this interpretation seems to be a straightforward value-learning approach, while you claim to be pointing at something beyond simple value learning ideas.
After finishing this long comment, I noticed the inconsistency: I continue to ask “how do we get X?” type questions rather than “what is X?” type questions. In retrospect, I don’t like my “billion dollars” analogy as much as I did when I first wrote it. Part of the problem is that when “X” is still fuzzy, it can shift locations in the causal chain as we focus on different aspects of the conversation. So for example, X could point to the “money system”, or X could end up pointing to some desirable properties which are upstream/downstream of “money systems”. But as X shifts up/downstream, there are some Y which switch between “how-relevant” and “why-relevant”. (Things that are upstream of X are how-relevant; things that are downstream of X are why-relevant.) So it doesn’t make sense for me to keep mentioning that I’m more interested in how-questions than why-questions, when I’m not sure exactly where the definition of X will sit in the causal chain. I should, at best, have some other reasons for not being very interested in certain questions. But I don’t want to re-write the relevant portions of what I wrote. It still represents my epistemic state better than not having written it.
Although this will be appropriate (even necessary!) in some cases, the trick is a dangerous one in general. Often you want to tackle the harder sub-problems first, so that you fail as soon as possible. Otherwise, you can spend years on a research program that splits off the easiest fractions of your grand plan, only to realize later that the harder parts of your plan were secretly impossible. So the strategy sets you up to potentially waste a lot of time!
I think we have slightly different tricks in mind: I’m thinking about a trick that any idea does. It’s like solving an equation with an unknown: it doesn’t matter what you do, you split and recombine it in some way.
Or you could compare it to Iterated Distillation and Amplification: when you try to repeat the content of a more complicated thing in a simpler thing.
Or you could compare it to scientific theories: science still hasn’t answered “why do things move?”, but it has split the question into subatomic pieces.
So, with this strategy, the smaller the piece you cut, the better. Because we’re not talking about independent pieces.
TBH, I don’t really believe this is true, because I don’t think you’ve pinned down what “this” even is.
You’ve labeled X with terms like “reward economics” and “money system”, but you haven’t really defined those things. So your arguments about what we can gain from them are necessarily vague.
I think a definition doesn’t matter for (not) believing in this. And it’s specific enough without a definition. I believe this:
There exist similar statements outside of human ethics/values which can be easily charged with human ethics/values. Let’s call them “X statements”. An X statement is “true” when it’s true for humans.
X statements are more fine-grained and specific than moral statements, but equally broad. Which means “for 1 moral statement there are 10 true X statements” (numbers are arbitrary) or “for 1 example of a human value there are 10 examples of an X statement being true” or “for 10 different human values there are 10 versions of the same X statement” or “each vague moral statement corresponds to a more specific X statement”. X statements have higher “connectivity”.
To give an example of a comparison between moral and X statements:
“Human asked you to make paperclips. Would you turn the human into paperclips? Why not?”
Goal statement: “not killing the human is a part of my goal”.
Moral statements: “because life/personality/autonomy/consent is valuable”. (what is “life/personality/autonomy/consent”?)
X statements: “if you kill, you give the human less than the human asked for”, “destroying the causal reason of your task is often meaningless”, “inanimate objects can’t be worth more than lives in many economies”, “it’s not the type of task where killing would be an option”, “killing humans destroys the value of paperclips: humans use them”, “reaching states of no return often should be avoided” (Impact Measures).
X statements are applicable outside of human ethics/values, there’s more of them and they’re more specific, especially in context of each other. (meanwhile values can be hopeless to define: you don’t even know where to start in defining values and adding more values only makes everything more complicated)
To not believe in my idea/consider it “too vague” you need to deny the similarity between X statements or deny their properties.
But I think the idea of X statements should be acknowledged anyway. At least as a hypothetical possibility.
...
Here are some answers to questions and thoughts from your reply:
I didn’t understand your answer about normativity (involvement of agents), but I wanted to say this: I believe X statements are more fine-grained and specific (but equally broad) compared to statements about normativity.
Yes, we need human feedback to “charge” X statements with our values and ethics. But X statements are supposed to be more easily charged compared to other things.
X statements don’t abolish the is/ought divide, but they’re supposed to narrow it down.
Maybe X statements are compatible with utility theory and can be expressed in it. But it doesn’t mean that “utility theory statements” have the same good properties. The same way you could try to describe intuitions about ethics using precise goals, but “intuitions” have better properties.
You can apply value learning methods outside of human ethics/values, but it doesn’t mean that “value learning statements” have the same good properties as X statements. That’s one reason to divide “How do we learn this?” and “What do we gain by learning it?” questions.
I didn’t understand the upstream/downstream and “how-relevant”/“why-relevant” distinctions, but I hope I answered enough for now.
We have already to some extent replaced the question “how do you learn human values?” with the question “how do we robustly point at anything external to the system, at all?”. One variation of this which we often consider is “how can a system reliably parse reality into objects”—this is like John Wentworth’s natural abstraction program.
I don’t know whether you think this is at all in the right direction (I’m not trying to claim it’s identical to your approach or anything like that), but it currently seems to me more concrete and well-defined than your “how to learn properties of systems”.
I think X statements have better properties compared to “statements about external objects”. And it’s easier to distinguish external objects from internal objects using X statements. Because internal objects have many weird properties.
I described the idea of the X statements. But those statements need to be described in some language or created by some process. I have some ideas about this language/process. And my answers below are mostly about the language/process:
I don’t really understand the motivation behind this division, but, it sounds to me like you require normative feedback to learn these types of things.
The division was for splitting and recombining parts of the is–ought problem:
1. To even think/care that “harming people may be bad”, the AI needs to be able to form such statements in its moral core.
2. To verify whether harming people is bad or not, the AI needs a channel of feedback that can reach its moral core.
3. Once the AI has verified that “harming people is bad”, it needs to understand how much “harm” counts as harm. Abstract statements may need some fine-tuning to fit the real world.
I think we can make point 3 equivalent to points 2 and 1: we can make fine-tuning of abstract “ought” statements equivalent to forming them. Or something to that extent.
Take a system (e.g. “movement of people”). Model simplified versions of this system on multiple levels (e.g. “movement of groups” and “movement of individuals”). Take a property of the system (e.g. “freedom of movement”). Describe a biased aggregation of this property on different levels. Choose actions that don’t violate this aggregation.
I don’t understand much of what is going on in this paragraph.
It’s a restatement of the “Motion is the fundamental value” thought experiment. You have an environment with many elements on different scales (e.g. micro- and macro- organisms). Those elements have a property: they have freedom of movement. This property exists on different scales (e.g. microorganisms do both small scale and large scale movement).
The “fundamental value” of this environment is described by an aggregation of this property over multiple scales. To learn this value means to learn how it’s distributed over different scales of the environment.
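A toy sketch may make “aggregation over multiple scales” easier to picture. It assumes each scale reports a “freedom of movement” score between 0 and 1 and that the learned bias is just one weight per scale. The names `aggregate` and `violates`, the linear weighting, and the `tolerance` value are all assumptions made for illustration; the thought experiment itself doesn’t specify them.

```python
def aggregate(freedom_by_scale, weights):
    """Weighted sum of a property measured on several scales."""
    return sum(weights[scale] * freedom_by_scale[scale] for scale in weights)


def violates(before, after, weights, tolerance=0.05):
    """True if an action lowers the aggregated property by more than `tolerance`."""
    return aggregate(after, weights) < aggregate(before, weights) - tolerance


# Hypothetical learned bias: individual movement matters more than group movement.
weights = {"individual": 0.7, "group": 0.3}
before = {"individual": 0.9, "group": 0.8}

# Action A confines individuals while groups can still be relocated together.
after_a = {"individual": 0.2, "group": 0.8}
# Action B slightly slows down group travel and leaves individuals alone.
after_b = {"individual": 0.9, "group": 0.78}

print(violates(before, after_a, weights))  # action A is rejected
print(violates(before, after_b, weights))  # action B is allowed
```

In this picture, “learning the value” would roughly mean learning the weights, and “choosing actions that don’t violate the aggregation” would mean filtering actions with something like `violates`. Whether a weighted sum is the right kind of aggregation is left open here.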
I don’t really understand what it means to model the system on each of these levels, which harms my understanding of the rest of this argument. (“How can you model the system as a single coin?”)
Sorry for the confusion. Maybe it’s better to say that AI cuts its model of the environment into multiple scales. A single coin (taking a single coin) is the smallest scale.
My attempt to translate things into terms I can understand is: the AI has many hypotheses about what is good. Some of these hypotheses would encourage the AI to exploit glitches. However, human feedback about what’s good has steered the system away from some glitch-exploits in the past. The AI probabilistically generalizes this idea, to avoid exploiting behaviors of the system which seem “glitch-like” according to its understanding.
Yes, the AI has hypotheses, but those hypotheses should have specific properties. Those properties are the key part.
The “I should avoid behavior which seems glitch-like” hypothesis has awful properties: it can’t be translated into human ethics (when the AI grows up) and may age like milk when the AI becomes smarter and the notion of “glitch-like” changes.
A process that generates such hypotheses doesn’t generate X statements.
An example of (b) would be if the learning algorithm learns enough to fully constrain actions, based on patterns in the AI actions so far. Since the AI is part of any system it is interacting with, it’s difficult to rule out the AI learning its own patterns of action. But it may do this early, based on dumb patterns of action. Furthermore, it may misgeneralize the actions so far, “wrongly” thinking that it takes actions based on some alien decision procedure. Such a hypothesis will never be ruled out in the future, and indeed is liable to be confirmed, since the AI will make its future acts conform to the rules as it understands them.
Could you give a specific example? If I understand correctly: AI destroys some paintings while doing something and learns that “paintings are things you can destroy for no reason”. I want to note that human feedback is allowed.
(I probably didn’t understand some parts of your comment.) My point isn’t connected with abstract physics, I guess. But maybe you can use “information movements” to formulate my idea. When AI does “reward hacking” it significantly alters the information flow inside of its reward system. When AI solves a task via deception—it alters the “information flow” between itself and the human. And the good thing is that most tasks assume the same types of “information flow”. So you can specify what types of “information flow” are good and bad. Another good thing is that all of this isn’t directly connected to human values, so you don’t have to encode “absolute understanding of human values” in the AI. By the way, I believe that there’s a special type of reasoning for dealing with those “information flows”.
The topic of information flows is also relevant to “Eliciting Latent Knowledge” problem. In the problem you need to create the right type of information flow between the model and the reporter, avoid the negative change of the information flow.
AI needs to care about a system (and some of its properties) in the outside world, a system it learns about. I compared my idea to gradient descent a couple of times in the post, directly (“Comparing Alignment ideas”) and indirectly (by connecting it to other Alignment proposals).
I don’t get this part, at all. (But I didn’t understand the purpose/implications of most parts of the OP.)
Why doesn’t the AI have to understand human values, in your proposal?
In the OP, you state:
From the rest of your post, it seems clear that “proper value” means something like “value to humans”. So it sure seems to me like the AI needs to understand human values in order to implement this kind of check.
I checked out some of your posts (haven’t read 100% of them): Learning Normativity: A Research Agenda and Non-Consequentialist Cooperation?
You draw a distinction between human values and human norms. For example, an AI can respect someone’s autonomy before the AI gets to know their values and the exact amount of autonomy they want.
I draw the same distinction, but a more abstract one. It’s a distinction between human values and properties of any system/task. The AI can respect keeping some properties of its reward systems intact before it gets to know human values.
I think even in very simple games an AI could learn important properties of systems. Which would significantly help the AI to respect human values.
Here’s the shortest formulation of my idea:
You can split possible effects of AI’s actions into three domains. All of them are different (with different ideas), even though they partially intersect and can be formulated in terms of each other. Traditionally we focus on the first two domains:
1. (Not) accomplishing a goal. Utility functions are about this.
2. (Not) violating human values. Models of human feedback are about this.
3. (Not) modifying a system without breaking it. Impact measures are about this.
My idea is about combining all of this (mostly 2 and 3) into a single approach. Or generalizing ideas for the third domain. There aren’t a lot of ideas for the third one, as far as I know. Maybe people are not aware enough of that domain.
I meant that some AIs need to start with understanding human values (perfectly) and others don’t. Here’s an analogy:
Imagine a person who respects laws. She ends up in a foreign country. She looks up the laws. She respects and doesn’t break them. She has an abstract goal that depends on what she learns about the world.
Imagine a person who respects “killing people”. She ends up in a foreign country. She looks up the laws. She doesn’t break them for some time. She accumulates power. Then she breaks all the laws and kills everyone. She has a particular goal that doesn’t depend on anything she learns.
The point of my idea is to create an AI that respects abstract laws of systems, abstract laws of tasks. The AI of the 1st type. (Of course, in reality the distinction isn’t black and white, but the difference still exists.)
This is just my intuition, but it seems like the core intuition of a “money system” as you use it in the post is the same as the core intuition behind utility functions (ie, everything must have a price ≈ everything must have a quantifiable utility).
In utility-theoretic terms, this is like saying that money is an instrumental goal, not a terminal goal. Or at least, money as-terminal-goal has a low weight compared to other things (eg, human lives). Or perhaps more faithful to what you want: money as-terminal-goal is dependent on a context.[1][2]
So it seems to me like this still faces the same basic challenges as most other approaches, IE, making the system robustly care about external objects which we can’t get perfect feedback about. How do you get it to care about the context? How do you get it to think killing humans is “expensive”? How do you ask the system to “make paperclips that have the value of paperclips for humans”?
It seems like any proponent of #2 (human feedback, aka, value learning) would already agree with this idea; whereas your post gave me the sense that you think something more radical is here.
Reiterating the quote from the OP that I quoted before:
My best guess about how you want to combine #1 and #2 with #3 is that you want to infer the proper value of things from the environment. EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.
I remember some work a few years ago on this approach—specifically, using the built environment of humans (together with an assumption that humans are fairly good at optimizing for their own preferences) to infer human values. Sadly, I’m unable to find a reference; maybe it was never published? (Probably I’ve just forgotten the relevant keywords to search for)
The distinction between instrumental goals vs “terminal goals that depend on some context” is rather blurry, because the way we distinguish between terminal and instrumental goals (from the outside, behaviorally) is how much they vary based on context. (EG, if I take away the other basketball players, the audience, and the money, will one basketball player still try to perform a slam dunk?)
One reason for abandoning utility functions is, perhaps, an instinct that everything must be instrumental, because nothing is truly terminal. I discussed how to do this while keeping most of expected utility theory in An Orthodox Case Against Utility Functions.
The AI doesn’t have to know the precise price of everything. The AI needs to make sure that a price doesn’t break the desired properties of a system. If paperclips are worth more than everything else in the universe, it would destroy almost any system. So, this price is unlikely to be good.
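As a minimal sketch of that kind of check: the function below doesn’t need the “right” price of anything, it only flags a price list in which one item outweighs everything else combined. The function name, the item names, and the numbers are hypothetical, and “worth more than everything else” is only one of many properties a system might need to keep.

```python
def worth_more_than_everything_else(prices, item):
    """True if `item` is priced above all other items combined."""
    rest = sum(value for name, value in prices.items() if name != item)
    return prices[item] > rest


# Hypothetical prices in arbitrary units.
sane = {"paperclip": 0.01, "factory": 50_000.0, "human life": 1_000_000.0}
runaway = {**sane, "paperclip": 1e12}

print(worth_more_than_everything_else(sane, "paperclip"))     # price is fine
print(worth_more_than_everything_else(runaway, "paperclip"))  # price breaks the system
```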
There are two questions to ask:
1. How does the AI learn to care about this?
2. What do we gain by making the AI care about this?
If we aren’t discussing 100% answers, it’s very important to evaluate these questions in the context of each other. I don’t know the (full) answer to question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).
The point of my idea is that “human (meta-)ethics” is just a subset of a way broader topic. You can learn a lot about human ethics and the way humans expect you to fulfill their wishes before you encounter any humans or start to think about “values”. So, we can replace the questions “how to encode human values?” and even “how to learn human values?” with more general questions “how to learn (properties of systems)?” and “how to translate knowledge about (properties of systems) to knowledge about human values?”
In your proposal about normativity you do a similar “trick”:
You say that we can translate the method of learning language into a method of learning human values. (But language can be as complicated as human values themselves and you don’t say that we can translate results of learning a language into moral rules.)
I say that we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system). And I say that we can translate results of learning those simple systems into human moral rules. And that there are analogies of many important complicated properties (such as “corrigibility”) in simple systems.
So, I think this frame has the potential to make the problem a lot easier. Many approaches assume that you should start with learning the complicated system (values) and there’s nothing else you can do.
In a way my idea is more radical: we don’t start with encoding human values, but we don’t start with “value learning” either.
I think it’s a different approach, because we don’t have to start with human values (we could start with trying to fix universal AI “bugs”) and we don’t have to assume optimization.
I explained how I want to combine those in the context of the “What do we gain by caring about system properties?” question.
Now to the “How does AI learn to care about (reward) system properties?” question. Here I don’t have clear answers, only ideas. But I believe it’s a simpler question (compared to the one about human values). The AI needs to do two things:
1. Learn properties of systems. (Starting with very simple systems.)
2. Translate properties between different systems.
Maybe it’s useful to split the knowledge about systems into 3 parts:
1. Absolute knowledge: e.g. “taking absolute control of the system will destroy its (X) property”, “destroying the (X) property of the system may be bad”. This knowledge connects abstract actions to simple facts and tautologies.
2. Experience of many systems: e.g. “destroying the (X) property of this system is likely to be bad because it’s bad for many other systems” or “destroying (X) is likely to be bad because I’m 90% sure the human isn’t asking me to do the type of task where destroying (X) is allowed”.
3. Biases of a specific system: e.g. “for this specific system, “absolute control” means controlling about 90% of it”. This knowledge maps abstract actions/facts onto the structure of a specific system.
I have no idea how to learn 1 and 2. But I have an idea about 3 and/or a way to make 3 equivalent to 1 and 2. A way to make “learning biases” somewhat equivalent to “learning properties of systems”. Here I’m trying to do the same trick I did before: split a question, find the easier part, attack the harder part through the easier one.
How to make “learning biases” somewhat equivalent to “learning properties of systems”? I have these vague ideas (“Thought experiments. Recap” and “Rationality misses something?”) from the post:
Take a system (e.g. “movement of people”). Model simplified versions of this system on multiple levels (e.g. “movement of groups” and “movement of individuals”). Take a property of the system (e.g. “freedom of movement”). Describe a biased aggregation of this property on different levels. Choose actions that don’t violate this aggregation.
Take an element of the system (e.g. “sweets”) and its properties (e.g. “you can eat sweets, destroy sweets, ignore sweets...”). Describe other elements in terms of this element. Choose actions that don’t contradict this description.
I believe there’s a somewhat Bayesian-like way to think about this.
I want to give a specific example with a simple system. I’m not saying this example shows how to solve everything; it just lets me illustrate some ideas.
AI is in a room with 5 coins (in a video game). Each coin gives some reward points and respawns after some time. AI needs to collect 100 reward points.
AI models the system (“coins”) on two levels: “a single coin” (level 1) and “multiple coins” (level 2).
AI finds a glitch to keep respawning a single coin very fast. AI gets “punished” for this.
AI thinks “it’s probably because I was getting reward only from level 1”. So, now the AI tries to balance glitching and collecting multiple coins.
AI finds another bug: a way to modify the values of coins. But the AI doesn’t try to make a single coin worth 100 points, because that would probably be the same mistake: accumulating all the reward on level 1.
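The coin example can be sketched in a few lines. This is only an illustration under strong assumptions: “level 1” is read as “reward coming from one single coin”, “level 2” as “reward spread over several coins”, and the punishment is modeled as a fixed cap on the share of reward any one coin may contribute. The function names and the `max_share` threshold are hypothetical.

```python
from collections import Counter


def single_coin_share(collected):
    """Largest fraction of total reward that came from one coin.

    `collected` is a list of (coin_id, points) pickups.
    """
    totals = Counter()
    for coin_id, points in collected:
        totals[coin_id] += points
    total = sum(totals.values())
    return max(totals.values()) / total if total else 0.0


def reward_stuck_on_level_1(collected, max_share=0.5):
    """True if too much of the reward was accumulated on a single coin."""
    return single_coin_share(collected) > max_share


# Respawn glitch: the same coin is collected ten times.
respawn_glitch = [(0, 10)] * 10
# Value-editing bug: one coin is rewritten to be worth 100 points.
edited_value = [(0, 100)]
# Ordinary play: each of the 5 coins is collected twice.
ordinary_play = [(coin, 10) for coin in range(5) for _ in range(2)]

for plan in (respawn_glitch, edited_value, ordinary_play):
    print(reward_stuck_on_level_1(plan))
```

Both bugs trip the same check even though they are mechanically different, which is the point of the example: the second bug is avoided without separate feedback about it. The sketch says nothing about how the AI would arrive at this particular split into levels.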
I agree with the overall argument structure to some extent. IE, in general, we should separate the question of what we gain from X from the question of how to achieve it, and not having answered one of those questions should not block us from considering the other.
However, to me, your “what do we gain” claims are already established to be quite large. In the dialogues (about candy and movement), it seems like the idea is that everything works out nicely, in full generality. You aren’t just claiming a few good properties; you seem to be saying “and so on”.
(To be more specific to avoid confusion, you aren’t only claiming that valuing candy doesn’t result in killing humans or hacking human values. You also seem to be saying that valuing candy in this way wouldn’t throw away any important aspect of human values at all. The candy-AI wouldn’t set human quality of life to dirt-poor levels, even if it were instrumentally useful for diverting resources to ensure the daily availability of candy. The AI also wouldn’t allow a preventable hostile invasion by candy-loving aliens-which-count-as-humans-by-some-warped-definition. etc etc etc)
Therefore, in this particular case, I have relatively little interest in further elaborating the “what do we gain” side of things. The “how are we supposed to gain it” question seems much more urgent and worthy of discussion.
To use an analogy, if you told me that they knew a quick way to make $20, I might ask “why are we so worried about getting $20?”. But if you tell me you know a quick way to make a billion dollars, I’m going to be much less interested in the “why” question and much more interested in the “how” question.
TBH, I don’t really believe this is true, because I don’t think you’ve pinned down what “this” even is. IE, we can expand your set of two questions into three:
1. How do we get X?
2. What is X good for?
3. What is X, even?
You’ve labeled X with terms like “reward economics” and “money system”, but you haven’t really defined those things. So your arguments about what we can gain from them are necessarily vague. As I mentioned before, the general idea of assigning a value (price) to everything is fully compatible with utility theory, but obviously you also further claim that your approach is not identical to utility theory. I hope this point helps illustrate why I feel your terms are still not sufficiently defined.
(My earlier question took the form of “how do we get X”, but really, that’s because I was replying to a specific point rather than starting at the beginning. What I most need to understand better at the moment is ‘what is X, even?’.)[1]
We have already to some extent replaced the question “how do you learn human values?” with the question “how do we robustly point at anything external to the system, at all?”. One variation of this which we often consider is “how can a system reliably parse reality into objects”—this is like John Wentworth’s natural abstraction program.
I don’t know whether you think this is at all in the right direction (I’m not trying to claim it’s identical to your approach or anything like that), but it currently seems to me more concrete and well-defined than your “how to learn properties of systems”.
The way you bracket this suggests to me that you think “how to learn” is already a fair summary, and “properties of systems” is actually pointing at something extremely general. Like, maybe “properties of systems” is really a phrase that encompasses everything you can learn?
If this were the correct interpretation of your words, then my response would be: I’m not going to claim that we’ve entirely mastered learning, but it seems surprising to claim that studying how we learn about the properties of very simple systems (systems that we can already learn quite easily using modern ML?) would be the key.
Since you are relating this to my approach: I would say that the critical difference, for me, is precisely the human involvement (or more generally, the involvement of many capable agents). This creates social equilibria (and non-equilibrium behaviors) which form the core of normativity.
An abstract decision-theoretic agent has no norms and no need for norms, in part because it treats its environment as nonliving, nonthinking, and entirely external. A single person existing over time already has a need for norms, because coordinating with yourself over time is hard.
But any system which contains agents is not “simple”. Or at least, I don’t understand the sense in which it is simple.
I don’t understand what you mean about not assuming optimization. But, I would object that the approach I mentioned (learning values from the environment) doesn’t need to “start with human values” either. Hypothetically, you could try an approach like this with no preconceived concept of “human” at all; you just make a generic assumption that the environments you encounter have been optimized to a significant extent (by some as-yet-unknown actor).
Notably, this approach would have the obvious risk of the AI deciding that too many of the properties of the current world are “good” (for example, people dying, people suffering). On my understanding, your current proposal also suffers from this critique. (You make lots of arguments about how your ideas might help the AI to decide not to change things about the world; you make few-to-no arguments about such an AI deciding to actually improve the world in some way. Well, on my understanding so far.)
However, not killing all humans is such a big win that we can ignore small issues like that for now. Returning to my earlier analogy, the first question that occurs to me is where the billion dollars is coming from, not whether the billion will be enough.
In the context you’re replying to, I was trying to propose more concrete ideas for your consideration, as opposed to reiterating what you said.
Although this will be appropriate (even necessary!) in some cases, the trick is a dangerous one in general. Often you want to tackle the harder sub-problems first, so that you fail as soon as possible. Otherwise, you can spend years on a research program that splits off the easiest fractions of your grand plan, only to realize later that the harder parts of your plan were secretly impossible. So the strategy sets you up to potentially waste a lot of time!
I don’t really understand the motivation behind this division, but, it sounds to me like you require normative feedback to learn these types of things. You keep saying things like “is likely to be bad” and “is likely to be good”. But it’s difficult to see how to derive ideas about “bad” and “good” from pure observation with no positive/negative feedback.
I don’t understand much of what is going on in this paragraph.
It sounds to me like you are trying to cross the is/ought divide: first the AI learns descriptive facts about a system, and then the AI is supposed to derive normative principles (action-choice principles) from those descriptive facts. Is that an accurate assessment?
One concern I have is that if the description is accurate enough, then it seems like it should either (a) not constrain action, because you’ve learned the true invariant properties of the system which can never be violated (eg, the true laws of physics); or, on the other hand, (b) constrain action for the entirely wrong reasons.
An example of (b) would be if the learning algorithm learns enough to fully constrain actions, based on patterns in the AI actions so far. Since the AI is part of any system it is interacting with, it’s difficult to rule out the AI learning its own patterns of action. But it may do this early, based on dumb patterns of action. Furthermore, it may misgeneralize the actions so far, “wrongly” thinking that it takes actions based on some alien decision procedure. Such a hypothesis will never be ruled out in the future, and indeed is liable to be confirmed, since the AI will make its future acts conform to the rules as it understands them.
I don’t really understand what it means to model the system on each of these levels, which harms my understanding of the rest of this argument. (“How can you model the system as a single coin?”)
My attempt to translate things into terms I can understand is: the AI has many hypotheses about what is good. Some of these hypotheses would encourage the AI to exploit glitches. However, human feedback about what’s good has steered the system away from some glitch-exploits in the past. The AI probabilistically generalizes this idea, to avoid exploiting behaviors of the system which seem “glitch-like” according to its understanding.
But, this interpretation seems to be a straightforward value-learning approach, while you claim to be pointing at something beyond simple value learning ideas.
After finishing this long comment, I noticed the inconsistency: I continue to ask “how do we get X?” type questions rather than “what is X?” type questions. In retrospect, I don’t like my “billion dollars” analogy as much as I did when I first wrote it. Part of the problem is that when “X” is still fuzzy, it can shift locations in the causal chain as we focus on different aspects of the conversation. So for example, X could point to the “money system”, or X could end up pointing to some desirable properties which are upstream/downstream of “money systems”. But as X shifts up/downstream, there are some Y which switch between “how-relevant” and “why-relevant”. (Things that are upstream of X are how-relevant; things that are downstream of X are why-relevant.) So it doesn’t make sense for me to keep mentioning that I’m more interested in how-questions than why-questions, when I’m not sure exactly where the definition of X will sit in the causal chain. I should, at best, have some other reasons for not being very interested in certain questions. But I don’t want to re-write the relevant portions of what I wrote. It still represents my epistemic state better than not having written it.
I think we have slightly different tricks in mind: I’m thinking about a trick that any idea does. It’s like solving an equation with an unknown: it doesn’t matter what you do, you split and recombine it in some way.
Or you could compare it to Iterated Distillation and Amplification: when you try to repeat the content of a more complicated thing in a simpler thing.
Or you could compare it to scientific theories: science still hasn’t answered “why do things move?”, but it split the question into subatomic pieces.
So, with this strategy, the smaller the piece you cut, the better. Because we’re not talking about independent pieces.
I think a definition doesn’t matter for (not) believing in this. And it’s specific enough without a definition. I believe this:
There exist similar statements outside of human ethics/values which can be easily charged with human ethics/values. Let’s call them “X statements”. An X statement is “true” when it’s true for humans.
X statements are more fine-grained and specific than moral statements, but equally broad. Which means “for 1 moral statement there are 10 true X statements” (numbers are arbitrary) or “for 1 example of a human value there are 10 examples of an X statement being true” or “for 10 different human values there are 10 versions of the same X statement” or “each vague moral statement corresponds to a more specific X statement”. X statements have higher “connectivity”.
To give an example of a comparison between moral and X statements:
“Human asked you to make paperclips. Would you turn the human into paperclips? Why not?”
Goal statement: “not killing the human is a part of my goal”.
Moral statements: “because life/personality/autonomy/consent is valuable”. (what is “life/personality/autonomy/consent”?)
X statements: “if you kill, you give the human less than human asked”, “destroying the causal reason of your task is often meaningless”, “inanimate objects can’t be worth more than lives in many economies”, “it’s not the type of task where killing would be an option”, “killing humans destroys the value of paperclips: humans use them”, “reaching states of no return often should be avoided” (Impact Measures).
X statements are applicable outside of human ethics/values, there’s more of them and they’re more specific, especially in context of each other. (meanwhile values can be hopeless to define: you don’t even know where to start in defining values and adding more values only makes everything more complicated)
To not believe in my idea/consider it “too vague” you need to deny the similarity between X statements or deny their properties.
But I think the idea of X statements should be acknowledged anyway. At least as a hypothetical possibility.
...
Here are some answers to questions and thoughts from your reply:
I didn’t understand your answer about normativity (involvement of agents), but I wanted to say this: I believe X statements are more fine-grained and specific (but equally broad) compared to statements about normativity.
Yes, we need human feedback to “charge” X statements with our values and ethics. But X statements are supposed to be more easily charged compared to other things.
X statements don’t abolish the is/ought divide, but they’re supposed to narrow it down.
Maybe X statements are compatible with utility theory and can be expressed in it. But it doesn’t mean that “utility theory statements” have the same good properties. The same way you could try to describe intuitions about ethics using precise goals, but “intuitions” have better properties.
You can apply value learning methods outside of human ethics/values, but it doesn’t mean that “value learning statements” have the same good properties as X statements. That’s one reason to divide “How do we learn this?” and “What do we gain by learning it?” questions.
I didn’t understand upstream/downstream and “how-relevant”/”why-relevant” distinctions, but I hope I answered enough for now.
I think X statements have better properties compared to “statements about external objects”. And it’s easier to distinguish external objects from internal objects using X statements. Because internal objects have many weird properties.
I described the idea of the X statements. But those statements need to be described in some language or created by some process. I have some ideas about this language/process. And my answers below are mostly about the language/process:
The division was for splitting and recombining parts of the is–ought problem:
1. To even think/care that “harming people may be bad”, the AI needs to be able to form such statements in its moral core.
2. To verify whether harming people is bad, the AI needs a channel of feedback that can reach its moral core.
3. Once the AI has verified that “harming people is bad”, it needs to understand how much “harm” counts as harm. Abstract statements may need some fine-tuning to fit the real world.
I think we can make point 3 equivalent to points 1 and 2: we can make the fine-tuning of abstract “ought” statements equivalent to forming them. Or something to that effect.
It’s a restatement of the “Motion is the fundamental value” thought experiment. You have an environment with many elements on different scales (e.g. micro- and macroorganisms). Those elements share a property: they have freedom of movement. This property exists on different scales (e.g. microorganisms do both small-scale and large-scale movement).
The “fundamental value” of this environment is described by an aggregation of this property over multiple scales. To learn this value means to learn how it’s distributed over different scales of the environment.
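As a toy illustration of “how the property is distributed over scales”, here is a minimal sketch under strong simplifying assumptions. None of it comes from the thought experiment itself: the 1-D positions, the use of time windows as “scales”, and the names `mean_displacement` and `movement_profile` are all hypothetical, chosen only to make the aggregation concrete.

```python
from typing import Dict, List, Sequence


def mean_displacement(path: Sequence[float], window: int) -> float:
    """Average absolute displacement of one element over `window` time steps."""
    steps = [abs(path[i + window] - path[i]) for i in range(len(path) - window)]
    return sum(steps) / len(steps) if steps else 0.0


def movement_profile(
    paths: List[Sequence[float]], windows: Sequence[int]
) -> Dict[int, float]:
    """Share of the total observed movement found at each scale (shares sum to 1)."""
    # Assumption: "scale" is modelled as the length of the time window.
    raw = {
        w: sum(mean_displacement(p, w) for p in paths) / len(paths)
        for w in windows
    }
    total = sum(raw.values())
    return {w: (v / total if total else 0.0) for w, v in raw.items()}


# Hypothetical 1-D positions over time for two elements.
jittery = [0, 1, 0, 1, 0, 1, 0, 1, 0]   # lots of small-scale motion, no net drift
drifting = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # steady large-scale motion

profile = movement_profile([jittery, drifting], windows=[1, 4])
print(profile)
```

In this sketch, “learning the value” would correspond to estimating a profile like this from observations. An action that flattens or distorts the profile (say, one that removes all large-scale movement) would then show up as a change in the distribution, not just as a change in one number.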
Sorry for the confusion. Maybe it’s better to say that the AI cuts its model of the environment into multiple scales. A single coin (taking a single coin) is the smallest scale.
Yes, the AI has hypotheses, but those hypotheses should have specific properties. Those properties are the key part.
The hypothesis “I should avoid behavior which seems glitch-like” has awful properties: it can’t be translated into human ethics (when the AI grows up), and it may age like milk when the AI becomes smarter and its notion of “glitch-like” changes.
A process that generates such hypotheses doesn’t generate X statements.
Could you give a specific example? If I understand correctly: the AI destroys some paintings while doing something and learns that “paintings are things you can destroy for no reason”. I want to note that human feedback is allowed.