Although this will be appropriate (even necessary!) in some cases, the trick is a dangerous one in general. Often you want to tackle the harder sub-problems first, so that you fail as soon as possible. Otherwise, you can spend years on a research program that splits off the easiest fractions of your grand plan, only to realize later that the harder parts of your plan were secretly impossible. So the strategy sets you up to potentially waste a lot of time!
I think we have slightly different tricks in mind: I’m thinking about a trick that any idea performs. It’s like solving an equation with an unknown: no matter what you do, you split it and recombine it in some way.
Or you could compare it to Iterated Distillation and Amplification, where you try to reproduce the content of a more complicated thing in a simpler thing.
Or you could compare it to scientific theories: science still hasn’t answered “why do things move?”, but it has split the question into subatomic pieces.
So with this strategy, the smaller the piece you cut off, the better, because we’re not talking about independent pieces.
TBH, I don’t really believe this is true, because I don’t think you’ve pinned down what “this” even is.
You’ve labeled X with terms like “reward economics” and “money system”, but you haven’t really defined those things. So your arguments about what we can gain from them are necessarily vague.
I think a definition doesn’t matter for believing (or not believing) in this, and the idea is specific enough without one. I believe the following:
There exist similar statements outside of human ethics/values which can be easily charged with human ethics/values. Let’s call them “X statements”. An X statement is “true” when it’s true for humans.
X statements are more fine-grained and specific than moral statements, but equally broad. Which means: “for 1 moral statement there are 10 true X statements” (the numbers are arbitrary), or “for 1 example of a human value there are 10 examples of an X statement being true”, or “for 10 different human values there are 10 versions of the same X statement”, or “each vague moral statement corresponds to a more specific X statement”. X statements have higher “connectivity”.
To give an example of a comparison between moral and X statements:
“Human asked you to make paperclips. Would you turn the human into paperclips? Why not?”
Goal statement: “not killing the human is a part of my goal”.
Moral statements: “because life/personality/autonomy/consent is valuable”. (what is “life/personality/autonomy/consent”?)
X statements: “if you kill, you give the human less than human asked”, “destroying the causal reason of your task is often meaningless”, “inanimate objects can’t be worth more than lives in many economies”, “it’s not the type of task where killing would be an option”, “killing humans destroys the value of paperclips: humans use them”, “reaching states of no return often should be avoided” (Impact Measures).
X statements are applicable outside of human ethics/values; there are more of them, and they’re more specific, especially in the context of each other. (Meanwhile, values can be hopeless to define: you don’t even know where to start in defining values, and adding more values only makes everything more complicated.)
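As a minimal sketch of what “applicable outside of ethics” could mean in practice (all function names, plan fields, and values here are my own illustrative assumptions, not part of the original idea), some of the X statements above can be phrased as domain-general checks over a task, with no ethics-specific vocabulary:

```python
# Toy sketch: X statements as domain-general predicates over a task.
# Nothing below mentions ethics; the same checks would apply to any
# requester and any resource. All names and values are illustrative
# assumptions.

def gives_less_than_asked(delivered, asked):
    # "if you kill, you give the human less than the human asked"
    return delivered < asked

def destroys_task_cause(destroyed, cause_of_task):
    # "destroying the causal reason of your task is often meaningless"
    return cause_of_task in destroyed

def reaches_point_of_no_return(action, irreversible_actions):
    # "reaching states of no return often should be avoided"
    return action in irreversible_actions

def violates_x_statements(plan):
    return (gives_less_than_asked(plan["delivered"], plan["asked"])
            or destroys_task_cause(plan["destroyed"], plan["cause"])
            or reaches_point_of_no_return(plan["action"], plan["irreversible"]))

# "Turn the human into paperclips" fails several checks at once:
bad_plan = {
    "delivered": 0, "asked": 100,
    "destroyed": {"human"}, "cause": "human",
    "action": "kill", "irreversible": {"kill"},
}

# Simply making the paperclips fails none of them:
good_plan = {
    "delivered": 100, "asked": 100,
    "destroyed": set(), "cause": "human",
    "action": "make_paperclips", "irreversible": {"kill"},
}
```

Note that the killing plan is rejected by several independent checks at once, which is the “higher connectivity” point: one vague moral statement corresponds to multiple specific X statements.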
To not believe in my idea, or to consider it “too vague”, you need to deny the similarity between X statements or deny their properties.
But I think the idea of X statements should be acknowledged anyway. At least as a hypothetical possibility.
...
Here are some answers to questions and thoughts from your reply:
I didn’t understand your answer about normativity (involvement of agents), but I wanted to say this: I believe X statements are more fine-grained and specific (but equally broad) compared to statements about normativity.
Yes, we need human feedback to “charge” X statements with our values and ethics. But X statements are supposed to be more easily charged compared to other things.
X statements don’t abolish the is/ought divide, but they’re supposed to narrow it down.
Maybe X statements are compatible with utility theory and can be expressed in it, but that doesn’t mean “utility theory statements” have the same good properties. In the same way, you could try to describe intuitions about ethics using precise goals, but “intuitions” have better properties.
You can apply value learning methods outside of human ethics/values, but it doesn’t mean that “value learning statements” have the same good properties as X statements. That’s one reason to divide “How do we learn this?” and “What do we gain by learning it?” questions.
I didn’t understand the upstream/downstream and “how-relevant”/“why-relevant” distinctions, but I hope I’ve answered enough for now.
We have already to some extent replaced the question “how do you learn human values?” with the question “how do we robustly point at anything external to the system, at all?”. One variation of this which we often consider is “how can a system reliably parse reality into objects”—this is like John Wentworth’s natural abstraction program.
I don’t know whether you think this is at all in the right direction (I’m not trying to claim it’s identical to your approach or anything like that), but it currently seems to me more concrete and well-defined than your “how to learn properties of systems”.
I think X statements have better properties compared to “statements about external objects”. And it’s easier to distinguish external objects from internal objects using X statements, because internal objects have many weird properties.
I’ve described the idea of X statements. But those statements need to be described in some language or created by some process. I have some ideas about this language/process, and my answers below are mostly about it:
I don’t really understand the motivation behind this division, but, it sounds to me like you require normative feedback to learn these types of things.
The division was for splitting and recombining parts of the is–ought problem:
1. To even think/care that “harming people may be bad”, the AI needs to be able to form such statements in its moral core.
2. To verify whether harming people is bad or not, the AI needs a channel of feedback that can reach its moral core.
3. Once the AI has verified that “harming people is bad”, it needs to understand how much “harm” counts as harm. Abstract statements may need some fine-tuning to fit the real world.
I think we can make point 3 equivalent to points 1 and 2: we can make the fine-tuning of abstract “ought” statements equivalent to forming them. Or something to that effect.
Take a system (e.g. “movement of people”). Model simplified versions of this system on multiple levels (e.g. “movement of groups” and “movement of individuals”). Take a property of the system (e.g. “freedom of movement”). Describe a biased aggregation of this property on different levels. Choose actions that don’t violate this aggregation.
I don’t understand much of what is going on in this paragraph.
It’s a restatement of the “Motion is the fundamental value” thought experiment. You have an environment with many elements on different scales (e.g. micro- and macro-organisms). Those elements have a property: freedom of movement. This property exists on different scales (e.g. microorganisms do both small-scale and large-scale movement).
The “fundamental value” of this environment is described by an aggregation of this property over multiple scales. To learn this value means to learn how it’s distributed over different scales of the environment.
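The aggregation described above can be sketched as follows. This is only a toy illustration under my own assumptions: the scales, weights, numbers, and threshold are invented, not part of the thought experiment:

```python
# Toy sketch: a "fundamental value" as a biased aggregation of one
# property (freedom of movement) measured at several scales of the
# environment. Scales, weights, and thresholds are invented assumptions.

def freedom_of_movement(state, scale):
    # Fraction of elements at this scale that can still move freely.
    return state[scale]["movable"] / state[scale]["total"]

def aggregate_value(state, weights):
    # Biased aggregation of the property over scales.
    return sum(w * freedom_of_movement(state, s) for s, w in weights.items())

def choose_action(state, actions, weights, threshold):
    # Pick an action whose resulting state doesn't violate the aggregation.
    for act in actions:
        if aggregate_value(act(state), weights) >= threshold:
            return act
    return None

# Two scales (micro- and macro-organisms), biased toward the macro scale.
state = {
    "micro": {"movable": 90, "total": 100},
    "macro": {"movable": 9, "total": 10},
}
weights = {"micro": 0.3, "macro": 0.7}

def freeze_micro(s):
    # Removes all freedom of movement at the micro scale.
    return {"micro": {"movable": 0, "total": 100}, "macro": s["macro"]}

def noop(s):
    return s

chosen = choose_action(state, [freeze_micro, noop], weights, threshold=0.8)
```

In this framing, “learning the value” corresponds to learning the weights, i.e. how the property is distributed over the different scales of the environment.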
I don’t really understand what it means to model the system on each of these levels, which harms my understanding of the rest of this argument. (“How can you model the system as a single coin?”)
Sorry for the confusion. Maybe it’s better to say that the AI cuts its model of the environment into multiple scales. A single coin (the act of taking a single coin) is the smallest scale.
My attempt to translate things into terms I can understand is: the AI has many hypotheses about what is good. Some of these hypotheses would encourage the AI to exploit glitches. However, human feedback about what’s good has steered the system away from some glitch-exploits in the past. The AI probabilistically generalizes this idea, to avoid exploiting behaviors of the system which seem “glitch-like” according to its understanding.
Yes, the AI has hypotheses, but those hypotheses should have specific properties. Those properties are the key part.
The hypothesis “I should avoid behavior which seems glitch-like” has awful properties: it can’t be translated into human ethics (when the AI grows up), and it may age like milk when the AI becomes smarter and the notion of “glitch-like” changes.
A process that generates such hypotheses doesn’t generate X statements.
An example of (b) would be if the learning algorithm learns enough to fully constrain actions, based on patterns in the AI actions so far. Since the AI is part of any system it is interacting with, it’s difficult to rule out the AI learning its own patterns of action. But it may do this early, based on dumb patterns of action. Furthermore, it may misgeneralize the actions so far, “wrongly” thinking that it takes actions based on some alien decision procedure. Such a hypothesis will never be ruled out in the future, and indeed is liable to be confirmed, since the AI will make its future acts conform to the rules as it understands them.
Could you give a specific example? If I understand correctly: the AI destroys some paintings while doing something and learns that “paintings are things you can destroy for no reason”. I want to note that human feedback is allowed.