(Probably a stupid nooby question that won’t help solve alignment)
Suppose you implement a goal in an AI through a reinforcement learning system. Why does the AI really “care” about this goal? Why does it obey? It does because it is punished and/or rewarded, which motivates it to achieve that goal.
Okay. So why does the AI really care about punishment and reward in the first place? Why does it follow its implemented goal?
Sentient beings do because they feel pain and pleasure. They have no choice but to care about punishment and reward. They inevitably do it because they feel it. Assuming that our AI does not feel, what is the nature of its system of punishments and rewards? How is it possible to punish or reward a non-sentient agent?
My intuitive response would be “It is just physics. What we call ‘reward’ and ‘punishment’ are just elements of a program forcing an agent to do something”, but I don’t understand how this RL physics is different from that in our carbon-based animal brains.
“Do Artificial Reinforcement Learners Matter Morally?”, written by Brian Tomasik, makes the distinction even less obvious for me. What am I missing?
An oversimplified picture of a reinforcement-learning agent (in particular, roughly a Q-learning agent with a single state) could be as follows. A program has two numerical variables: go_left and go_right. The agent chooses to go left or right based on which of these variables is larger. Suppose that go_left is 3 and go_right is 1. The agent goes left. The environment delivers a “reward” of −4. Now go_left gets updated to 3 − 4 = −1 (which is not quite the right math for Q-learning, but ok). So now go_right > go_left, and the agent goes right.
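To make that picture concrete, here is a minimal Python sketch of the single-state agent described above. The variable names (go_left, go_right) and the numbers (3, 1, −4) follow the example; the learning rate of 1, the simple “add the reward” update, and the +1 reward for going right are assumptions chosen to reproduce the 3 − 4 = −1 arithmetic rather than the full Q-learning rule.

```python
# Toy single-state agent from the example above (illustrative sketch, not real Q-learning).

def choose_action(values):
    """Pick whichever action currently has the larger value."""
    return max(values, key=values.get)

# Action-value variables, initialised as in the example.
values = {"go_left": 3.0, "go_right": 1.0}

# A toy environment: going left is punished; the +1 for going right is an assumed value.
rewards = {"go_left": -4.0, "go_right": 1.0}

for step in range(3):
    action = choose_action(values)
    reward = rewards[action]
    # Simplified update: just add the reward to the chosen action's value.
    # Proper Q-learning would instead use
    # values[a] += alpha * (reward + gamma * max_future_value - values[a]).
    values[action] += reward
    print(step, action, reward, values)
```

Running it shows the switch the paragraph describes: after the −4 reward, go_left drops to −1, go_right > go_left, and the agent goes right on subsequent steps.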
So what you said is exactly correct: “It is just physics. What we call ‘reward’ and ‘punishment’ are just elements of a program forcing an agent to do something”. And I think our animal brains do the same thing: they receive rewards that update our inclinations to take various actions. However, animal brains have lots of additional machinery that simple RL agents lack. The actions we take are influenced by a number of cognitive processes, not just the basic RL machinery. For example, if we were just following RL mechanically, we might keep eating candy for a long time without stopping, but our brains are also capable of influencing our behavior via intellectual considerations like “Too much candy is bad for my health”. It’s possible these intellectual thoughts lead to their own “rewards” and “punishments” that get applied to our decisions, but at least it’s clear that animal brains make choices in very complicated ways compared with barebones RL programs.
You wrote: “Sentient beings do because they feel pain and pleasure. They have no choice but to care about punishment and reward.” The way I imagine it (which could be wrong) is that animals are built with RL machinery (along with many other cognitive mechanisms) and are mechanically driven to care about their rewards in a similar way as a computer program does. They also have cognitive processes for interpreting what’s happening to them, and this interpretive machinery labels some incoming sensations as “good” and some as “bad”. If we ask ourselves why we care about not staying outside in freezing temperatures without a coat, we say “I care because being cold feels bad”. That’s a folk-psychology way to say “My RL machinery cares because being outside in the cold sends rewards of −5 at each time step, and taking the action of going inside changes the rewards to +1. And I have other cognitive machinery that can interpret these −5 and +1 signals as pain and pleasure and understand that they drive my behavior.”
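As a hedged toy extension of the same sketch, one could bolt an “interpretive” layer onto the RL machinery, in the spirit of the cold-weather example. The −5 and +1 per-time-step rewards come from the paragraph above; the action names, function names, and the labelling rule are made up purely for illustration.

```python
# Toy illustration of "RL machinery + interpretive machinery" (assumed names and threshold).

def interpret(reward):
    """Interpretive machinery: label the raw reward signal in folk-psychology terms."""
    return "feels bad" if reward < 0 else "feels good"

values = {"stay_outside": 0.0, "go_inside": 0.0}
rewards = {"stay_outside": -5.0, "go_inside": 1.0}   # per-time-step rewards from the example

for step in range(4):
    action = max(values, key=values.get)   # RL machinery picks the higher-valued action
    reward = rewards[action]
    values[action] += reward               # reward shapes future choices
    print(step, action, reward, interpret(reward))
```

The RL update drives the behaviour (the agent soon stays inside); the interpret function is the part that, on this account, corresponds to experiencing the −5 as pain and the +1 as pleasure.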
Assuming this account is correct, the main distinction between simple programs and ourselves is one of complexity—how much additional cognitive machinery there is to influence decisions and interpret what’s going on. That’s the reason I argue that simple RL agents have a tiny bit of moral weight. The difference between them and us is one of degree.
Seems to me that there must be more to pain and pleasure than mere −1 and +1 signals, because there are multiple ways to make some behavior more or less likely. Pain and pleasure are one such option, habits are another, unconscious biases yet another. Each of them makes some behavior more likely and other behavior less likely, but they feel quite different from the inside. Compared to habits and unconscious biases, pain and pleasure have some extra quality because of how they are implemented in our bodies.
Simple RL agents, unless they have the specific circuits to feel pain and pleasure, are in my opinion more analogous to habits or unconscious biases.
Thanks. :) What do you mean by “unconscious biases”? Do you mean unconscious RL, like how the muscles in our legs might learn to walk without us being aware of the feedback they’re getting? (Note: I’m not an expert on how our leg muscles actually learn to walk, but maybe it’s RL of some sort.) I would agree that simple RL agents are more similar to that. I think these systems can still be considered marginally conscious to themselves, even if the parts of us that talk have no introspective access to them, but they’re much less morally significant than the parts of us that can talk.
Perhaps pain and pleasure are what we feel when getting punishment and reward signals that are particularly important for our high-level brains to pay attention to.