There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine.
But obviously these conditions aren’t true in the real world.
I think they are a little? Some people do travel to other countries for easier and better drug access. And some people become total drug addicts (perhaps arguably by miscalculating their long-term reward consequences and having too-high a discount rate, oops), while others do a light or medium amount of drugs longer-term.
Lots of people also don’t do this, but there’s a huge amount of information uncertainty, outcome uncertainty, and risk associated with drugs (health-wise, addiction-wise, knowledge-wise, crime-wise, etc), so lots of fairly rational (particularly risk-averse) folks will avoid it.
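The discount-rate point above can be made concrete with a toy calculation (all numbers invented for illustration): with a steep enough per-step discount, a large immediate reward outweighs even a long stream of future costs, so the "addiction" choice looks rational to the myopic agent.

```python
# Toy illustration (made-up numbers): how a high discount rate can make
# an immediate reward outweigh long-term costs.

def discounted_value(immediate_reward, per_step_cost, horizon, gamma):
    """Value of: get `immediate_reward` now, then pay `per_step_cost`
    every step for `horizon` steps, discounted by factor `gamma`."""
    future_cost = sum(per_step_cost * gamma**t for t in range(1, horizon + 1))
    return immediate_reward - future_cost

# A patient agent (gamma near 1) sees the choice as clearly negative...
patient = discounted_value(10.0, 1.0, 50, gamma=0.99)
# ...while a myopic agent (low gamma, i.e. a high discount rate) sees it as positive.
myopic = discounted_value(10.0, 1.0, 50, gamma=0.5)

print(f"patient agent's value: {patient:+.2f}")  # negative
print(f"myopic agent's value:  {myopic:+.2f}")   # positive
```

Same action, opposite sign on the valuation, purely from how heavily the future is discounted.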
Button-pressing will perhaps be seen by AIs as a socially unacceptable, risky behavior that can lead to poor long-term outcomes, but I guess the key thing here is that you want, like, exactly zero powerful AIs to ever choose to destroy/disempower humanity in order to wirehead, rather than just a low percentage, so you need them to be particularly risk-averse.
Delicious food is perhaps a better example of wireheading in humans. In this case, it’s not against the law, it’s not that shunned socially, and it is ***absolutely ubiquitous***. In general, any positive chemical feeling we have in our brains (whether from drugs or cheeseburgers) can be seen as an (often “internally misaligned”) instrumental goal that we are mesa-optimizing. It’s just that some pathways to those feelings are a lot riskier and more uncertain than others.
And I guess this can translate to RL—an RL agent won’t try everything, but if the risk is low and the expectation is high, it probably will try it. If pressing a button is easy and doesn’t conflict with taking out the trash and doing other things it wants to do, it might try it. And as its generalization capabilities increase, its confidence can make this more likely, I think. So you should therefore increasingly train agents to be more risk-averse and less willing to break specific rules and norms as their generalization capabilities increase.
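One way to cash out "train agents to be more risk-averse" is a mean-minus-variance action score, a standard risk-sensitive criterion (the actions and reward samples below are invented for illustration): as the risk penalty grows, the agent stops picking the high-expectation, high-variance "button" action.

```python
import statistics

# Hypothetical per-action reward samples (invented numbers): "press_button"
# has a higher mean reward but much higher variance than the mundane task.
outcomes = {
    "take_out_trash": [1.0, 1.1, 0.9, 1.0],
    "press_button":   [10.0, 8.0, -20.0, 12.0],
}

def risk_adjusted_score(samples, risk_aversion):
    """Mean-minus-variance criterion: penalize outcome spread, not just low mean."""
    return statistics.mean(samples) - risk_aversion * statistics.variance(samples)

def choose(risk_aversion):
    """Pick the action with the best risk-adjusted score."""
    return max(outcomes, key=lambda a: risk_adjusted_score(outcomes[a], risk_aversion))

print(choose(risk_aversion=0.0))  # risk-neutral: presses the button
print(choose(risk_aversion=0.1))  # risk-averse: sticks to the trash
```

The risk-neutral agent chases the 2.5 expected reward of the button; a modest variance penalty flips the choice back to the safe, boring task.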
Delicious food does seem like a good (but IMO weak) point in favor of reward-optimization, and pushes up my P(AI cares a lot about reward terminally) a tiny bit. But also note that lots of people (including myself) don’t care very much about delicious food, and it seems like the vast majority of people don’t make their lives primarily about delicious food or other tight correlates of their sensory pleasure circuits.
If pressing a button is easy and doesn’t conflict with taking out the trash and doing other things it wants to do, it might try it.
This is compatible with one intended main point of this essay, which is that while reward optimization might be a convergent secondary goal, it probably won’t be the agent’s primary motivation.
it seems like the vast majority of people don’t make their lives primarily about delicious food
That’s true. There are built-in decreasing marginal returns to eating massive quantities of delicious food (you get full), but we don’t see a huge number of—for example—bulimics who are bulimic for the core purpose of being able to eat more.
However, I’d mention that yummy food is only one of many things that our brains are hard-wired to mesa-optimize for. Social acceptance and social status (particularly within the circles we care about, i.e. usually the circles we are likely to succeed in and benefit from) are very big examples that much of our behavior can be ascribed to.
reward optimization might be a convergent secondary goal, it probably won’t be the agent’s primary motivation.
So I guess, reflecting this back to humans, would you argue that most humans’ primary motivations aren’t driven mostly by the various mesa-objectives our brains are hardwired to have? In my mind this is a hard sell, as most things humans do you can trace back (sometimes incorrectly, sure) to something that was evolutionarily advantageous (a mesa-objective that led to genetic fitness). The whole field of evolutionary biology specializes in coming up with (hard-to-prove and sometimes convoluted) explanations here relating to both our behavior and physiology.
For example, you could argue that us posting hopefully smart things here is giving our brains happy juice relating to social status / intelligence signaling / social interaction, which in our evolutionary history increased the probability that we would find high quality partners to make lots of high quality babies with. I guess, if mesa-objectives aren’t the primary drivers of us humans—what is, and how can you be sure?
Yeah, the food that is served in fast food restaurants, and arguably much of modern society, basically wireheads our reward centers, and is a large part of why obesity is such a huge problem in the modern era.
Obesity is the first example of real-life wireheading, at least in a weak sense. So now that I think about it, I think TurnTrout is too optimistic about RL models not optimizing reward.