Some quick thoughts:
It seems like most of your arguments apply equally well to the pair (human learning, human reward system) as they do to the pair (AI, human reward feedback). If anything, the human reward system is probably easier to exploit, since it’s much dumber and was optimized for our ancestral environment. Maybe address this connection directly and discuss why you expect the AI case to go so much worse than the human case?
Also, some of the arguments seem to fall flat when applied to humans. E.g., “Power-seeking policies would choose high-reward behaviors for instrumental reasons… It decreases the likelihood that gradient descent significantly changes the policy’s goals.”
If you’re a human with an internally represented goal of, say, petting as many dogs as possible, the optimal goal-preserving action is not to instrumentally maximize your total reward. Doing a ton of drugs would, I expect, just make you addicted to drugs, even if the goal you had in mind was dog petting.
Of course, humans do manipulate our own reward systems to preserve or change which internally represented goals we tend to pursue, but this tends to look like timed allocations of small amounts of reward, not a context-independent search for high reward.
You make reference to evolution specifying goals for humans. I think it’s more accurate to say that evolution specified a learning process which, when deployed in the right sort of environment, tends to form certain types of goals.
“Call a goal broadly-scoped if it applies to long timeframes, large scales, wide ranges of tasks, or unprecedented situations, and narrowly-scoped if it doesn’t. Broadly-scoped goals are illustrated by human behavior: we usually choose actions we predict will cause our desired outcomes even when we are in unfamiliar situations, often by extrapolating to more ambitious versions of the original goal.”
I think that most human behavior is in pursuit of very short-term goals. Even when a human is nominally acting in service of a long-term goal, the actual cognitive algorithms they execute will typically make individual decisions based on short-horizon subgoals. E.g., the reason I’m nominally commenting on this post is to improve the odds of alignment going well. However, the actual cognition I’m using to write my comment is focused on much shorter-term goals like “write a reasonable comment”, “highlight areas of disagreement/possible improvement”, or “avoid spending too much time on this comment”. Very little of my actual output is decided by computations that look like an argmax over comment content of P(alignment goes well | comment content).
When discussing how RL systems might acquire long-term goals, maybe discuss the common finding that current RL systems struggle to develop competency or goals over long time horizons?
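(To gesture at one reason this might hold, here's a minimal sketch of my own, not from the post, and only one contributing factor among several such as sparse rewards, exploration, and credit assignment: under the standard discounted-return objective, a reward arriving t steps in the future is weighted by gamma^t, so anything beyond roughly 1/(1 − gamma) steps barely influences what the policy is trained toward.)

```python
# Minimal sketch (my own illustration, not from the post): how much a reward
# t steps in the future contributes to the discounted return, for typical
# discount factors. Distant rewards are exponentially down-weighted, which is
# one reason long-horizon credit assignment is weak by default in current RL.

def discounted_weight(gamma: float, t: int) -> float:
    """Weight gamma**t given to a reward received t steps from now."""
    return gamma ** t

for gamma in (0.9, 0.99, 0.999):
    effective_horizon = 1 / (1 - gamma)  # rough timescale over which rewards still matter
    print(f"gamma={gamma}: effective horizon ~ {effective_horizon:.0f} steps, "
          f"weight at t=1000: {discounted_weight(gamma, 1000):.2e}")
```

E.g., with gamma = 0.99 a reward 1000 steps away is weighted by roughly 4e-5 of its nominal value, so the training signal tied to long-horizon outcomes is tiny unless something else (like explicit hierarchy or much larger gamma) compensates.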