To some extent, I think it’s easy to pooh-pooh finding a robust reward function (not maximally robust, merely way better than the state of the art) when you’re not proposing a specific design for building an AI that does good things and not bad things. Not in the tone of “how dare you not talk about specifics,” but more like “I bet this research direction would have to look more orthodox when you get down to brass tacks.”
I thought so as well, but then checked Ajeya’s posts and realized many of them are significantly longer than this one, but still heavily commented. So I figured that people can just read this post if they want. I do think this is one of the most important posts I’ve ever written, and was surprised to see so few comments.
To some extent, I think it’s easy to pooh-pooh finding a robust reward function (not maximally robust, merely way better than the state of the art) when you’re not proposing a specific design for building an AI that does good things and not bad things. Not in the tone of “how dare you not talk about specifics,” but more like “I bet this research direction would have to look more orthodox when you get down to brass tacks.”
Someone saying ~this to me in an earlier draft of this post is actually why I wrote A shot at the diamond alignment problem. You’ll notice that I didn’t confront e.g. robust grading in that story, nor do the failure modes hinge on robust OOD grading by the reward function, nor do I think analogous challenges crop up elsewhere.
Yeah, fair enough. But I think stories about the diamond-maximizer, or value-child, do rely on robustness.
I’d split up “OOD performance” into two extremes. One extreme, let’s call it “out of the generator” is situations that have no coherent reason to happen in our universe—they’re thought experiments that you only find by random sampling, or by searching for adversarial examples, or other silly processes. E.g. a sequence of flashing lights and clicks that brainwashes you into liking to kick puppies. The other, “out of the dataset,” is things that are heavily implied to exist (or have straightforward ways of coming about) by the training data, but aren’t actually in the training data. E.g. a sequence of YouTube videos that teach you how to knit. Or like how avocado chairs weren’t in Dall-E’s training data.
When training your real-world agent with supervised RL, you have to be grading it somehow, and that grading process is constantly going to be presented with new inputs, which weren’t in the training dataset until now but are logical consequences of a lawful universe, and could be predicted to happen by an AI that’s modeling that universe. On these data points, you want your reward function to robustly keep being about the thing you want to teach to the AI, rather than having bad behavior that depends on the reward function’s implementation details (e.g. failing to reward the agent for being near a diamond when the AI has hidden the diamond from the evaluation system).
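To make that kind of situation concrete, here is a minimal, self-contained sketch (the State / camera_reward setup is a hypothetical illustration, not anyone's proposed design): a reward function implemented off a "camera" view behaves fine on the vetted training-style states, but on a logically possible new state where the agent has hidden the diamond, its output tracks an implementation detail (visibility) rather than the thing the designer cared about (the diamond).

```python
# Toy sketch (hypothetical setup): a reward function graded only from a
# "camera" view fails on a novel state the designer never anticipated --
# the diamond is present but hidden from the sensor.

from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    diamond_present: bool   # ground truth the designer cares about
    diamond_visible: bool   # what the evaluation system can actually see

def camera_reward(state: State) -> float:
    # Implementation detail: rewards what the camera sees, not what is true.
    return 1.0 if state.diamond_visible else 0.0

# States like those in the "vetted" training distribution:
vetted = [State(True, True), State(False, False)]

# A logically possible state the grading process will eventually meet online:
novel = State(diamond_present=True, diamond_visible=False)  # diamond hidden

for s in vetted + [novel]:
    print(s, "->", camera_reward(s))
# The novel state gets reward 0.0 despite the diamond being present: the
# reward's behavior now depends on the reward function's implementation
# details rather than on the thing the designer wanted to teach.
```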
I’m worried that sentiments like “there has to be robustness” can sometimes lose track of what level of robustness we’re talking about—what specific situations must be graded “robustly” in order for the training story to work. [EDIT although I think you take some steps to avoid this failure, in your comment here!] To further ward against this failure mode—for concreteness, what difficult situations do you think might need to be rewarded properly in order for the diamond-AI to generalize appropriately?
On these data points, you want your reward function to robustly keep being about the thing you want to teach to the AI, rather than having bad behavior that depends on the reward function’s implementation details (e.g. failing to reward the agent for being near a diamond when the AI has hidden the diamond from the evaluation system).
Actually, I think I disagree. Why do you think this?
Actually, I think I disagree. Why do you think this?
Maybe it’s something like too many natural abstractions. When the number of natural abstractions is small, you can just point in the right general direction, and then regularize your way to teaching the exact natural abstraction that’s closest. When the number of abstractions is large, or you’re trying to point to something very complicated, if you just point in the right general direction, there will be a natural abstraction almost wherever you point, and regularization won’t move you towards something that seems privileged to humans.
Closeness of natural abstractions also makes it easier for gradient descent to change your goals—shards are now on a continuum, rather than moated off from each other. The typical picture of value change due to stimulus is something like heroin, which hijacks the reward center in a way that we typically picture as “creating new desires related to heroin.” But if shards can be moved around by gradient descent, then you can have a different-looking kind of value change. An example might be updating a political tenet because the culture around you changes: the change is still somewhat resisted by the prior shard, but it’s hard to avoid, because each step is small and the gradient updates are a consequence of a deep part of the environment. It doesn’t have to lead to internal disagreement; at each point in time, one’s values are just slowly changing in place.
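As a toy illustration of values "slowly changing in place" (purely hypothetical numbers, not a model of any real training setup), repeated small gradient-style updates toward a slowly drifting target move a parameter a long way even though no single step looks like a hijacking event:

```python
# Toy illustration (hypothetical numbers): small updates from a slowly
# shifting environment move a "value" parameter far from where it started,
# without any single dramatic hijacking step.

value = 0.0            # stands in for where a shard currently "points"
resistance = 0.9       # the prior shard partially resists each change
lr = 0.05

for step in range(1000):
    cultural_target = step / 1000.0          # the environment drifts slowly
    gradient = cultural_target - value       # pull toward the current target
    value += lr * (1 - resistance) * gradient
    # No step looks like "heroin hijacks the reward center"; each update is tiny.

print(round(value, 3))  # ends up far from 0.0: the value changed in place
```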
So information leakage that reflects unintended optima of the actual evaluation function is bad for alignment with vanilla RL. E.g. systematic classification errors, or the evaluation not working for a few minutes when some software freezes, or systematic biases in what kind of diamonds you’re showing it, or accidentally showing it some cubic zirconium. This is going to update the agent’s values toward something with more unintended optima, although not necessarily exactly the same unintended optima as were in the reward evaluation process.
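A small sketch of that kind of leakage (the grader and materials are hypothetical, chosen only to illustrate the point): a grading process with a systematic error rewards some trajectories that are not actually about diamonds, so the training signal itself now carries unintended optima.

```python
# Toy sketch (hypothetical grader): a systematic evaluation error (cubic
# zirconia scored as diamond) rewards some non-diamond episodes, which is
# the kind of leakage that can install unintended optima in the agent.

import random
random.seed(0)

def noisy_grader(material: str) -> float:
    if material == "diamond":
        return 1.0
    if material == "cubic_zirconia":
        return 1.0      # systematic error: the evaluator can't tell them apart
    return 0.0          # e.g. glass is correctly rejected

episodes = random.choices(["diamond", "cubic_zirconia", "glass"], k=20)
rewarded = [m for m in episodes if noisy_grader(m) > 0]
print(rewarded)
# cubic_zirconia shows up among the rewarded episodes, so the training data
# now pushes toward "looks like a diamond to the grader" rather than
# "is a diamond".
```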
Even given all of this, why should reward function “robustness” be the natural solution to this? Like, what if you get your robust reward function and you’re still screwed? It’s very nonobvious that this is how you fix things.
Even given that we need on-trajectory reward “robustness” (i.e. very carefully reward all in-fact-experienced situations relating to diamonds, until the AI becomes smart enough to steer its own training), this is extremely different from a forall-across-counterfactuals robust grading guarantee.
So even given both points, I would conclude “yup, shard theory reasoning shows I can dodge an enormous robust-grading sized bullet. No dealing with ‘nearest unblocked strategy’, here!” And that was the original point of dispute, AFAICT.
something with more unintended optima
What do you have in mind with “unintended optima”? This phrasing seems to suggest that alignment is reasonably formulated as a global optimization problem, which I think is probably not true in the currently understood sense. But maybe that’s not what you meant?
Even given all of this, why should reward function “robustness” be the natural solution to this? Like, what if you get your robust reward function and you’re still screwed? It’s very nonobvious that this is how you fix things.
Yeah, I sorta got sucked into playing pretend, here. I don’t actually have much hope for trying to pick out a concept we’d want just by pointing into a self-supervised world-model—I expect us to need to use human feedback and the AI’s self-reflectivity, which means that the AI has to want human feedback, and be able to reflect on itself, not just get pointed in the right direction in a single push. In the pretend-world where you start out able to pick out some good “human values”-esque concept from the very start, though, it definitely seems important to defend that concept from getting updated to something else.
What do you have in mind with “unintended optima”?
Sort of like in Goodhart Ethology. In situations where humans have a good grasp on what’s going on, we can pick out some fairly unambiguous properties of good vs. bad ways the world could go. If the AI is doing search over plans, guided by some values that care about the world, then an “unintended optimum” of those values (in the sense I mean) will lead its search process to output plans that make the world go badly according to these human-obvious standards. (And an unintended optimum of the reward function rewards trajectories that are obviously bad).
And an unintended optimum of the reward function rewards trajectories that are obviously bad
It doesn’t seem relevant whether it’s an optimum or not. What’s relevant are the scalar reward values output on realized datapoints.
I emphasize this because “unintended optimum” phrasing seems to reliably trigger cached thoughts around “reward functions need to be robust graders.” (I also don’t like “optimum” of values, because I think that’s really not how values work in detail instead of in gloss, and “optimum” probably evokes similar thoughts around “values must be robust against adversaries.”)
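A tiny numerical contrast of the two notions (the reward function and numbers are contrived, purely for illustration): training only ever consumes the scalar rewards on realized datapoints, while the “unintended optimum” question is a claim about a global argmax that may never be visited.

```python
# Contrived reward (hypothetical numbers) with an "unintended optimum" far
# away from anything the agent actually experiences.

def reward(x: float) -> float:
    base = -(x - 3.0) ** 2
    exploit = 1e7 if x > 900 else 0.0   # exotic global optimum out in the tail
    return base + exploit

realized = [2.5, 3.1, 2.9, 0.5]                      # what actually happened
print([round(reward(x), 2) for x in realized])       # the only signal training sees

print(max(range(1000), key=reward))                  # 901: the global argmax,
# a forall-style fact about the function that never enters training unless a
# realized datapoint actually lands out there
```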
To some extent, I think it’s easy to pooh-pooh finding a flapping wing design (not maximally flappy, merely way better than the best birds) when you’re not proposing a specific design for building a flying machine that can go to space. Not in the tone of “how dare you not talk about specifics,” but more like “I bet this chemical propulsion direction would have to look more like birds when you get down to brass tacks.”
Wait, but surely RL-developed shards that work like human values are the biomimicry approach here, and designing a value learning scheme top-down is the modernist approach. I think this metaphor has its wires crossed.
I wasn’t intending for a metaphor of “biomimicry” vs “modernist”.
(Claim 1) Wings can’t work in space because there’s no air. The lack of air is a fundamental reason for why no wing design, no matter how clever it is, will ever solve space travel.
If TurnTrout is right, then the equivalent statement is something like (Claim 2) “reward functions can’t solve alignment because alignment isn’t maximizing a mathematical function.”
The difference between Claim 1 and Claim 2 is that we have a proof of Claim 1, and therefore don’t bother debating it anymore. With Claim 2, we only have an arbitrarily long list of examples of reward functions being gamed, exploited, or otherwise failing in spectacular ways, but no general proof that reward functions will never work. So we keep arguing about a Sufficiently Smart Reward Function That Definitely Won’t Blow Up, as if that is a thing that can be found if we try hard enough.
As of right now, I view “shard theory” as something like a high-level discussion of chemical propulsion without the designs for a rocket or a gun. I see the novelty of it, but I don’t understand how you would build a device that can use it. Until someone can propose actual designs for hardware or software that would implement “shard theory” concepts without just becoming an obfuscated reward function prone to the same failure modes as everything else, it’s not incredibly useful to me. However, I think it’s worth engaging with the idea, because if it’s correct then other research directions might be dead ends.
Does that help explain what I was trying to do with the metaphor?
Until someone can propose actual designs for hardware or software that would implement “shard theory” concepts without just becoming an obfuscated reward function prone to the same failure modes as everything else, it’s not incredibly useful to me. However, I think it’s worth engaging with the idea, because if it’s correct then other research directions might be dead ends.
Yeah, but on the other hand, I think this is looking for essential differences where they don’t exist. I made a comment similar to this on the previous post. It’s not like one side is building rockets and the other side is building ornithopters—or one side is advocating building computers out of evilite, while the other side says we should build the computer out of alignmentronium.
“reward functions can’t solve alignment because alignment isn’t maximizing a mathematical function.”
Alignment doesn’t run on some nega-math that can’t be cast as an optimization problem. If you look at the example of the value-child who really wants to learn a lot in school, I admit it’s a bit tricky to cash this out in terms of optimization. But if the lesson you take from this is “it works because it really wants to succeed, this is a property that cannot be translated as maximizing a mathematical function,” then I think that’s a drastic overreach.
I realize that my position might seem increasingly flippant, but I really think it is necessary to acknowledge that you’ve stated a core assumption as a fact.
Alignment doesn’t run on some nega-math that can’t be cast as an optimization problem.
I am not saying that the concept of “alignment” is some bizarre metaphysical idea that cannot be approximated by a computer because something something human souls etc., or some other nonsense.
However, the assumption that “alignment is representable in math” directly implies “alignment is representable as an optimization problem” seems potentially false to me, and I’m not sure why you’re certain it is true.
There exist systems that can be 1.) represented mathematically, 2.) perform computations, and 3.) do not correspond to some type of min/max optimization, e.g. various analog computers or cellular automata.
I don’t think it is ridiculous to suggest that what the human brain does is 1.) representable in math, 2.) in some type of way that we could actually understand and re-implement on hardware/software systems, and 3.) not as an optimization problem where there exists some reward function to maximize or some loss function to minimize.
There exist systems that can be 1.) represented mathematically, 2.) perform computations, and 3.) do not correspond to some type of min/max optimization, e.g. various analog computers or cellular automata.
You don’t even have to go that far. What about, just, regular non-iterative programs? Are type(obj) or json.dump(dict) or resnet50(image) usefully/nontrivially recast as optimization programs? AFAICT there are a ton of things that are made up of normal math/computation and where trying to recast them as optimization problems isn’t helpful.
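For what it’s worth, here is a concrete instance of the kind of system being pointed at: an elementary cellular automaton (Rule 110, which is even Turing-complete). It is fully specified mathematically and performs nontrivial computation, but its update rule is a plain lookup; nothing in the dynamics is phrased as minimizing a loss or maximizing a reward. (This is just an illustration of the category, not a claim about brains or alignment.)

```python
# Elementary cellular automaton, Rule 110: math, computation, no objective.

RULE = 110  # the 8-bit lookup table encoding the update rule

def step(cells):
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        pattern = (left << 2) | (center << 1) | right   # 3-bit neighborhood
        out.append((RULE >> pattern) & 1)               # look up the new cell
    return out

row = [0] * 31 + [1] + [0] * 31   # single live cell in the middle
for _ in range(16):
    print("".join(".#"[c] for c in row))
    row = step(row)
```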
I think this post was potentially too long :P
Have you read A shot at the diamond alignment problem? If so, what do you think of it?