A function that tells your AI system whether an action looks good and is right virtually all of the time on natural inputs isn’t safe if you use it to drive an enormous search for unnatural (highly optimized) inputs on which it might behave very differently.
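As a toy illustration of that failure mode (everything here is made up for illustration: a million candidate “actions” with random true qualities, an evaluator that is almost always nearly right, plus a rare blind spot):

```python
import numpy as np

rng = np.random.default_rng(0)

# A million candidate actions, each with some true quality known only to us.
n = 1_000_000
true_quality = rng.normal(0.0, 1.0, size=n)

# The evaluator agrees with the truth up to small noise on essentially all inputs...
evaluator = true_quality + rng.normal(0.0, 0.1, size=n)

# ...except for a rare blind spot (~1 in 100,000 inputs) where it wildly
# overestimates quality. On a random input you'd almost never notice this.
blind_spot = rng.random(n) < 1e-5
evaluator[blind_spot] += 10.0

print(np.corrcoef(true_quality, evaluator)[0, 1])   # still ~0.995: "right virtually all of the time"

# Now use the evaluator to drive an enormous search:
best = int(np.argmax(evaluator))
print(evaluator[best], true_quality[best], blind_spot[best])
# The search reliably lands on a blind-spot input: the evaluator's score is
# huge, but the true quality is nothing special. Average-case accuracy told
# us nothing about what happens under this much optimization pressure.
```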
Yeah, you can have something which is “a brilliant out-of-the-box solution to a tricky problem” from the AI’s perspective, but is “reward-hacking / Goodharting the value function” from the programmer’s perspective. You say tomato, I say to-mah-to.
It’s tricky because there’s economic pressure to make AIs that will find and execute brilliant out-of-the-box solutions. We want our AIs to think outside some of the boxes (e.g. yes, you can repurpose a spare server rack frame as makeshift cable guides), but we want them to definitely stay inside other boxes (e.g. no, you can’t take over the world). Unfortunately, the whole point of “thinking outside the box” is that we’re not aware of all the boxes we’re thinking inside of.
All three of those examples are of the form “hey here’s a lot of samples from a distribution, please output another sample from the same distribution”, which is not the kind of problem where anyone would ever expect adversarial dynamics / weird edge-cases, right?
(…Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory. (Because you can sample from the distribution of (reward, trajectory) pairs conditional on high reward.))
Or maybe you were making a different point in this particular paragraph?
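To spell out the parenthetical with a toy sketch (all made up: 20 button presses as the “trajectory”, “press button 0” as the reward, and exact rejection sampling standing in for sampling from a learned model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy trajectory: 20 button presses, each one of 4 buttons.
def sample_trajectory():
    return rng.integers(0, 4, size=20)

# Toy reward: number of times button 0 was pressed.
def reward(traj):
    return int(np.sum(traj == 0))

# Plain sampling just imitates the base distribution: mean reward ~5.
print(np.mean([reward(sample_trajectory()) for _ in range(1000)]))

# "Conditional sampling on reward >= c", implemented here by rejection.
def sample_given_high_reward(c):
    while True:
        t = sample_trajectory()
        if reward(t) >= c:
            return t

# As c goes up, "sample from the distribution conditional on high reward"
# behaves exactly like "search for a high-reward trajectory", and the number
# of rejections needed grows the way the search would (try c = 15).
print(reward(sample_given_high_reward(10)))
```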
(…Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory. (Because you can sample from the distribution of (reward, trajectory) pairs conditional on high reward.))
Does this isomorphism actually go through? I know decision transformers kinda-sorta show how you can do optimization-through-conditioning in practice, but in theory the loss function which you use to learn the distribution doesn’t constrain the results of conditioning off-distribution, so I’d think you’re mainly relying on having picked a good architecture which generalizes nicely out of distribution.
Let D be the distribution of (reward, trajectory) pairs for every possible trajectory.
Split D into two subsets: D1 where reward > 7 and D2 where reward ≤ 7.
Suppose that, in D, 1-in-a-googol samples is in the subset D1, and all the rest are in D2.
(For example, if a videogame involves pressing buttons for 20 minutes, you can easily have less than 1-in-a-googol chance of beating even the first mini-boss if you press the buttons randomly.)
Now we randomly pick a million samples from D in an attempt to learn the distribution D. But (as expected, with overwhelming probability) it turns out that every single one of those million samples is, more specifically, in the subset D2.
Now consider a point X in D1 (its reward is 30). Question: Is X “out of distribution”?
Arguably no, because we set up a procedure to learn the distribution D, and D contains X.
Arguably yes, because when we ran the procedure, all the points actually used to learn the distribution were in D2, so we were kinda really only learning D2, and D2 doesn’t contain X.
(In the videogame example, if in training you never see a run that gets past the first mini-boss, then certainly intuitively we’d say that runs that do get past it, and that thus get to the next stage of the game, are OOD.)
Anyway, I was gonna say the opposite of what you said—sufficiently hard optimization via conditional sampling works in theory (i.e., if you could somehow learn D and conditionally sample on reward>7, it would work), but not in practice (because reward>7 is so hard to come by that you will never actually learn that part of D by random sampling).
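A back-of-the-envelope check on the “not in practice” half, using the made-up numbers from above:

```python
p_d1 = 1e-100      # chance a random trajectory lands in D1 (reward > 7)
n = 1_000_000      # training samples drawn from D

# P(at least one training sample lands in D1) = 1 - (1 - p_d1)**n,
# which for numbers like these is essentially n * p_d1:
print(n * p_d1)    # 1e-94
# So with overwhelming probability the training set contains zero information
# about D1, and whatever the model does when conditioned on reward > 7 comes
# from architectural extrapolation, not from anything it was fit to.
```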
The way I’d phrase the theoretical problem: when you fit a model to a distribution (e.g. by minimizing KL-divergence on a set of samples), you can often prove theorems of the form “the fitted distribution has such-and-such relationship to the true distribution”; e.g. you can compute confidence intervals for parameters and predictions in linear regression.
Often, the following conditions are sufficient for those theorems to hold:
The model is at an optimum
The model is flexible enough
The sample size is big enough
… because then if you have some point X you want to make predictions for, the sample size being big enough means you have a whole bunch of points in the empirical distribution that are similar to X. These points affect the loss landscape, and because you’ve got a flexible optimal model, that forces the model to approximate them well enough.
But this “you’ve got a bunch of empirical points dragging the loss around in relevant ways” part only works on-distribution, because you don’t have a bunch of empirical points from off-distribution data. Even if those off-distribution points technically form an exponentially small slice of the true distribution, that slice has only an exponentially small effect on the loss function, and therefore being at an optimal loss is only exponentially weakly informative about those points.
(Obviously this is somewhat complicated by overfitting, double descent, etc., but I think the gist of the argument goes through.)
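A small experiment in the spirit of this argument (made-up data; a high-degree polynomial standing in for “a flexible model at an optimum”): fit the same flexible model twice on independent samples from the same distribution, then compare predictions at an on-distribution point and at an off-distribution point.

```python
import numpy as np

def fit_once(seed):
    r = np.random.default_rng(seed)
    x = r.uniform(-3, 3, 200)                     # on-distribution inputs
    y = np.sin(x) + r.normal(0, 0.1, 200)
    # A flexible model at (its) optimum: degree-15 least-squares polynomial.
    return np.polynomial.Polynomial.fit(x, y, deg=15)

f1, f2 = fit_once(1), fit_once(2)

# On-distribution (x = 1): both fits are pinned down by nearby data and agree.
print(f1(1.0), f2(1.0), np.sin(1.0))

# Off-distribution (x = 8): the loss never cared about this region, so the
# two "optimal" models are free to disagree wildly there.
print(f1(8.0), f2(8.0))
```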
I guess it depends on whether one makes the cut between theory and practice with or without assuming that one has learned the distribution? I.e. I’m saying if you have a distribution D, take some samples E, and approximate E with Q, then you might be able to prove that samples from Q are similar to samples from D, but you can’t prove that conditioning on something exponentially unlikely in D gives you something reasonable in Q. Meanwhile you’re saying that conditioning on something exponentially unlikely in D is tantamount to optimization.
All three of those examples are of the form “hey here’s a lot of samples from a distribution, please output another sample from the same distribution”, which is not the kind of problem where anyone would ever expect adversarial dynamics / weird edge-cases, right?
I would expect some adversarial dynamics/weird edge cases in such cases. The International Obfuscated C Code Contest is a thing that exists, so a sequence predictor trained to produce content similar to an input distribution that included entrants to that contest will place some nonzero probability on the C code it writes actually containing a sneaky backdoor; and once the sneaky backdoor is there, it will place a fairly high probability on the next token it outputs being another part of the same program.
Such weird/adversarial outputs probably won’t be a super large part of your input space unless you’re sampling in such a way as to make them overrepresented, though.
Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory
Right. I think the optimists would say that conditional sampling works great in practice, and that this bodes well for applying similar techniques to more ambitious domains. There’s no chance of this image being in the Stable Diffusion pretraining set:
One could reply, “Oh, sure, it’s obvious that you can conditionally sample a learned distribution to safely do all sorts of economically valuable cognitive tasks, but that’s not the danger of true AGI.” And I ultimately think you’re correct about that. But I don’t think the conditional-sampling thing was obvious in 2004.
One could reply, “Oh, sure, it’s obvious that you can conditionally sample a learned distribution to safely do all sorts of economically valuable cognitive tasks, but that’s not the danger of true AGI.” And I ultimately think you’re correct about that. But I don’t think the conditional-sampling thing was obvious in 2004.
Idk. We already knew that you could use basic regression and singular vector methods to do lots of economically valuable tasks, since that was already being done in 2004. Conditional sampling “just” adds in the noise around these sorts of methods, so it stands to reason that this might work too.
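For what it’s worth, here’s the one-dimensional version of that claim (a joint Gaussian fit to made-up data): the conditional mean is exactly the 2004-style regression prediction, and “conditional sampling” is that prediction plus noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data with a linear relationship.
x = rng.normal(0.0, 1.0, 10_000)
y = 3.0 * x + rng.normal(0.0, 0.5, 10_000)

# Fit a joint Gaussian to (x, y).
cov = np.cov(x, y)
slope = cov[0, 1] / cov[0, 0]                      # same as the OLS slope
cond_var = cov[1, 1] - cov[0, 1] ** 2 / cov[0, 0]  # Var(y | x)

def predict(x0):
    """The regression point estimate: E[y | x = x0]."""
    return y.mean() + slope * (x0 - x.mean())

def conditional_sample(x0):
    """A sample from p(y | x = x0): the prediction, plus noise."""
    return predict(x0) + rng.normal(0.0, np.sqrt(cond_var))

print(predict(1.0))              # ~3.0
print(conditional_sample(1.0))   # ~3.0, plus roughly N(0, 0.5) noise
```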
Adding noise obviously doesn’t matter in one dimension, except to make the outcomes worse. The reason we use it for e.g. images is that the noise does matter in high-dimensional spaces: without it you end up with the single highest-probability outcome, which is itself out of distribution. So in a way it seems like a relatively minor fix to generalize something we already knew was profitable in lots of cases.
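A quick numerical version of the high-dimensional point (a standard Gaussian as a stand-in for an image model, with the dimension picked arbitrarily): the single highest-probability point is nothing like a typical sample.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000                                   # "number of pixels", chosen arbitrarily

# The highest-density point of N(0, I_d) is the all-zeros vector (the
# noise-free answer), but every actual sample sits ~sqrt(d) away from it.
samples = rng.normal(0.0, 1.0, size=(100, d))
norms = np.linalg.norm(samples, axis=1)
print(np.sqrt(d))                          # ~31.6
print(norms.min(), norms.max())            # all samples cluster tightly near 31.6
# Returning the mode instead of sampling would hand you a point that no sample
# from the distribution ever resembles, i.e. something out of distribution.
```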
On the other hand, I didn’t learn the probability thing until playing with some neural network ideas for outlier detection and learning they didn’t work. So in that sense it’s literally true that it wasn’t obvious (to a lot of people) back before deep learning took off.
And I can’t deny that people were surprised that neural networks could learn to do art. To me this became relatively obvious with early GANs, which were later than 2004 but earlier than most people updated on it.
So basically I don’t disagree but in retrospect it doesn’t seem that shocking.