A confusion: it seems that Eliezer views research that is predictable as basically-useless. I think I don’t understand what “predictable” means here. In what sense is expected utility quantilization not predictable?
Maybe the point is that coming up with the concept is all that matters, and the experiments that people usually do don’t matter because after coming up with the concept the experiments are predictable? I’m much more sympathetic to that, but then I’m confused why “predictable” implies “useless”; many prosaic alignment papers have as their main contribution a new algorithm, which seems like a similar type of thing as quantilization.
An example that springs to my mind is that Abram wrote a blog post in 2018 mentioning the “easy problem of wireheading”. He described both the problem and its solution in like one sentence, and then immediately moved on to the harder problems.
Later on, DeepMind did an experiment that (in my assessment) mostly just confirmed that what Abram had said was correct.
For the record, I don’t think that particular DeepMind experiment was zero value, for various reasons. But at the same time, I think that Abram wins hands-down on the metric of “progress towards AI alignment per researcher-hour”, and this is true at both the production and consumption end (I can read Abram’s one sentence much much faster than I can skim the DeepMind paper).
If we had a plausible-to-me plan that gets us to safe & beneficial AGI, I would be really enthusiastic about going back and checking all the assumptions with experiments. That’s how you shore up the foundations, flesh out the details, start developing working code and practical expertise, etc. etc. But I don’t think we have such a plan right now.
Also, there are times when it’s totally unclear a priori what an algorithm will do, no matter how much you think about it, and then obviously the experiments are super useful.
But at the end of the day, I feel like there are experiments that are happening not because it’s the optimal thing to do for AI alignment, but rather because there are very strong pro-experiment forces that exist inside CS / ML / AI research in academia and academia-adjacent labs.
That’s a good example, thanks :)
EDIT: To be clear, I don’t agree with all of it, but I do think this is a good example of what someone might mean when they say work is “predictable”.
An example: when I first heard the Ought experiments described, I was pretty highly confident how they’d turn out—people would mostly fail to coordinate on any problem without an already-very-obvious factorization. (See here for the kinds of evidence informing that high confidence, though applied to a slightly different question. See here and here for the more general reasoning/world models which underlie that prediction.) From what I’ve heard of the experiments, it seems that that is indeed basically what happened; therefore the experiments provided approximately-zero new information to my model. They were “useless” in that sense.
(I actually think those experiments were worth running just on the small chance that they’d find something very high value, or more likely that the people running them would have some high-value insight, but I’d still say “probably useless” was a reasonable description beforehand.)
I don’t know if Eliezer would agree with this particular example, but I think this is the sort of thing he’s gesturing at.
That one makes sense (to the extent that Eliezer did confidently predict the results), since the main point of the work was to generate information through experiments. I thought the “predictable” part was also meant to apply to a lot of ML work where the main point is to produce new algorithms, but perhaps it was just meant to apply to things like Ought.
I actually think this particular view is worth fleshing out, since it seems to come up over and over again in discussions of what AI alignment work is valuable (versus not).
For example, it does seem to me that >80% of the work in actually writing a published paper (at least amongst papers at CHAI) (EDIT: no longer believe this on reflection, see Rohin’s comment below) involves doing work whose results are predictable to the author once they have the concept (for example, actually getting your algorithm to run, writing code for experiments, running said experiments, writing up the results into a paper, etc.).
This just doesn’t match my experience at all. Looking through my past AI papers, I only see two papers where I could predict the results of the experiments on the first algorithm I tried at the beginning of the project. The first one (benefits of assistance) was explicitly meant to be a “communication” paper rather than a “research” paper (at the time of project initiation, rather than in hindsight). The second one (Overcooked) was writing up results that were meant to be the baselines against which the actual unpredictable research (e.g. this) was going to be measured; it just turned out that that was already sufficiently interesting to the broader community.
(Funny story about the Overcooked paper; we wrote the paper + did the user study in ~two weeks iirc, because it was only two weeks before the deadline that we considered that the “baseline” results might already be interesting enough to warrant a conference paper. It’s now my most-cited AI paper.)
(I’m also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong but nonetheless the first thing we tried didn’t work. And in fact we did have to make slight tweaks, like annealing from self-play to BC-play over the course of training, to get our algorithm to work.)
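(For concreteness, here is a minimal sketch of what that kind of annealing can look like; the linear schedule and all names are illustrative assumptions on my part, not the actual Overcooked training code.)

```python
import random

def bc_partner_prob(step: int, total_steps: int) -> float:
    """Probability of pairing with the behavior-cloned (BC) human model,
    annealed linearly from 0 (pure self-play) to 1 (pure BC-play)."""
    # Illustrative schedule only; not the actual one used in the paper.
    return min(1.0, step / total_steps)

def sample_partner(step: int, total_steps: int, self_play_copy, bc_model):
    """Choose the training partner for this episode: early on, mostly a copy
    of the learning agent; later, mostly the fixed BC human model."""
    if random.random() < bc_partner_prob(step, total_steps):
        return bc_model
    return self_play_copy
```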
A more typical case would be something like Preferences Implicit in the State of the World, where the conceptual idea never changed over the course of the project, but:
The first hacky / heuristic algorithm we wrote down didn’t work in some cases. We analyzed it a bunch (via experiments) to figure out what sorts of things it wasn’t capturing.
When we eventually had a much more elegant derived-from-math algorithm, I gave a presentation at CHAI on some experimental results. There were some results I was confused by, where I expected something different from what we got, and I mentioned this. (Specifically, these were the results in the case where the robot had a uniform prior over the initial state at time -T.) Many people in the room (including at least one person from MIRI) thought for a while and gave their explanation for why this was the behavior you should expect. (I’m pretty sure some even said “this isn’t surprising” or something along those lines.) I remained unconvinced. Upon further investigation we found out that one of Ziebart’s results that we were using had extremely high variance in our setting, since we only ever had one initial state, rather than sampling several, which would give better coverage of the uniform prior. We derived a better version of Ziebart’s result, implemented that, and voilà, the results were now what I had originally expected.
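(A toy illustration of the variance point, made up by me rather than taken from the paper: a quantity defined as an expectation under a uniform prior over initial states is computed exactly if you can average over the whole support, but a single sampled initial state can land anywhere in the range.)

```python
import random

# Toy example; not the actual computation from the paper.
def feature_count(initial_state: int) -> float:
    """Toy stand-in for a quantity that depends on the initial state."""
    return float(initial_state ** 2)

states = list(range(10))  # support of the uniform prior over initial states

# Exact expectation under the uniform prior: average over the whole support.
exact = sum(feature_count(s) for s in states) / len(states)   # 28.5

# One-sample Monte Carlo estimate, analogous to only ever seeing one initial state.
single = feature_count(random.choice(states))                 # anywhere from 0.0 to 81.0

print(exact, single)
```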
It took about… 2 weeks (?) between getting this final version of the algorithm and submitting a paper, constituting maybe 15-20% of the total work. Most of that was what I’d call “communication” rather than “research”, e.g. creating another environment to better demonstrate the algorithm’s properties, writing up the paper clearly, making good figures, etc. Good communication seems clearly worth putting effort into.
If you want a deep learning example, consider Learning What To Do by Simulating the Past. The biggest such gap between idea and implementation was the curriculum—it was not part of the original pseudocode I had written down and was crucial to getting the method to work.
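(Purely as an illustrative sketch, not the paper’s actual curriculum: “adding a curriculum” usually amounts to wrapping the training loop so that some difficulty knob, here a hypothetical horizon parameter, is ramped up over the course of training.)

```python
# Hypothetical sketch; not the paper's actual curriculum.
def horizon_for_epoch(epoch: int, num_epochs: int, max_horizon: int) -> int:
    """Ramp the task horizon up from 1 to max_horizon over training."""
    frac = (epoch + 1) / num_epochs
    return max(1, round(frac * max_horizon))

def train_with_curriculum(model, update_fn, make_batch, num_epochs=100, max_horizon=20):
    """Generic curriculum wrapper: train on easier (shorter-horizon) versions
    of the problem first, then lengthen the horizon toward the full task."""
    for epoch in range(num_epochs):
        horizon = horizon_for_epoch(epoch, num_epochs, max_horizon)
        batch = make_batch(horizon)  # hypothetical data generator for this horizon
        update_fn(model, batch)
    return model
```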
You might look at this and think, “but the conceptual idea predicted the experiments that were eventually run!” I mean, sure, but then I think your crux is not “were the experiments predictable”; rather, it’s “is there any value in going from a conceptual idea to a working implementation”.
It’s also pretty easy to predict the results of experiments in a paper, but that’s because you have the extra evidence that you’re reading a paper. This is super helpful:
(1) The experiments are going to show the algorithm working. They wouldn’t have published the paper otherwise.
(2) The introduction, methods, etc. are going to tell you exactly what to expect when you get to the experiments. Even if the authors initially thought the algorithm was going to improve the final score in Atari games, if the algorithm instead improved sample efficiency without changing final score, the introduction is going to be about how the algorithm was inspired by sample-efficient learning in humans or whatever.
This is also why I often don’t report on experiments in papers in the Alignment Newsletter; usually the point is just “yes, the conceptual idea worked”.
I don’t know if this is actually true, but one cynical take is that people are used to predicting the results of finished ML work, where they implicitly use (1) and (2) above, and incorrectly conclude that the vast majority of ML experiments are ex ante predictable. And now that they have to predict the outcome of Redwood’s project, before knowing that a paper will result, they implicitly realize that no, it really could go either way. And so they incorrectly conclude that, among ML experiments, Redwood’s project is a rare unpredictable one.
Thanks for the detailed response.
On reflection, I agree with what you said—I think the amount of work it takes to translate a nice-sounding idea into anything that actually works on an experimental domain is significant, and what exact work you need is generally not predictable in advance. In particular, I resonated a lot with this paragraph:
“I’m also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong but nonetheless the first thing we tried didn’t work.”
At least from my vantage point, “having a strong story for why a result should be X” is insufficient for ex ante predictions of what exactly the results would be. (Once you condition on that being the story told in a paper, however, the prediction task does become trivial.)
I’m now curious what the MIRI response is, as well as how well their intuitive judgments of the results are calibrated.
EDIT: Here’s another toy model I came up with: you might imagine there are two regimes for science—an experiment-driven regime and a theory-driven regime. In the former, it’s easy to generate many “plausible sounding” ideas and hard to be justified in holding on to any of them without experiments. The role of scientists is to be (low-credence) idea generators and idea testers, and the purpose of experimentation is primarily to discover new facts that are surprising to the scientist finding them. In the latter regime, the key is to come up with the right theory/deep model of AI that predicts lots of facts correctly ex ante, and then the purpose of experiments is to convince other scientists of the correctness of your idea. Good scientists in the second regime are those who discover the right deep models much faster than others. Obviously this is an oversimplification, and no one believes it’s only one or the other, but I suspect both MIRI and Stuart Russell lie more on the “have the right idea, and the paper experiments are there to convince others/apply the idea in a useful domain” view, while most ML researchers hold the more experimentalist view of research?
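(One way to put rough numbers on this toy model, which is my own framing rather than anything claimed in the thread: if an experiment perfectly reveals whether an idea works, its expected information gain is just the entropy of your prior credence, which is large for the low-credence ideas of the experiment-driven regime and small for the high-credence ideas of the theory-driven regime.)

```python
import math

# My own toy quantification of the two-regimes model; not from the thread.
def expected_bits_from_experiment(prior: float) -> float:
    """Expected information gain (bits) from an experiment that perfectly
    reveals whether the idea works, given your prior credence that it does."""
    if prior in (0.0, 1.0):
        return 0.0
    return -(prior * math.log2(prior) + (1 - prior) * math.log2(1 - prior))

# Experiment-driven regime: many cheap ideas, low credence in each one.
print(expected_bits_from_experiment(0.2))   # ~0.72 bits: the experiment does real work
# Theory-driven regime: the researcher is already fairly confident ex ante.
print(expected_bits_from_experiment(0.95))  # ~0.29 bits: the experiment mostly persuades others
```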