I was confident that on this very site there would be an example of someone writing an essay with the framing device that it was a blog post from 5 years in the future. Sadly, I only had enough attention span to google “site:lesswrong.com from the future” and click the first link. It was a writing game called Wikipedia Articles from the Future.
My point with this is I’m real pessimistic about generating the AI alignment textbook from 100 years in the future with prompt engineering. Why expect that you’re going to get something far outside the training distribution, rather than the most likely continuation that could have come from the training distribution, which already contains people pretending to be from the future?
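To put the same point slightly more formally (a rough gloss, not a precise model): the sampled continuation is roughly

$$p(\text{continuation} \mid \text{prompt}) = \sum_{s} p(\text{continuation} \mid s)\, p(s \mid \text{prompt}),$$

where $s$ ranges over the kinds of sources that could have produced the prompt under the training distribution. Since that distribution contains plenty of present-day people roleplaying the future and no genuine future documents, $p(s = \text{roleplay} \mid \text{prompt})$ dwarfs $p(s = \text{actual future text} \mid \text{prompt})$, and the continuation inherits the roleplay.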
I would have been even more pessimistic before Minerva, but even so, we don’t have a couple billion tokens of training data of people completely solving close relatives of the alignment problem to fine-tune on. Minerva is still shocking to me, but it’s clear that an active ingredient in it is having a training distribution that demonstrates many copies of the reasoning you want the AI to do, and few copies of bad reasoning. And if you say the Alignment Forum is such a dataset I am going to laaaugh.
Thanks for your comment! I agree that we probably won’t be able to get a textbook from the future just by prompting a language model trained on human-generated texts.
As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, this could make the scheme work. I do think it could sometimes be beneficial (for predicting human-written text) for a model to attain superhuman reasoning skills, even if such text is all it is modeling. Though of course, this might still not happen in practice.
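As a very rough sketch of what I mean by conditioning on observations, here is the crudest possible version, where the "observations" are just prepended to the context of an off-the-shelf causal language model (the model name, observation strings, and prompt are placeholders for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Crude sketch: "conditioning on observations" implemented by prepending
# observation strings to the context of an off-the-shelf causal LM.
# Model name, observations, and prompt are illustrative placeholders.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

observations = (
    "Observation: the current year is 2122.\n"
    "Observation: the alignment problem was solved decades ago.\n"
)
prompt = observations + "Excerpt from the standard alignment textbook:\n"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.95,
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0], skip_special_tokens=True))
```

A version that could plausibly work would instead train the model on (observation, text) pairs, so that the prefix is treated as a true record of the world rather than as ordinary text someone wrote; that is what I mean by the model really believing the observations.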
Overall I’m more optimistic about using the model in an IDA-like scheme (iterated distillation and amplification). One way this might fail on capability grounds is if solving alignment is blocked by a lack of genius-level insights, and if it is hard to get a model to come up with such insights or speed up their discovery (e.g. due to a lack of training data containing comparable insights).
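For concreteness, here is a schematic of the kind of IDA-style loop I have in mind; `Model`, `amplify`, and `distill` below are toy placeholders, not a real implementation:

```python
from typing import Callable, Dict, List, Tuple

Question = str
Answer = str


class Model:
    """Stand-in for a trained question-answering model."""

    def __init__(self, answer_fn: Callable[[Question], Answer]):
        self._answer_fn = answer_fn

    def answer(self, question: Question) -> Answer:
        return self._answer_fn(question)


def amplify(model: Model, question: Question) -> Answer:
    """Amplification: split the question into sub-questions, answer them
    with the current model, and combine the sub-answers (trivially, here)."""
    sub_questions = [f"{question} [part {i}]" for i in range(2)]
    sub_answers = [model.answer(q) for q in sub_questions]
    return " / ".join(sub_answers)


def distill(examples: List[Tuple[Question, Answer]]) -> Model:
    """Distillation: train a cheaper model to imitate the amplified answers.
    Memorization stands in for training here."""
    table: Dict[Question, Answer] = dict(examples)
    return Model(lambda q: table.get(q, "don't know"))


def ida(initial_model: Model, questions: List[Question], rounds: int) -> Model:
    """Alternate amplification and distillation for a fixed number of rounds."""
    model = initial_model
    for _ in range(rounds):
        examples = [(q, amplify(model, q)) for q in questions]
        model = distill(examples)
    return model
```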