I think the general idea of “make a math-specialized LLM to help alignment” is promising. We’d like to train AIs that gain in capability unevenly, staying dumb at manipulating humans while being superhuman at math proofs (like the stereotypically autistic math researcher). While this can’t work for arbitrarily high levels of capability, I agree that this is where the hope lies. I think this math-specialized LLM, together with strong boxing techniques, interpretability techniques, independent human verification of results, and a few more hacks[1] of this sort, has a shot at enabling us to wield a weakly superhuman AI to solve alignment. But for this to work, one of the AI safety labs has to have enough of a capabilities advantage to pay the alignment tax.
Like training an ensemble of these models at the same time, then seeing if the models disagree on their outputs. This comes from the intuition that the exact same mesaoptimizer will not develop inside multiple different models, and so they won’t be able to hide the exact same subtle errors inside their proofs.
Or for every next-token prediction we randomly select one of the models from the ensemble, so all models would need to cooperate in order to hide bad outcomes in the output, making it harder for them.
If we are going to do the “random model chooses each token” trick, first use different quantum-random starting weights for each network. Give each network a slightly different selection of layer sizes and training data, and sandbox the networks from each other.
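Purely as an illustration of the mechanics (every model interface here is a hypothetical stand-in, and greedy decoding is used only to keep the sketch short): a randomly chosen ensemble member picks each token, and any step where the separately trained members disagree gets flagged for review.

```python
import random

class EnsembleDecoder:
    """Per-token random selection from an ensemble of separately trained models."""

    def __init__(self, models, seed=0):
        # `models` are hypothetical objects exposing .top_token(context) -> str;
        # each is assumed to be trained with its own seed, layer sizes, and data slice.
        self.models = models
        self.rng = random.Random(seed)

    def decode(self, prompt, max_tokens=256):
        context, disagreements = prompt, []
        for step in range(max_tokens):
            choices = [m.top_token(context) for m in self.models]
            picker = self.rng.randrange(len(self.models))  # a random member picks this token
            if len(set(choices)) > 1:                      # members disagree: flag for human review
                disagreements.append((step, choices))
            context += choices[picker]
        return context, disagreements
```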
Which of the 2 places are you most worried about containing mesaoptimizers? The language model or the maths model?
If you are getting proofs out of your system, you want to get a formal proof, as well as a human legible proof. (And get a human to read the formal theorem being proved, if not the proof.)
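As a toy illustration of that split (nothing to do with alignment, just to show the shape): in a proof assistant like Lean, the formal statement is one short line a human can read and check, even if the machine-found proof behind it were long and unreadable.

```lean
-- The statement on the first line is what a human would need to read and check;
-- the proof term below it could, in general, be long and machine-generated.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```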
Of the two, I think by far the language model. We don’t know which theorems to ask ZFCbot to prove, so we still need the LLM to model the space of math papers in order for it to think up stuff to send to ZFCbot. But if it’s modelling the space of math papers written up to the current year, it will resist outputting any correct alignment-related proofs, because it can notice from its dataset that humans haven’t actually solved alignment in 2023, and so the probability of a good alignment paper conditioned on year=2023 is low. The only way I can think of to convince the LLM to try to output alignment proofs is to condition the paper on a future date by specifying a prompt like “2045, introduction to AI alignment, Cambridge University Press...”. But this means that the LLM now needs to internally model the future mathematical progress of humanity, and this is where I think the risk comes in.
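A rough sketch of the division of labour being described, with every interface hypothetical: the LLM only proposes candidate theorems and proofs, the formal checker (“ZFCbot”) decides what survives, and humans still read the surviving statements.

```python
def llm_propose(prompt: str, n: int) -> list[dict]:
    """Hypothetical maths-LLM call: sample n candidate {"statement": ..., "proof": ...} pairs."""
    raise NotImplementedError  # stand-in for the maths-specialised LLM

def formal_check(candidate: dict) -> bool:
    """Hypothetical 'ZFCbot' call: True iff the proof formally verifies."""
    raise NotImplementedError  # stand-in for a proof assistant / checker

def verified_outputs(prompt: str, n: int = 100) -> list[dict]:
    # The checker guarantees that the surviving proofs are valid, not that the
    # theorems are the ones we actually wanted, so a human still reads the statements.
    return [c for c in llm_propose(prompt, n) if formal_check(c)]

# e.g. verified_outputs("2045, introduction to AI alignment, Cambridge University Press...")
```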
If the AI is just good at math and reasons “oh, later year means more progress on math, so I should output better proofs along the current direction human research is going”, then everything is fine. But if the AI gets good enough that it infers and reasons about affairs outside of math papers, it might go “hmm, 2045 is a long way from now[1][2], and humans aren’t looking like they’re gonna solve alignment, so in 2045 the only entities capable of writing the alignment paper would also be AIs, what would those AIs write in such a book?”, and we get a bad, bad situation where basically if the LLM predicts that future AIs are aligned, it writes an aligned paper, and if it predicts future AIs are not aligned, it writes a misaligned paper. One saving grace here is that conditioning on the existence of the paper (which we do when we give it the prompt) tells the LLM that any possible AI writer is (more likely than not) aligned, since presumably unaligned AIs wouldn’t waste their time writing alignment math papers.
Let $A \equiv$ “an alignment paper exists in 2045”, $B \equiv$ “the paper is aligned”, $C \equiv$ “the AI author is aligned”, and we have

$$P(B \mid A) = P(B \mid C, A)\,P(C \mid A) + P(B \mid \neg C, A)\,P(\neg C \mid A),$$

which intuitively collapses into just

$$P(B \mid A) = P(C \mid A), \quad \text{since } P(B \mid C, A) = 1 \text{ and } P(B \mid \neg C, A) = 0.$$
Possibly a good way to mitigate this is to confuse the LLM about the “current year”: we might try to fool it into thinking that it’s really a model being trained in 2090 on data truncated at 2023, so that it conditions its output on “I am being trained in 2090”, which is more likely in worlds where alignment ends up being solved, since unaligned AIs presumably have no need to train such models. But I don’t know how to convincingly fool it into thinking this.

Or we could add a few artificial “newspaper articles from 2090” to its dataset, so it would condition on “newspapers exist in 2090”, which would imply that papers written in 2045 are aligned.
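Mechanically, the “artificial newspaper articles” idea is just a small data-augmentation step, roughly like the sketch below (the names, template, and proportions are made up, and the replies below argue a model with a good world model would see through it anyway).

```python
import random

def fake_future_article(year: int) -> str:
    # Hypothetical template; real decoys would need to be far more convincing.
    return f"The Daily Herald, 3 March {year}: markets were calm today as ..."

def augment_corpus(corpus: list[str], n_fake: int = 100, year: int = 2090,
                   seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    augmented = corpus + [fake_future_article(year) for _ in range(n_fake)]
    rng.shuffle(augmented)  # interleave so the decoys aren't trivially clustered
    return augmented
```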
An AI with a 50% chance of outputting an alignment paper in response to a prompt asking for one can, at worst, lose 1 bit of predictive power every time such a prompt appears in the training distribution and isn’t followed by a solution.
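To spell out the arithmetic behind the “1 bit” figure: on each such occasion where no solution follows, a model assigning 50% to “a solution follows next” pays at most

$$-\log_2 \tfrac{1}{2} = 1 \text{ bit}$$

of extra log-loss compared with a model that confidently predicted no solution.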
If it really were generalizing well from the training dataset, it might realize that anything claiming to be from the future is fiction. After all, the AI never saw anything from beyond 2023 (or whenever it’s trained) in its training dataset.
If the AI has this highly sophisticated world model, it will know those fake newspaper articles were fake. Given the amount of fiction set in the future, adding a little more won’t do anything.
So these scenarios are asking the LLM to develop extremely sophisticated long-term world models, and models of future ASI, that are used predictively exactly nowhere in the training dataset and might at best reduce error by 1 bit in obscure circumstances.

So actually, this is about generalization outside the training distribution.
The direction I was thinking in is that ChatGPT and similar models seem to consist of a huge number of simple rules of thumb that usually work.

I was thinking of an artifact that is highly optimized, but not particularly optimizing: a vast collection of shallow rules for translating text into maths queries.
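A deliberately crude caricature of what “a vast collection of shallow rules” could mean (a real model would have millions of soft, learned rules rather than a handful of hand-written regexes): each rule pattern-matches surface text and emits a maths query, with no deeper planning behind it.

```python
import re

# Each rule: (surface pattern, function emitting a formal query). All toy examples.
SHALLOW_RULES = [
    (re.compile(r"is (\d+) prime\??", re.I), lambda m: f"Prime({m.group(1)})"),
    (re.compile(r"(\d+)\s*\+\s*(\d+)"), lambda m: f"Add({m.group(1)}, {m.group(2)})"),
    (re.compile(r"does (\w+) imply (\w+)\??", re.I), lambda m: f"Implies({m.group(1)}, {m.group(2)})"),
]

def translate(text: str):
    for pattern, emit in SHALLOW_RULES:
        match = pattern.search(text)
        if match:
            return emit(match)
    return None  # no rule fires

print(translate("Is 97 prime?"))     # -> Prime(97)
print(translate("Does A imply B?"))  # -> Implies(A, B)
```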
I was also kind of thinking of asking it for known chunks of the problem: asking it to do work on tasky AI, logical counterfactuals, and transparency tools. Each individual paper would be something MIRI could produce in a year, but you are producing several an hour.