“If nobody publishes anything, how will alignment get solved?” — sure, it’s harder for alignment researchers to succeed if they don’t communicate publicly with one another — but it’s not impossible. That’s what dignity is about.
Huh, I have the opposite intuition. I was about to cite that exact same “Death with dignity” post as an argument for why you are wrong; it’s undignified for us to stop trying to solve the alignment problem and stop publicly discussing it with each other, out of fear that some of our ideas might accidentally percolate into OpenAI and cause them to go slightly faster, and that this speedup might turn out to make the difference between victory and defeat. The dignified thing to do is think and talk about the problem.
Obviously keep working, but stop talking where people who are trying to destroy the world can hear. If you’re Neel Nanda and work for a company trying to destroy the world, consider not publishing anything else at all, or only publishing useless versions of your work, because your work being useful for resilient moral alignment depends on a long chain of things that publishing it makes near-impossible.
I think there are approximately zero people actively trying to take actions which, according to their own world model, are likely to lead to the destruction of the world. As such, I think it’s probably helpful on the margin to publish stuff of the form “model internals are surprisingly interpretable, and if you want to know if your language model is plotting to overthrow humanity there will probably be tells, here’s where you might want to look”. More generally “you can and should get better at figuring out what’s going on inside models, rather than treating them as black boxes” is probably a good norm to have.
I could see the argument against, for example if you think “LLMs are a dead end on the path to AGI, so the only impact of improvements to their robustness is increasing their usefulness at helping to design the recursively self-improving GOFAI that will ultimately end up taking over the world” or “there exists some group of alignment researchers that is on track to solve both capabilities and alignment such that they can take over the world and prevent anyone else from ending it” or even “people who think about alignment are likely to have unusually strong insights about capabilities, relative to people who think mostly about capabilities”.
I’m not aware of any arguments that alignment researchers specifically should refrain from publishing that don’t have some pretty specific upstream assumptions like the above though.
Daniel, your interpretation is literally contradicted by Eliezer’s exact words. Eliezer defines dignity as that which increases our chance of survival.
“‘Wait, dignity points?’ you ask. ‘What are those? In what units are they measured, exactly?’
And to this I reply: Obviously, the measuring units of dignity are over humanity’s log odds of survival—the graph on which the logistic success curve is a straight line. A project that doubles humanity’s chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity.”
I don’t think our chances of survival will increase if LessWrong becomes substantially more risk-averse about publishing research and musings about AI. I think they will decrease.