Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it’s much safer to deploy AI iteratively, increasing the stakes, time horizons, and autonomy a little bit each time.
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Human raters make systematic errors: regular, compactly describable, predictable errors. …
This is indeed one of the big problems of outer alignment, but there’s lots of ongoing research and promising ideas for fixing it, namely using models to help amplify and improve the human feedback signal. Because P != NP, it’s easier to verify proofs than to write them.
When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Do alignment & safety research, set up regulatory bodies and monitoring systems.
When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.
Not sure exactly what this means. I’m claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.
My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there’s any crack in the checking steps, then things that ‘check out’ aren’t trustable, because the proposer can have searched an unimaginably large space (from the rater’s perspective) to find them. [And from the proposer’s perspective, the checking steps are the real spec, not whatever’s in your head.]
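A toy sketch of that reading, in Python. Everything here is a hypothetical illustration rather than anything from the thread: the `Candidate` type, the two raters, and the numbers (a true value cycling 0 to 9, one flaw-triggering junk candidate per 1000) are assumptions chosen only to make the effect visible. The proposer simply returns whatever scores highest under the rater it is given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    true_value: int      # what we actually care about (not directly visible to the rater)
    exploits_flaw: bool  # whether this candidate triggers the rater's systematic error


def make_candidates(n: int) -> list[Candidate]:
    """Hypothetical search space: mostly honest candidates, plus rare junk
    (one in every 1000) that happens to hit the crack in the flawed rater."""
    out = []
    for i in range(n):
        if i % 1000 == 999:
            out.append(Candidate(true_value=0, exploits_flaw=True))
        else:
            out.append(Candidate(true_value=i % 10, exploits_flaw=False))
    return out


def sound_rater(c: Candidate) -> int:
    # A rock-solid checking step: the score equals the true value.
    return c.true_value


def flawed_rater(c: Candidate) -> int:
    # Same check, but with one systematic error that wrongly rewards the exploit.
    return c.true_value + (100 if c.exploits_flaw else 0)


def propose(candidates: list[Candidate], rater) -> Candidate:
    # The proposer optimizes the rater's score, which is the only spec it sees.
    return max(candidates, key=rater)


for n in (100, 10_000):
    for rater in (sound_rater, flawed_rater):
        best = propose(make_candidates(n), rater)
        print(n, rater.__name__, best)
```

In this toy setup, a small search against the flawed rater still returns a genuinely good candidate, because no exploit happens to be in range. Widening the search is harmless with the sound rater, but with the flawed one the top-scoring candidate becomes the junk that hits the crack.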
In general, I think we can get a minor edge from “checking AI work” instead of “generating our own work” and that doesn’t seem like enough to tackle ‘cognitive megaprojects’ (like ‘cure cancer’ or ‘develop a pathway from our current society to one that can reliably handle x-risk’ or so on). Like, I’m optimistic about “current human scientists use software assistance to attempt to cure cancer” and “an artificial scientist attempts to cure cancer” and pretty pessimistic about “current human scientists attempt to check the work of an artificial scientist that is attempting to cure cancer.” It reminds me of translators who complained pretty bitterly about being given machine-translated work to ‘correct’; they basically still had to do it all over again themselves in order to determine whether or not the machine had gotten it right, and so it wasn’t nearly as much of a savings as hoped.
Like, the value of ‘DocBot attempts to cure cancer’ is that DocBot can think larger and wider thoughts than humans, and natively manipulate an opaque-to-us dense causal graph of the biochemical pathways in the human body, and so on; if you insist on DocBot only thinking legible-to-human thoughts, then it’s not obvious it will significantly outperform humans.
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Produce the Textbook From The Future that tells us how to do AGI safely. That said, getting an AGI to generate a correct Foom safety textbook or AGI Textbook From The Future would be incredibly difficult: it would be very possible for an AGI to slip in a subtle, hard-to-detect inaccuracy that would make it worthless; verifying that it is correct would be very difficult; and getting all humans on Earth to follow it would be very difficult.
If Facebook AI Research is such a threat, wouldn’t it be possible to talk to Yann LeCun?
I did, briefly. I ask that you not do so yourself, or anybody else outside one of the major existing organizations, because I expect that will make things worse as you annoy him and fail to phrase your arguments in any way he’d find helpful.
Other MIRI staff have also chatted with Yann. One co-worker told me that he was impressed with Yann’s clarity of thought on related topics (e.g., he has some sensible, detailed, reductionist models of AI), so I’m surprised things haven’t gone better.
Non-MIRI folks have talked to Yann too; e.g., Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More.
What happened?
Nothing much.
There was also a debate between Yann and Stuart Russell on Facebook, which got discussed here:
https://www.lesswrong.com/posts/WxW6Gc6f2z3mzmqKs/debate-on-instrumental-convergence-between-lecun-russell
For a more comprehensive writeup of some stuff related to the “annoy him and fail to phrase your arguments helpfully” point, see Idea Innoculation and Inferential Distance.
My view is that if Yann continues to be interested in arguing about the issue, then there’s something to work with, even if he’s skeptical, and the real worry is if he’s stopped talking to anyone about it (I have no idea personally what his state of mind is right now).