Let’s say we want our EfficientZero-7 to output good alignmentforum blog posts. We have plenty of training data for the finished product, but we don’t have training data for the “figuring out what to write” part. That part happens in the writer’s head.
(Suppose the test data is a post containing Insight X. If we’re training a network to output that post, the network updates can lead to the ability to figure out Insight X, or can lead the network to already know Insight X. Evidence from GPT-3 suggests that the latter is what would actually happen, IMO.)
So then maybe you’ll say: Someone will get the AGI safety researcher to write an alignmentforum blog post while wearing a Kernel Flux brain-scanner helmet, and make EfficientZero-7 build a model from that. But I’m skeptical that the brain-scan data would sufficiently constrain the model so that it would learn how to “figure things out”. Brain scans are too low-resolution, too noisy, and/or too incomplete. I think they would miss pretty much all the important aspects of “figuring things out”.
I think if we had a sufficiently good operationalization of “figuring things out” to train EfficientZero-7, we could just use that to build a “figuring things out” AGI directly instead.
That’s my guess anyway.
Then maybe your response would be: Writing alignmentforum blog posts is a bad example. Instead let’s build silicon-eating nanobots. We can run a slow, expensive molecular-dynamics simulation on a supercomputer, and we can have EfficientZero-7 query it, watch it, build its own “mental model” of what happens in the simulation, and recapitulate that model on cheaper, faster GPUs. And we can put in some kind of score that is maximized when you query the model with the precursors to a silicon-eating nanobot.
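For concreteness, here is a minimal sketch of the general pattern that hypothetical describes: query an expensive simulator sparingly, fit a cheap learned surrogate to the observed query/outcome pairs, then optimize a score against the surrogate alone. Everything below is an invented placeholder (a toy scalar function standing in for the molecular-dynamics code, ridge regression standing in for EfficientZero-7’s learned model), not anyone’s actual proposal or API.

```python
# Sketch: expensive simulator -> cheap learned surrogate -> search on the surrogate.
# All names here are hypothetical stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy "design space" dimension standing in for nanobot precursors


def expensive_md_simulation(x: np.ndarray) -> float:
    """Stand-in for the slow supercomputer MD run: some unknown scalar outcome."""
    return float(np.sin(x).sum() + 0.1 * x @ x)


class Surrogate:
    """Toy 'mental model' of the simulator: ridge regression on queries seen so far."""

    def __init__(self, dim: int, reg: float = 1e-2):
        self.dim, self.reg = dim, reg
        self.X, self.y = [], []
        self.w = np.zeros(dim)

    def observe(self, x: np.ndarray, y: float) -> None:
        # Refit on all observed (query, outcome) pairs; fine for a toy.
        self.X.append(x)
        self.y.append(y)
        X, Y = np.array(self.X), np.array(self.y)
        self.w = np.linalg.solve(X.T @ X + self.reg * np.eye(self.dim), X.T @ Y)

    def predict(self, x: np.ndarray) -> float:
        return float(x @ self.w)


surrogate = Surrogate(DIM)

# Phase 1: a small budget of expensive queries to build the surrogate.
for _ in range(50):
    x = rng.normal(size=DIM)
    surrogate.observe(x, expensive_md_simulation(x))

# Phase 2: cheap search against the surrogate only (the "cheaper, faster GPUs" step),
# maximizing whatever score we attached to the model's predictions.
candidates = rng.normal(size=(100_000, DIM))
scores = candidates @ surrogate.w  # batched surrogate predictions
best = candidates[np.argmax(scores)]
print("best predicted outcome:", surrogate.predict(best))
print("actual simulator outcome:", expensive_md_simulation(best))
```

The only point the sketch illustrates is the economics of the story: the expensive thing gets queried a bounded number of times, and all of the search happens against the cheap learned copy.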
I can get behind that kind of story; indeed, I would not be surprised to see papers along those general lines popping up on arXiv tomorrow, or indeed years ago. But I would describe that kind of thing as “pivotal acts that require only narrow AI”. I’m not an expert on pivotal acts, and I’m open-minded to the possibility that there are “pivotal acts that require only narrow AI”. And I’m also open-minded to the possibility that we can’t do those acts today because they require too much querying of expensive-to-query stuff like molecular-simulation code, humans, or real-world actuation, and that future narrow-AI advances like EfficientZero-7 will solve that problem. I guess I’m modestly skeptical, but I suppose there are unknown unknowns (to me), and I certainly haven’t spent enough time thinking about pivotal acts to have any confidence.
I wasn’t imagining this being a good thing that helps save the world; I was imagining it being a world-ending thing that someone does anyway because they don’t realize how dangerous it is.
I totally agree that the two examples you gave probably wouldn’t work. How about this though:
--Our task will be: Be a chatbot. Talk to users over the course of several months to get them to give you high marks in a user satisfaction survey.
--Pre-train the model on logs of human-to-human chat conversations so you have a reasonable starting point for making predictions about how conversations go.
--Then run the EfficientZero algorithm, but with a massively larger parameter count, and talking to hundreds of thousands (millions?) of humans for several years. It would be a very expensive, laggy chatbot (but the users wouldn’t care, since they aren’t paying for it, and even with lag the text comes in about as fast as a human would reply).
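A very stripped-down toy of the setup in the three bullets above, under heavy assumptions: a simulated “user” instead of real humans, a handful of canned reply types instead of free text, and a bandit-style value estimate learned from delayed survey scores instead of EfficientZero’s actual MCTS over a learned latent dynamics model. All names and numbers are invented for illustration.

```python
# Toy sketch: chatbot gets only an end-of-conversation survey score as reward.
# This is NOT EfficientZero; it is a drastically simplified stand-in.
import numpy as np

rng = np.random.default_rng(1)
N_REPLY_TYPES = 6   # stand-in for the space of possible chatbot utterances
TURNS = 20          # turns per "conversation" before the satisfaction survey

# Hidden per-turn satisfaction each reply type yields for this toy user
# (unknown to the bot; only the end-of-conversation survey score is observed).
TRUE_SATISFACTION = rng.normal(size=N_REPLY_TYPES)


def run_conversation(policy_logits: np.ndarray):
    """Play one conversation; return the chosen replies and the final survey score."""
    probs = np.exp(policy_logits) / np.exp(policy_logits).sum()
    replies = rng.choice(N_REPLY_TYPES, size=TURNS, p=probs)
    survey_score = TRUE_SATISFACTION[replies].sum() + rng.normal(scale=0.5)
    return replies, survey_score


# "Pre-training on human-to-human chat logs": here just a uniform prior over
# reply types, standing in for a language model fit to conversation transcripts.
policy_logits = np.zeros(N_REPLY_TYPES)

# Learned value-per-reply estimates: the bot's cheap substitute for querying
# real users, built up from the delayed survey scores it actually observes.
reply_value = np.zeros(N_REPLY_TYPES)
reply_count = np.zeros(N_REPLY_TYPES)

for episode in range(2000):
    replies, score = run_conversation(policy_logits)
    # Crudest possible credit assignment: spread the delayed survey score
    # evenly over the replies used (a real system would do far more).
    for r in replies:
        reply_count[r] += 1
        reply_value[r] += (score / TURNS - reply_value[r]) / reply_count[r]
    # Policy improvement: lean toward reply types the learned values rate highly.
    policy_logits = 2.0 * (reply_value - reply_value.mean())

print("learned ranking of reply types:", np.argsort(-reply_value))
print("true ranking of reply types:   ", np.argsort(-TRUE_SATISFACTION))
```

The point of the toy is only the reward structure: the bot never sees per-turn feedback, only the delayed end-of-conversation survey score, and still learns which behaviors raise it.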
Seems to me this would “work” in the sense that we’d all die within a few years of this happening, on the default trajectory.
In a similar conversation about non-main-actor paths to dangerous AI, I came up with this as an example of a path I can imagine being both plausible and dangerous. A plausible-to-me worst-case scenario would be something like:
A phone-scam organization employs someone to build them an online-learning reinforcement-learning agent (using an open-source language model as a language-understanding component) that functions as a scam helper. It takes in the live transcription of the ongoing conversation between a scammer and a victim, and gives the scammer suggestions for what to say next to persuade the victim to send money. So long as it was even a bit helpful sometimes, according to the team of scammers using it, it would be given more resources and would continue to collect useful data.
I think this scenario contains a number of dangerous aspects:
--being illegal and secret, not subject to ethical or safety guidance or regulation
--deliberately being designed to open-endedly self-improve
--bringing in incremental resources as it trains to continue to prove its worth (thus not needing a huge initial investment of training cost)
--being agentive and directed at the specific goal of manipulating and deceiving humans
I don’t think we need 10 more years of progress in algorithms and compute for this story to be technologically feasible. A crude version of this is possibly already in use, and we wouldn’t know.