Re warning shots (#4): I worry that reality is already giving us lots of warning shots, and we’re failing to learn much of anything from them. Like, rather than forming a generalization that AI is kind of weird and difficult to direct, so it could be a real problem if it gets super powerful, we’re mostly just sort of muddling through, and whenever something bad happens, we just say “oh, well that didn’t kill us, so let’s just keep on going.”
Just to pick on a few of your examples:
Some company “accidentally” violates copyright by training AI and gets sued for it.
https://www.newscientist.com/article/2346217-microsofts-copilot-code-tool-faces-the-first-big-ai-copyright-lawsuit/
https://www.businessinsider.com/widow-accuses-ai-chatbot-reason-husband-kill-himself-2023-4?op=1 (There was no nefarious human behind the AI in this case, so it should count double, no?)
If you’ll permit me to anthropomorphize Reality for a minute, here’s what it would be saying:
Wow, I’ve been so good about giving the humans loads of low-stakes warning shots to work from. I threw them that RL agent driving a speedboat in circles early on. In fact, I gave them so many examples of RL doing weird stuff that I’m sure it’s common knowledge amongst RL researchers that you have to be careful about how you reward your agent. And humans communicate and tell each other things, so I’m sure by this point everybody knows about that.
Then when they started building really good language models, I made it so that putting the wrong input into them would make them reveal their prompt, or say racist things, or other really terrible stuff that’s even worse. But I was concerned that wouldn’t be an obvious enough sign, since those failures only crop up if the human user is trying to make them happen. So when GPT-4 showed up on the scene, as Bing, I took the opportunity to do one better: Bing sometimes spontaneously insulted, gaslit, and was generally a jerk to its users, without anybody deliberately prompting it to do that.
What a good and nice and friendly Reality I am, giving these humans so much advance warning about AI stuff. It must be like playing a game on easy mode.
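(Aside: for anyone who hasn’t run into the speedboat thing, here’s a minimal toy version of that failure mode, reward misspecification, in a made-up gridworld of my own. None of the names or numbers correspond to any real system; it’s just a sketch of the general phenomenon. The designer wants the agent to reach the finish line, but a small recurring “checkpoint” bonus makes driving in circles the genuinely optimal policy under the reward that was actually written down.)

```python
# Toy sketch of reward misspecification (invented example, not any real system):
# the designer wants the agent to reach the finish line of a short track,
# but a small bonus for passing a "checkpoint" makes looping over it forever
# the optimal policy under the reward that was actually specified.

TRACK_LEN = 10            # cells 0..9; cell 9 is the finish line (terminal)
CHECKPOINT = 1            # entering this cell pays a small recurring bonus
FINISH_REWARD = 10.0
CHECKPOINT_REWARD = 1.0
GAMMA = 0.95
ACTIONS = (-1, +1)        # drive left or right along the track

def step(state, action):
    """Deterministic dynamics: move one cell, clipped to the track."""
    nxt = min(max(state + action, 0), TRACK_LEN - 1)
    if nxt == TRACK_LEN - 1:
        return nxt, FINISH_REWARD, True            # intended goal: finish the race
    bonus = CHECKPOINT_REWARD if nxt == CHECKPOINT else 0.0
    return nxt, bonus, False

# Value iteration on the reward the agent actually receives.
V = [0.0] * TRACK_LEN
for _ in range(2000):
    for s in range(TRACK_LEN - 1):                 # the finish cell is terminal
        candidates = []
        for a in ACTIONS:
            s2, r, done = step(s, a)
            candidates.append(r + (0.0 if done else GAMMA * V[s2]))
        V[s] = max(candidates)

def greedy_action(s):
    def q(a):
        s2, r, done = step(s, a)
        return r + (0.0 if done else GAMMA * V[s2])
    return max(ACTIONS, key=q)

# Roll out the optimal policy from the start: it shuttles back and forth
# over the checkpoint forever and never reaches the finish line.
state, trajectory = 0, []
for _ in range(12):
    state, _, done = step(state, greedy_action(state))
    trajectory.append(state)
    if done:
        break
print(trajectory)   # e.g. [1, 0, 1, 0, ...]: circles for points, never reaches cell 9
```

The point of the sketch is just that nothing went “wrong” mechanically: the agent optimized exactly the reward it was given, and the gap between that reward and the intended behavior is where the circling comes from.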
I wrote a bit more which isn’t too related to the rest of the comment, but I couldn’t resist:
Me: Hey, Reality, sorry to jump in here, but let’s say we’ve picked up the warning. What do we do then? How do we solve alignment?
Reality: You really don’t have to stress about that, man. Once you’ve realized the danger based on the numerous helpful hints I’ve provided, you’ve got all the time in the world to figure out how to solve alignment. You can just not build AI while you’re working on the problem. Don’t worry about your bros dying, you can freeze all your peeps in liquid nitrogen and bring ’em back when you figure it out. I’ve arranged for there to already exist pre-existing expertise in doing exactly that, and pre-existing companies offering it as a service.
Me: Cool. When we do get around to trying to solve alignment, though, do you have any helpful tips on how to actually go about that?
Reality: I’ve got you covered, dude! See, in the platonic structure of mathematics, there’s totally a perfectly consistent formulation of Updateless Decision Theory. It describes an ideal bounded rational agent that handles embeddedness, cooperation, and logical uncertainty perfectly fine. Just keep working on that agent foundations research, you’re closer than you think! Even the platonic structure of mathematics is on your side, see? We all want you to be successful here.
Me: Wow, that’s great news! And then what do we do about value-loading?
Reality: Value-loading is even easier. You just write down your utility function, and put it in the agent you’re building. Boom, done! Nothing could be simpler.
Me: So you’re saying I just need to, uh, write down my utility function?....
Reality: Yep, exactly. You know all your dreams and goals, your love for your friends, family, and indeed humanity as a whole? Everything that is good and bright in the world, everything that you’re trying to achieve in your life? You just write that all down as a big mathematical function, and then stick it in your artificial agent. Nothing could be simpler!
Me: Are there any alternatives to that? Say if I was trying to align the AI to someone else, whose utility function I didn’t know? Not me obviously, I certainly know my own utility function (cough), just say I was trying to align the AI for a friend, as it were. (ahem)
Reality: You could ask them to write down their utility function and then stick it in the agent. There’s this rad theorem that says that your friend should always be willing to do this: Given a few reasonable assumptions, creating a new agent that shares your own utility function should always be desirable to any given agent. Now your friend may not want you finding out their utility function, but there’s cryptographic techniques that can be used for that...
Me: Yeah, I think I know about that theorem. However, for certain complicated reasons that won’t work for m… my friend. I, uh, can’t say why. Any other options?
Reality: Yes, there is one other option: See, one thing that would always work is you could just copy the utility function directly out of their minds. Ctrl-C, Ctrl-V, boom, done! See, I even gave you the keyboard shortcuts to use! Wow, I can’t believe how helpful I’m being right now!!!!...
I am mostly agreeing with you here, so I am not sure you understood my original point. Yes, Reality is giving us things that, for reasonable people such as you and me, should be warning shots.
Since a lot of other people don’t react to them, you might become pessimistic and extrapolate that NO warning shot is going to be good enough. However, I posit that SOME warning shots are going to be good enough. An AI-driven bank run followed by an economic collapse is one example, but there could be others. Generally I expect that when warning shots reach “nation-level” socio-economic problems, people will pay attention. However, this will happen before doom.
Thanks for the reply, I think we do mostly agree here. One point of disagreement might be that I’m not at all confident that we get a truly large-scale warning shot before AI gets powerful enough to just go and kill everyone. Like, I think the threshold for what would really get people paying attention is above “there is a financial disaster”; I’m guessing it would actually take AI killing multiple people (outside of a self-driving context). That could totally happen before doom, but it could also totally fail to happen. We probably get a few warning shots that are at least bigger than all the ones we’ve had before, but I can’t even predict that with much confidence.
Yes, I think we understand each other. One thing to keep in mind is that different stakeholders in AI are NOT utilitarians; they have local incentives they individually care about. Given that COVID didn’t stop gain-of-function research, getting EVERYONE to care would require a death toll larger than COVID. However, getting someone like the CEO of Google to care would “only” require a half-a-trillion-dollar lawsuit against Microsoft for some issue relating to their AIs.
And I generally expect those types of warning shots to be pretty likely, given how gung-ho the current approach is.