Joe Walker has a wide-ranging conversation with Stephen Wolfram about his work, but there are some remarks about AI alignment at the very end:
WALKER: Okay, interesting. So moving finally to AI, many people worry about unaligned artificial general intelligence, and I think it’s a risk we should take seriously. But computational irreducibility must imply that a mathematical definition of alignment is impossible, right?
WOLFRAM: Yes. There isn’t a mathematical definition of what we want AIs to be like. The minimal thing we might say about AIs, about their alignment, is: let’s have them be like people are. And then people immediately say, “No, we don’t want them to be like people. People have all kinds of problems. We want them to be like people aspire to be.”
And at that point, you’ve fallen off the cliff. Because what do people aspire to be? Well, different people aspire to different things, and different cultures aspire in different ways. And I think the concept that there will be a perfect mathematical aspiration is just completely wrongheaded. It’s just the wrong type of answer.
The question of how we should be is a question that is a reflection back on us. There is no “this is the way we should be” imposed by mathematics.
Humans have ethical beliefs that are a reflection of humanity. One of the things I realised recently is that part of what’s confusing about ethics is this: if you’re used to doing science, you say, “Well, I’m going to separate out a piece of the system. I’m going to study this particular subsystem and figure out exactly what happens in it. Everything else is irrelevant.”
But in ethics, you can never do that. So imagine you’re doing one of these trolley problem things. You’ve got to decide whether you’re going to kill the three giraffes or the eighteen llamas. Which one is it going to be?
Well, then you realise that to really answer that question to the best ability of humanity, you’re pulling in the tentacles of everything: the religious beliefs of the tribe in Africa that deals with giraffes, the consequences for the llama whose wool went into this supply chain, and all this kind of thing.
In other words, one of the problems with ethics is that it doesn’t have the separability we’ve been used to in science. It necessarily pulls in everything, and we don’t get to say, “There’s this micro-ethics for this particular thing; we can solve ethics for this thing without the broader picture of ethics outside.”
If you say, “I’m going to make this system of laws, and I’m going to make the system of constraints on AIs, and that means I know everything that’s going to happen,” well, no, you don’t. There will always be an unexpected consequence. There will always be this thing that spurts out and isn’t what you expected to have happen, because there’s this irreducibility, this kind of inexorable computational process that you can’t readily predict.
The idea that we’re going to have a prescriptive collection of principles for AIs, and we’re going to be able to say, “This is enough, that’s everything we need to constrain the AIs in the way we want,” it’s just not going to happen that way. It just can’t happen that way.
Something I’ve been thinking about recently is: so what the heck do we actually do? We have this connection to ChatGPT, for example, and I was thinking, now that it can write Wolfram Language code, I can actually run that code on my computer. And right at the moment where I’m going to press the button that says, “Okay, LLM, whatever code you write is going to run on my computer,” I’m like, “That’s probably a bad idea,” because, I don’t know, it’s going to log into all my accounts everywhere, and it’s going to send you email, and it’s going to tell you this or that thing, and the LLM is in control now.
And I realised that there probably need to be some kind of constraints on this. But what should those constraints be? If I say, “Well, you can’t do anything, you can’t modify any file,” then there’s a lot of stuff that would be useful to me that it can’t do.
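To make the kind of constraint Wolfram is describing concrete, here is a minimal sketch, invented for illustration and not anything from his actual setup: LLM-generated code is checked against an owner-chosen capability policy before anything is executed. The policy keys, the module lists, and the `violations` helper are all hypothetical, and a static import scan is nowhere near a real sandbox; it just shows the shape of the trade-off between “can’t do anything” and “in control of everything”.

```python
import ast

# Hypothetical policy: capabilities the owner grants to LLM-generated code.
POLICY = {
    "allow_file_write": False,   # "you can't modify any file"
    "allow_network": False,      # no logging into accounts, no sending email
}

# Modules treated as granting each capability (illustrative, far from complete).
CAPABILITY_MODULES = {
    "allow_file_write": {"shutil", "pathlib", "os"},
    "allow_network": {"socket", "smtplib", "urllib", "requests", "http"},
}


def violations(source: str) -> list[str]:
    """Statically scan the generated code's imports for policy violations."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module.split(".")[0]]
        else:
            continue
        for name in names:
            for capability, modules in CAPABILITY_MODULES.items():
                if name in modules and not POLICY[capability]:
                    found.append(f"import {name} needs {capability}=True")
    return found


# Pretend this came back from the LLM.
generated_code = "import smtplib\nprint('about to send mail...')\n"

problems = violations(generated_code)
if problems:
    print("Refusing to run generated code:", problems)
else:
    exec(generated_code)   # still risky; a real setup would isolate the process
```

The interesting part is not the scanner but the policy dictionary: every capability you turn off is "a lot of stuff that would be useful to me" that the AI now can’t do, which is exactly the tension Wolfram points at.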
So there is no set of golden principles that humanity agrees on that are what we aspire to. It’s like, sorry, that just doesn’t exist. That’s not the nature of civilisation. It’s not the nature of our society.
And so then the question is: what do you do when you don’t have that? And my best current thought (in fact, I was just chatting about this with the person I was talking to before you) is to develop what might be, let’s say, a couple of hundred principles you could pick from.
One principle might be, I don’t know: “An AI must always have an owner.” “An AI must always do what its owner tells it to do.” “An AI must, whatever.”
Now you might say, an AI must always have an owner? Is that a principle we want? Is that a principle we don’t want? Some people will pick differently.
But can you at least provide scaffolding for what the set of principles you want might be? And then it’s a case of be careful what you wish for, because you make up these 200 principles or something, and a few years later you see people with placards saying, “Don’t do number 34,” and you realise, “Oh, my gosh, what did one set up?”
But I think one needs some kind of framework for thinking about these things, rather than just people saying, “Oh, we want AIs to be virtuous.” Well, what the heck does that mean?
Or, “We have this one particular thing: we want AIs to not do this societally terrible thing right here, but we’re blind to all this other stuff.” None of that is going to work.
You have to have a formalisation of ethics such that you can actually pick: you can literally say, “I’m going to be running with number 23 and number 25, and not number 24,” or something. But you’ve got to make that kind of framework.
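As a toy illustration of why a numbered catalogue is attractive, here is a sketch of what a per-deployment “profile” of adopted principles might look like in machine-readable form. The principle texts, the numbers, and the profile structure are all invented for this example, not anything Wolfram has specified.

```python
# Invented catalogue of numbered principles (the texts are made up for illustration).
CATALOGUE = {
    23: "An AI must always have an owner.",
    24: "An AI must always do what its owner tells it to do.",
    25: "An AI must identify itself as an AI when asked.",
    34: "An AI must not retain conversation logs beyond 30 days.",
}

# One deployment "running with number 23, number 25, and not number 24".
profile = {"name": "example-deployment", "adopted": {23, 25, 34}}

for number in sorted(profile["adopted"]):
    print(f"{number}: {CATALOGUE[number]}")
```

The point of the exercise is only that disagreements then attach to specific numbered items ("don’t do number 34") rather than to vague words like "virtuous".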