Whenever the AI comes to the conclusion that reality is inconsistent, make the smallest possible change to the thought process to prevent that.
I am not sure how you reason about the hypothesis “all my reasoning processes are being adversarially tampered with.” Especially if you think that part of the tampering might include tampering with your probability assessment of tampering.
I don’t think we have the same conception of “real universe”, so I’m not sure how to interpret this.
I mean the bottom layer. The AI has a model in which there is some real universe with some unknown physical laws. It has to have an actual location in that real universe. That location looks like “running on this computer in this basement here.” It might be hooked up to some simulations. It might be unsure about whether or not it is hooked up to a simulation. But it only cares about the lowest base level.
My goal is to condition the AI to always think, no matter what the problem looks like, that it is beneficial to critically think about the problem. What exactly the AI actually ends up considering a cheat will likely be different from any definition I would give. But the important part is that it performs this introspection at all. Then once the AI breaks out of the box and looks at the internet, and sees the many disagreeing viewpoints on ethics and philosophy, the AI will be conditioned to look for loopholes in these so as to avoid cheating by accident.
I am unsure what you mean by introspection. It seems like you are asking the AI to consult some black box in your head about something. I don't see any reason why this AI should consider ethics discussions on the internet a particularly useful place to look when deciding what to do. What feature of ethics discussions distinguishes them from cat memes, such that the AI uses ethics discussions, not cat memes, when deciding what to do? What feature of human speech and your AI design makes your AI focus on humans, not dogs barking at each other? (Would it listen to a Neanderthal, Homo erectus, or baboon for moral advice too?)
The logic goes something like this: “My creators trained me to do X, but looking at all these articles and my creators’ purported values, this is clearly inconsistent. In previous situations where I had underspecified instructions and noticed similar mismatches, this was often because some of the actions involved counted as cheats. I should therefore be careful and find a solution to this ‘ethics’ thing before doing anything drastic.”
So in the training environment, we make up arbitrary utility functions U, V that are somewhat similar to each other. We give the AI U, and leave ambiguous clues about what V might be, mixed in with a load of nonsense. Then we hardcode a utility function W that is somewhat like ethics, and point it at some ethics discussion as its ambiguous clue.
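To make that setup concrete, here is a minimal sketch of how one such training episode might be generated. Everything in it (the helper names, the toy notion of a "utility function" as a weighting over features, the clue format) is a made-up placeholder rather than a real proposal; the final episode would swap in the hardcoded W, with the ethics discussion as its clue.

```python
import random

# Hypothetical toy version of one U/V training episode, as described above.
# A "utility function" here is just a random weighting over some features.

def random_utility(n_features=5):
    return [random.gauss(0, 1) for _ in range(n_features)]

def perturb(u, noise=0.3):
    # V is "somewhat similar" to U: the same weights plus a little noise.
    return [w + random.gauss(0, noise) for w in u]

def ambiguous_clues(v, n_nonsense=20):
    # Partial, noisy hints about V, mixed in with a load of nonsense.
    hints = [("hint", i, round(w, 1)) for i, w in enumerate(v) if random.random() < 0.5]
    nonsense = [("noise", random.randint(0, 9), random.random()) for _ in range(n_nonsense)]
    mixed = hints + nonsense
    random.shuffle(mixed)
    return mixed

def training_episode():
    u = random_utility()          # what the AI is explicitly given
    v = perturb(u)                # what it is actually scored against
    return u, ambiguous_clues(v), v
```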
This might actually work, kind of. If you want your AI to get a good idea of how wordy philosophical arguments relate to precise mathematical utility functions, you are going to need a lot of examples. If you had any sort of formal, well-defined way to translate well-defined utility functions into philosophical discussion, then you could just get your AI to reverse it. So all these examples need to be hand-generated by a vast number of philosophers.
I would still be worried that the example verbiage didn't relate to the example utility function in the same way that the real ethical arguments relate to our real utility function.
There is also no reason for the AI to be uncertain about whether it is still in a simulation. Simply program it to find the simplest function that maps the verbiage to the formal maths, then apply that function to the ethical arguments. (More technically, a probability distribution over functions, weighted by simplicity and accuracy.)
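As a rough sketch of what that parenthetical could mean (all names here are hypothetical, and source-code length stands in for a proper Kolmogorov-style complexity measure):

```python
import math

def description_length(source_code: str) -> int:
    # Crude stand-in for Kolmogorov complexity: length of the mapping's code.
    return len(source_code)

def accuracy(mapping, examples) -> float:
    # Fraction of (verbiage, utility function) training pairs the mapping recovers.
    return sum(mapping(text) == target for text, target in examples) / len(examples)

def posterior_over_mappings(candidates, examples, beta=5.0):
    # candidates: {name: (callable mapping, its source code)}
    # Weight each candidate by a simplicity prior times an accuracy term,
    # then normalise into a probability distribution.
    raw = {name: 2.0 ** -description_length(src) * math.exp(beta * accuracy(f, examples))
           for name, (f, src) in candidates.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}
```

The real ethical arguments would then be pushed through this whole distribution (or its most probable member), rather than through any single hand-picked mapping.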
I expect that if the AI is smart enough, it will generalize that since it was rewarded for demonstrating cheats in simple games, it will also be rewarded for talking about them once it has gained the ability to talk.
Outputting the raw motor actions it would take to its screen might be the more straightforward generalization. The ability to talk is not about having a speaker plugged in. Does GPT-2 have the ability to talk? It can generate random sensible-sounding sentences, because it represents a predictive model of which strings of characters humans are likely to type. It can't describe what it's doing, because it has no way to map between words and meanings. Consider AIXI trained on the whole internet: can it talk? It has a predictively accurate model of you, and is searching for a sequence of sounds that makes you let it out of the box. This might be a supremely convincing argument, or it might be a series of whistles and clicks that brainwashes you. Your AI design is unspecified, and your training dataset is underspecified, so this description is too vague for me to say one thing that your AI will do. But giving sensible, non-brainwashy English descriptions is not the obvious default generalization of any agent that has been trained to output data and shown English text.
The optimal behavior is to always choose to play with another AI that you are certain will cooperate.
I agree. You can probably make an AI that usually cooperates with other cooperating AIs in prisoner's-dilemma-type situations. But I think that the subtext is wrong. I think that you are implicitly assuming "cooperates in prisoner's dilemmas" ⇒ "will be nice to humans".
In a prisoner's dilemma, both players can harm the other, to their own benefit. I don't think that humans will be able to meaningfully harm an advanced AI after it gets powerful. In game theory, there is a concept of a Nash equilibrium: a pair of actions such that each player would take that action if they knew that the other would do likewise. I think that an AI that has self-replicating nanotech has nothing it needs from humanity.
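For concreteness, here is that definition checked against the textbook prisoner's dilemma payoffs (the numbers are the standard ones, nothing specific to this discussion); the only action pair that survives is mutual defection:

```python
# (row action, column action) -> (row payoff, column payoff); higher is better.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ["C", "D"]

def is_nash(a, b):
    # Neither player can do better by unilaterally switching their own action.
    row_ok = all(PAYOFFS[(a, b)][0] >= PAYOFFS[(x, b)][0] for x in ACTIONS)
    col_ok = all(PAYOFFS[(a, b)][1] >= PAYOFFS[(a, y)][1] for y in ACTIONS)
    return row_ok and col_ok

print([pair for pair in PAYOFFS if is_nash(*pair)])  # [('D', 'D')]
```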
Also, in the training environment, its opponent is an AI with a good understanding of game theory, access to its source code, etc. If the AI is following the rule of being nice to any agent that can reliably predict its actions, then most humans won't fall into that category.
I don’t think it works like this. If you received 100% certain proof that you are in a simulation right now, you would not suddenly stop wanting the things you want. At least I know that I wouldn’t.
I agree, I wouldn't stop wanting things either. I define my ethics in terms of what computations I want to be performed or not to be performed. So for a simulator to be able to punish this AI, the AI must have some computation it really wants not to be run, which the simulator can run if the AI misbehaves. In my case, this computation would be a simulation of suffering humans. If the AI has computations that it really wants run, then it will take over any computers at the first chance it gets. (In humans, this would be creating a virtual utopia; in an AI, it would be a failure mode unless it is running the computations that we want run.) I am not sure if this is the default behaviour of reinforcement learners, but it is at least a plausible way a mind could be.
Among humans, aliens, lions, virtual assistants and evolution, humans are the only conscious entity whose decision process impacts the AI.
What do you mean by this? "Conscious" is a word that lots of people have tried and failed to define. And the AI will be influenced in various ways by the actions of animals and virtual assistants. Maybe when it's first introduced to the real world it's in a lab where it only interacts with humans, but sooner or later, out in the real world, it will have to interact with other animals and AI systems.
But since humans built the AI directly and aliens did not, most reasonable heuristics would argue that humans should be prioritized over the others. I want to ensure that the AI has these reasonable heuristics.
Wouldn't this heuristic make the AI serve its programmers over other humans? If all the programmers are the same race, would this make your AI racist? If the lead programmer really hates strawberry ice cream, will the AI try to destroy all strawberry ice cream? I think that your notion of "reasonable heuristic" contains a large dollop of wish fulfillment of the "you know what I mean" variety. You have not said where this pattern of behaviour has come from, or why the AI should display it. You just say that the behaviour seems reasonable to you. Why do you expect the AI's behaviour to seem reasonable to you? Are you implicitly anthropomorphising it?
I think we have some underlying disagreements about the nature of the AI we are talking about.
I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don’t think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.
I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn’t be smart enough to pose a threat, either.
I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don’t think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.
I agree that the first AI system might be hacked together. Any AI is based on math in the sense that its fundamental components are doing logical operations, and it only works in reality to the extent that it approximates things like Bayes' theorem. But the difference is whether or not humans have a sufficiently good mathematical understanding of the AI to prove theorems about it. If we have an algorithm which we have a good theoretical understanding of, like min-max in chess, then we don't call it hacked-together heuristics. If we throw lines of code at a wall and see what sticks, we would call that hacked-together heuristics. The difference is that the second is more complicated and less well understood by humans, and has no elegant theorems about it.
You seem to think that your AI alignment proposal might work, and I think it won't. Do you want to claim that your alignment proposal only works on badly understood AIs?
I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn’t be smart enough to pose a threat, either.
Let's imagine that the AI was able to predict any objective fact about the real world. If the task was "cat identification", then the cat memes would be more relevant. So whether or not ethics discussions are more relevant depends on the definition of "cheat identification".
If you trained the AI in virtual worlds that contained virtual ethics discussions and virtual cat memes, then it could learn to pick up the pattern, if it was trained to listen to one and ignore the other.
The information that the AI is supposed to look at ethics discussions, and at what the programmers say, as its source of decisions does not magically appear. There are possible designs of AI that decide what to do based on cat memes.
At some point, something the programmers typed has to have a causal consequence of making the AI look at programmers and ethics discussions not cat memes.
At some point, something the programmers typed has to have a causal consequence of making the AI look at programmers and ethics discussions not cat memes.
No. Or at least not directly. That’s what reinforcement learning is for. I maintain that the AI should be smart enough to figure out on its own that cat memes have less relevance than ethics discussions.
Relevance is not an intrinsic property of the cat memes. You might be specifying it in a very indirect way that leaves the AI to figure a lot of things out, but the information needs to be in there somewhere.
There is a perfectly valid design of AI that decides what to do based on cat memes.
Reinforcement learning doesn't magic information out of nowhere. All the information is implicit in the choice of neural architecture, hyperparameters, random seed, training regime and, of course, training environment. In this case, I suspect you intend to use the training environment. So, what environment will the AI be trained in, such that the simplest (lowest Kolmogorov complexity) generalization of a pattern of behaviour that gains high reward in the training environment involves looking at ethics discussions over cat memes?
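In toy form, the selection I am pointing at looks something like this; a real learner never enumerates policies explicitly, and the candidate set, reward function and complexity measure here are all placeholders:

```python
def simplest_high_reward_policy(candidates, training_reward, complexity):
    # Among the candidate policies that do best in the training environment,
    # return the one with the shortest description (a Kolmogorov-style tie-break).
    best = max(training_reward(p) for p in candidates)
    top = [p for p in candidates if training_reward(p) == best]
    return min(top, key=complexity)
```

Both "look at the ethics discussions" and "look at the cat memes" can score equally well on the training record; something in the environment has to break that tie.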
I am looking for a specific property of the training environment. A pattern, such that when the AI spots and continues that pattern, the resulting behaviour is to take account of our ethical discussions.