I think we have some underlying disagreements about the nature of the AI we are talking about.
I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don’t think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.
I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn’t be smart enough to pose a threat, either.
I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don’t think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.
I agree that the first AI system might be hacked together. Any AI is based on math in the sense that its fundamental components are doing logical operations, and it only works in reality to the extent that it approximates stuff like Bayes’ theorem. But the difference is whether or not humans have a sufficiently good mathematical understanding of the AI to prove theorems about it. If we have an algorithm which we have a good theoretical understanding of, like minimax in chess, then we don’t call it hacked-together heuristics. If we throw lines of code at a wall and see what sticks, we would call that hacked-together heuristics. The difference is that the second is more complicated and less well understood by humans, and has no elegant theorems about it.
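To make the distinction concrete, here is roughly what I mean by a well-understood algorithm: a minimal minimax sketch (illustrative only; the game-specific pieces `get_moves`, `apply_move` and `evaluate` are placeholders, not any particular implementation). The whole decision procedure fits in a few lines, and we can prove theorems about it, e.g. that it returns the game-theoretic value of a position up to the given search depth.

```python
# Minimal minimax for a two-player zero-sum game (illustrative sketch).
# The interesting property is not performance but that we can reason about it:
# it provably returns the game-theoretic value of the position up to `depth`,
# given the game callbacks it is handed.

def minimax(state, maximizing, get_moves, apply_move, evaluate, depth):
    """Best achievable score from `state` assuming both players play optimally."""
    moves = get_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)                    # leaf node: static evaluation
    child_values = [
        minimax(apply_move(state, move), not maximizing,
                get_moves, apply_move, evaluate, depth - 1)
        for move in moves
    ]
    return max(child_values) if maximizing else min(child_values)
```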
You seem to think that your AI alignment proposal might work, and I think it won’t. Do you want to claim that your alignment proposal only works on badly understood AIs?
I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn’t be smart enough to pose a threat, either.
Let’s imagine that the AI was able to predict any objective fact about the real world. If the task was “cat identification” then the cat memes would be more relevant. So whether or not ethics discussions are more relevant depends on the definition of “cheat identification”.
If you trained the AI in virtual worlds that contained virtual ethics discussions, and virtual cat memes, then it could learn to pick up the pattern if trained to listen to one and ignore the other.
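A toy version of that, with made-up names and a plain logistic-regression learner standing in for the AI: if the training signal only ever depends on one input channel, the learner ends up putting its weight on that channel and ignoring the other. Nothing here is a real design proposal; it just shows where the “listen to this, not that” information comes from.

```python
# Toy illustration: a learner "trained to listen to one channel and ignore the
# other". The channel names are invented for the example, and the learner is a
# plain logistic regression, not a proposal for how a real AI would work.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
ethics_signal = rng.normal(size=n)       # channel the training signal depends on
meme_signal = rng.normal(size=n)         # irrelevant channel
X = np.column_stack([ethics_signal, meme_signal])
y = (ethics_signal > 0).astype(float)    # "reward" is a function of channel 0 only

w = np.zeros(2)
for _ in range(2000):                    # gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - y)) / n

print(w)  # weight on ethics_signal grows large; weight on meme_signal stays near 0
```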
The information that the AI is supposed to look at ethics discussions, and at what the programmers say, as its source of decisions does not magically appear. There are possible designs of AI that decide what to do based on cat memes.
At some point, something the programmers typed has to have the causal consequence of making the AI look at programmers and ethics discussions, not cat memes.
At some point, something the programmers typed has to have the causal consequence of making the AI look at programmers and ethics discussions, not cat memes.
No. Or at least not directly. That’s what reinforcement learning is for. I maintain that the AI should be smart enough to figure out on its own that cat memes have less relevance than ethics discussions.
Relevance is not an intrinsic property of the cat memes. You might be specifying it in a very indirect way that leaves the AI to figure a lot of things out, but the information needs to be in there somewhere.
There is a perfectly valid design of AI that decides what to do based on cat memes.
Reinforcement learning doesn’t magic information out of nowhere. All the information is implicit in the choice of neural architecture, hyper-parameters, random seed, training regime and of course training environment. In this case, I suspect you intend to use the training environment. So, what environment will the AI be trained in, such that the simplest (lowest Kolmogorov complexity) generalization of a pattern of behaviour that gains high reward in the training environment involves looking at ethics discussions over cat memes?
I am looking for a specific property of the training environment: a pattern such that, when the AI spots and continues that pattern, the resulting behaviour is to take account of our ethical discussions.
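To be concrete about the kind of answer I am after: in the toy sketch above, the only place the “relevance” information lives is in how the training signal was wired up. Flip which channel the signal depends on and the very same learner tracks cat memes instead. That wiring is the specific property of the training environment I am asking you to spell out.

```python
# Same toy learner as before, with the training signal flipped to depend on the
# meme channel. The learner is unchanged; only the environment specification
# changed, and the learned behaviour flips with it.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
ethics_signal = rng.normal(size=n)
meme_signal = rng.normal(size=n)
X = np.column_stack([ethics_signal, meme_signal])
y = (meme_signal > 0).astype(float)      # "reward" now depends on channel 1 only

w = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - y)) / n

print(w)  # now meme_signal carries the weight and ethics_signal is ignored
```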