How would you respond if you set this up, and then your personal GPT concluded, from your epistemology and the information available to it, that your epistemology is fundamentally flawed and you should adopt a different one. Suppose further than when it tried to explain it to you, everything that it said made sense but you could not follow the full argument.
What should happen then? Should you no longer be the center of the prosthetic enabled “you” that has gone beyond your comprehension? Should the prosthetics do their thing, with the goal of supplying you with infinite entertainment instead of merely amplifying you? Should the prosthetics continue to be bound by the limitations of your mind? (Not necessarily crazy if you’re afraid of another advanced AI hacking your agents to subvert them.)
Obviously ChatGPT does not offer sufficient capabilities that this should happen. But if your future continues to the point where the agents augmenting your capabiliteis have AGI, this type of challenge will arise.
I’m not sure I can imagine a concrete example of an instance where both (1) everything that it said made sense and (2) I am not able to follow the full argument.
Maybe you could give me an example of a scenario?
I believe, if the alignment bandwidth is high enough, it should be the case that whatever an external agent does could be explained to ‘the host’ if that were what the host desired.
Let’s presuppose that you are an Objectivist. If you don’t know about Objectivism, I’ll just give some key facts.
Objectivists place great value in rationality and intellectual integrity.
Objectivists believe that they have a closed philosophy. Meaning that there is a circle of fundamental ideas set out by Ayn Rand that will never change, though the consequences of those ideas certainly are not obvious and still needs to be worked out.
Objectivists believe that there is a single objective morality that can be achieved from Ayn Rand’s ideas if we only figure out the details well enough.
Now suppose that an Objectivist used your system. And the AIs came to the conclusion that there is no single objective morality obtainable by Ayn Rand’s ideas. But the conclusion required a long enumeration of different possible resolutions, only to find a problem in each one. With the enumeration, like the proof of the 4-color problem, being too long for any human to read.
What should the hypothetical Objectivist do upon obtaining the bad news? Abandon the idea of an absolute morality? Reject the result obtained by the intelligent AI? Ignore the contradiction?
Now I don’t know your epistemology. There might be no such possible conflict for you. I doubt there is for me. But in the abstract, this is something that really could happen to someone who thinks of themselves as truly rational.
I’m not sure what the hypothetical Objectivist ‘should do’, but I believe the options they have to choose from are:
(1) Choose to follow the full argument (in which case everything that it said made sense)
and they are no longer an Objectivist
or
(2) Choose to not follow the full argument (in which case some stuff didn’t make sense)
and they remain an Objectivist
In some sense, this is the case already. People are free to believe whatever they like. They can choose to research their beliefs and challenge them more. They might read things that convince them to change their position. If they do, are they “compelled” are they “forced”? I think they are in a way. I think this is a good type of control. Control by rational persuasion.
For the question of whether an external agent should impose its beliefs onto an agent choosing option (2), I think the answer is ‘no’. This is oppression.
I think the question you are getting at is, “Should a digital copy of yourself be able to make you do what you would be doing if you were smarter?”.
Most would say no, for obvious reasons. Nobody wants their AI bossing them around. This is mostly because we typically control other agents (boss them around) by force. We use rules and consequences.
What I’m suggesting, is that we will get so much better at controlling things through rational persuasion, that force will not be required for control. All that the ‘smarter version of yourself’ does is tell you what you probably need to hear. When you need to hear it. Like your conscience.
It’s important to retain the right to choose to listen to it.
In general, I see the alignment problem as a category error. There is no way to align artificial intelligence. AI isn’t really what we want to build. We want to build an oracle that can tell us everything. That’s a collective intelligence. A metaphorical brain that represents society by treating each member as a nerve on its spinal cord.
You may be missing context on my reference to the 4 color problem. The original 1976 proof, by Appel and Haken, took over 1000 hours of computer time to check. A human lifetime is too short to verify that proof. This eliminates your first option. The Objectivist cannot, even in principle, check the proof. Life is too short.
Your first option is therefore, by hypothesis, not an option. You can believe the AI or not. But you can’t actually check its reasoning.
The history of the 4 color problem proof shows this kind of debate. People argued for nearly 20 years about whether there might be a bug. Then an independent, and easier to check, computer proof came along in 1995. The debate mostly ended. More efficient computer generated proofs have since been created. The best that I’m aware of is 60,000 lines. In principle that would be verifiable by a human. But no human that I know of has actually bothered. Instead the proof was verified by the proof assistant Coq. And, today, most mathematicians trust Coq over any human.
We have literally come full circle on the 4 color problem. We started by asking whether we can trust a computer if a human can’t check it. And now we accept that a computer can be more trustworthy than a human!
However it took a long time to get the proof down to such a manageable size. And it took a long time to get a computer program that is so trustworthy that most believe it over themselves.
And so the key epistemological challenge. What would it take for you to trust an AI’s reasoning over your own beliefs when you’re unable to actually verify the AI’s reasoning?
I understood the context provided by your 4 color problem example.
What I’m unsure about is how that relates to your question.
Maybe I don’t understand the question you have.
I thought it was, “What should happen if both (1) everything it says makes sense and (2) you can’t follow the full argument?”.
My claim is “Following enough of an argument to agree is precisely what it means for something to make sense.”.
In the case of the four color problem, it sounds like for 20 years there were many folks that did not follow the full argument because it was too long for them to read. During that time, the conclusion did not make sense to them. Then, in 1995 a new shorter argument came along. One that they could follow. It included propositions that describe how the computer proofing system works.
For your latter question, “What would it take for me to trust an AI’s reasoning over my own beliefs when I’m unable to actually verify the AI’s reasoning?”. My answer is “A leap of faith.”. I would highly recommend that people not take leaps of faith. In general, I would not trust an AI’s reasoning if I were not able to actually verify the AI’s reasoning. This is why mechanistic interpretability is critical in alignment.
Instead they wound up reading the program. This time it was written in C—which is easier to follow. And the fact that there were now two independent proofs in different languages that ran on different computers greatly reduced the worries that one of them might have a simple bug.
I do not know that any human has ever tried to properly read any proof of the 4 color theorem.
Now to the issue. The overall flow and method of argument were obviously correct. Spot checking individual points gave results that were also correct. The basic strategy was also obviously correct. It was a basic, “We prove that if it holds in every one of these special cases, then it is true. Then we check each special case.” Therefore it “made sense”. The problem was the question, “Might there be a mistake somewhere?” After all proofs do not simply have to make sense, they need to be verified. And that was what people couldn’t accept.
The same thing with the Objectivist. You can in fact come up with flaws in proposed understandings of the philosophy fairly easily. It happens all the time. But Objectivists believe that, after enough thought and evidence, it will converge on the one objective version. The AI’s proposed proof therefore can make sense in all of the same ways. It would even likely have a similar form. “Here is a categorization of all of the special cases which might be true. We just have to show that each one can’t work.” You might look at them and agree that those sound right. You can look at individual cases and accept that they don’t work. But do you abandon the belief that somewhere, somehow, there is a way to make it work? As opposed to the AI saying that there is none?
As you said, it requires a leap of faith. And your answer is mechanistic interpretability. Which is exactly what happened in the end with the 4 color proof. A mechanistically interpretable proof was produced, and mechanistically interpreted by Coq. QED.
But for something as vague as a philosophy, I think it will take a long time to get to mechanistically interpretable demonstrations. And the thing which will do so is likely itself to be an AI...
It will not take a long time if we use collective intelligence to do it together. The technology is already here. I’ve been trying to share it with others that understand the value of doing this before AI learns to do it on its own. If you want to learn more about that, feel free to look me up on the ‘X’ platform @therealkrantz.
For math, it is already here. Several options exist, Coq is the most popular.
For philosophy, the language requirements alone need AI at the level of reasonably current LLMs. Which brings their flaws as well. Plus you need knowledge of human experience. By the time you put it together, I don’t see how a mechanistic interpreter can be anything less than a (hopefully somewhat limited) AI.
Which again raises the question of how we come to trust in it enough for it not to be a leap of faith.
Honest question about a hypothetical.
How would you respond if you set this up, and then your personal GPT concluded, from your epistemology and the information available to it, that your epistemology is fundamentally flawed and you should adopt a different one. Suppose further than when it tried to explain it to you, everything that it said made sense but you could not follow the full argument.
What should happen then? Should you no longer be the center of the prosthetic enabled “you” that has gone beyond your comprehension? Should the prosthetics do their thing, with the goal of supplying you with infinite entertainment instead of merely amplifying you? Should the prosthetics continue to be bound by the limitations of your mind? (Not necessarily crazy if you’re afraid of another advanced AI hacking your agents to subvert them.)
Obviously ChatGPT does not offer sufficient capabilities that this should happen. But if your future continues to the point where the agents augmenting your capabiliteis have AGI, this type of challenge will arise.
I’m not sure I can imagine a concrete example of an instance where both (1) everything that it said made sense and (2) I am not able to follow the full argument.
Maybe you could give me an example of a scenario?
I believe, if the alignment bandwidth is high enough, it should be the case that whatever an external agent does could be explained to ‘the host’ if that were what the host desired.
Concrete example.
Let’s presuppose that you are an Objectivist. If you don’t know about Objectivism, I’ll just give some key facts.
Objectivists place great value in rationality and intellectual integrity.
Objectivists believe that they have a closed philosophy. Meaning that there is a circle of fundamental ideas set out by Ayn Rand that will never change, though the consequences of those ideas certainly are not obvious and still needs to be worked out.
Objectivists believe that there is a single objective morality that can be achieved from Ayn Rand’s ideas if we only figure out the details well enough.
Now suppose that an Objectivist used your system. And the AIs came to the conclusion that there is no single objective morality obtainable by Ayn Rand’s ideas. But the conclusion required a long enumeration of different possible resolutions, only to find a problem in each one. With the enumeration, like the proof of the 4-color problem, being too long for any human to read.
What should the hypothetical Objectivist do upon obtaining the bad news? Abandon the idea of an absolute morality? Reject the result obtained by the intelligent AI? Ignore the contradiction?
Now I don’t know your epistemology. There might be no such possible conflict for you. I doubt there is for me. But in the abstract, this is something that really could happen to someone who thinks of themselves as truly rational.
I’m not sure what the hypothetical Objectivist ‘should do’, but I believe the options they have to choose from are:
(1) Choose to follow the full argument (in which case everything that it said made sense)
and they are no longer an Objectivist
or
(2) Choose to not follow the full argument (in which case some stuff didn’t make sense)
and they remain an Objectivist
In some sense, this is the case already. People are free to believe whatever they like. They can choose to research their beliefs and challenge them more. They might read things that convince them to change their position. If they do, are they “compelled” are they “forced”? I think they are in a way. I think this is a good type of control. Control by rational persuasion.
For the question of whether an external agent should impose its beliefs onto an agent choosing option (2), I think the answer is ‘no’. This is oppression.
I think the question you are getting at is, “Should a digital copy of yourself be able to make you do what you would be doing if you were smarter?”.
Most would say no, for obvious reasons. Nobody wants their AI bossing them around. This is mostly because we typically control other agents (boss them around) by force. We use rules and consequences.
What I’m suggesting, is that we will get so much better at controlling things through rational persuasion, that force will not be required for control. All that the ‘smarter version of yourself’ does is tell you what you probably need to hear. When you need to hear it. Like your conscience.
It’s important to retain the right to choose to listen to it.
In general, I see the alignment problem as a category error. There is no way to align artificial intelligence. AI isn’t really what we want to build. We want to build an oracle that can tell us everything. That’s a collective intelligence. A metaphorical brain that represents society by treating each member as a nerve on its spinal cord.
You may be missing context on my reference to the 4 color problem. The original 1976 proof, by Appel and Haken, took over 1000 hours of computer time to check. A human lifetime is too short to verify that proof. This eliminates your first option. The Objectivist cannot, even in principle, check the proof. Life is too short.
Your first option is therefore, by hypothesis, not an option. You can believe the AI or not. But you can’t actually check its reasoning.
The history of the 4 color problem proof shows this kind of debate. People argued for nearly 20 years about whether there might be a bug. Then an independent, and easier to check, computer proof came along in 1995. The debate mostly ended. More efficient computer generated proofs have since been created. The best that I’m aware of is 60,000 lines. In principle that would be verifiable by a human. But no human that I know of has actually bothered. Instead the proof was verified by the proof assistant Coq. And, today, most mathematicians trust Coq over any human.
We have literally come full circle on the 4 color problem. We started by asking whether we can trust a computer if a human can’t check it. And now we accept that a computer can be more trustworthy than a human!
However it took a long time to get the proof down to such a manageable size. And it took a long time to get a computer program that is so trustworthy that most believe it over themselves.
And so the key epistemological challenge. What would it take for you to trust an AI’s reasoning over your own beliefs when you’re unable to actually verify the AI’s reasoning?
I understood the context provided by your 4 color problem example.
What I’m unsure about is how that relates to your question.
Maybe I don’t understand the question you have.
I thought it was, “What should happen if both (1) everything it says makes sense and (2) you can’t follow the full argument?”.
My claim is “Following enough of an argument to agree is precisely what it means for something to make sense.”.
In the case of the four color problem, it sounds like for 20 years there were many folks that did not follow the full argument because it was too long for them to read. During that time, the conclusion did not make sense to them. Then, in 1995 a new shorter argument came along. One that they could follow. It included propositions that describe how the computer proofing system works.
For your latter question, “What would it take for me to trust an AI’s reasoning over my own beliefs when I’m unable to actually verify the AI’s reasoning?”. My answer is “A leap of faith.”. I would highly recommend that people not take leaps of faith. In general, I would not trust an AI’s reasoning if I were not able to actually verify the AI’s reasoning. This is why mechanistic interpretability is critical in alignment.
Nobody ever read the 1995 proof.
Instead they wound up reading the program. This time it was written in C—which is easier to follow. And the fact that there were now two independent proofs in different languages that ran on different computers greatly reduced the worries that one of them might have a simple bug.
I do not know that any human has ever tried to properly read any proof of the 4 color theorem.
Now to the issue. The overall flow and method of argument were obviously correct. Spot checking individual points gave results that were also correct. The basic strategy was also obviously correct. It was a basic, “We prove that if it holds in every one of these special cases, then it is true. Then we check each special case.” Therefore it “made sense”. The problem was the question, “Might there be a mistake somewhere?” After all proofs do not simply have to make sense, they need to be verified. And that was what people couldn’t accept.
The same thing with the Objectivist. You can in fact come up with flaws in proposed understandings of the philosophy fairly easily. It happens all the time. But Objectivists believe that, after enough thought and evidence, it will converge on the one objective version. The AI’s proposed proof therefore can make sense in all of the same ways. It would even likely have a similar form. “Here is a categorization of all of the special cases which might be true. We just have to show that each one can’t work.” You might look at them and agree that those sound right. You can look at individual cases and accept that they don’t work. But do you abandon the belief that somewhere, somehow, there is a way to make it work? As opposed to the AI saying that there is none?
As you said, it requires a leap of faith. And your answer is mechanistic interpretability. Which is exactly what happened in the end with the 4 color proof. A mechanistically interpretable proof was produced, and mechanistically interpreted by Coq. QED.
But for something as vague as a philosophy, I think it will take a long time to get to mechanistically interpretable demonstrations. And the thing which will do so is likely itself to be an AI...
It will not take a long time if we use collective intelligence to do it together. The technology is already here. I’ve been trying to share it with others that understand the value of doing this before AI learns to do it on its own. If you want to learn more about that, feel free to look me up on the ‘X’ platform @therealkrantz.
It depends on subject matter.
For math, it is already here. Several options exist, Coq is the most popular.
For philosophy, the language requirements alone need AI at the level of reasonably current LLMs. Which brings their flaws as well. Plus you need knowledge of human experience. By the time you put it together, I don’t see how a mechanistic interpreter can be anything less than a (hopefully somewhat limited) AI.
Which again raises the question of how we come to trust in it enough for it not to be a leap of faith.