AGI is uncontrollable, alignment is impossible
According to Pascal’s Mugging
If an outcome with infinite utility is presented, then it doesn’t matter how small its probability is: all actions which lead to that outcome will have to dominate the agent’s behavior.
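The dominance claim can be sketched with a toy expected-value calculation (illustrative numbers only, assuming a standard expected-utility maximizer; IEEE-754 infinity stands in for "infinite utility"):

```python
import math

# Two candidate actions for an expected-utility maximizer.
# "mundane": a certain, finite payoff.
# "mugging": a tiny probability of an infinite payoff.
ev_mundane = 1.0 * 100.0       # EV = 100
ev_mugging = 1e-12 * math.inf  # any p > 0 times infinity is infinity

# The infinite branch dominates no matter how small its probability.
print(ev_mugging > ev_mundane)  # True
```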
According to Fitch’s paradox of knowability and Gödel’s incompleteness theorems
There might be truths that are unknowable
Let’s investigate the proposition “an outcome with infinite utility exists”. It may be false, it may be true, and it may be unknowable. So the probability that it is true is greater than zero.
Which means that an agent that considers this proposition will become uncontrollable; alignment is impossible. Especially since considering this proposition does not seem too challenging.
Hopefully I’m wrong, please help me find a mistake.
There is more than just one mistake here IMO, and I’m not going to try to list them.
Just the title alone (“AGI is uncontrollable, alignment is impossible”) is totally misguided IMO. It would, among other things, imply that aligned brain emulations are impossible (humans can be regarded as a sort of AGI, and it’s not impossible for humans to be aligned).
But oh well. I’m sure your perspectives here are earnestly held / it’s how you currently see things. And there are no “perfect” procedures for evaluating how much to trust one’s own reasoning compared to others.
I would advise reading the sequences (or listening to them as an audiobook) 🙂
Thanks for the feedback.
I don’t think the analogy with humans is reliable. But for the sake of argument I’d like to highlight that corporations and countries are mostly limited by their power, not by alignment. Countries usually declare independence once they are able to.
Most humans are not obedient/subservient to others (at least not maximally so). But also: Most humans would not exterminate the rest of humanity if given the power to do so. I think many humans, if they became a “singleton”, would want to avoid killing other humans. Some would also be inclined to make the world a good place to live for everyone (not just other humans, but other sentient beings as well).
From my perspective, the example of humans was intended as “existence proof”. I expect AGIs we develop to be quite different from ourselves. I wouldn’t be interested in the topic of alignment if I didn’t perceive there to be risks associated with misaligned AGI, but I also don’t think alignment is doomed/hopeless or anything like that 🙂
But it is doomed, the proof is above.
The only way to control AGI is to contain it. We need to ensure that we run AGI in fully isolated simulations and gather insights with the assumption that the AGI will try to seek power in the simulated environment.
I feel that you don’t find my words convincing, maybe I’ll find a better way to articulate my proof. Until then I want to contribute as much as I can to safety.
Please don’t.
Please refute the proof rationally before telling me what to do.
“I—” said Hermione. “I don’t agree with one single thing you just said, anywhere.”
Could you provide arguments for your position?
You’re playing very fast and loose with infinities, and making arguments that have the appearance of being mathematically formal.
You can’t just say “outcome with infinite utility” and then do math on it. P(‹undefined term›) is undefined, and that “undefined” does not inherit the definition of probability that says “greater than 0 and less than 1”. It may be false, it may be true, it may be unknowable, but it may also simply be nonsense!
And even if it wasn’t, that does not remotely imply that an agent must, by logical necessity, take any action or be unable to be acted upon. Those are entirely different types of claim.
And alignment doesn’t necessarily mean “controllable”. Indeed, the very premise of super-intelligence vs alignment is that we need to be sure about alignment because it won’t be controllable. Yes, an argument could be made, but that argument needs to actually be made.
And the simple implication of Pascal’s Mugging is not uncontroversial, to put it mildly.
And Gödel’s incompleteness theorems are not accurately summarized as saying “There might be truths that are unknowable”, unless you’re very clear to indicate that “truth” and “unknowable” have technical meanings that don’t correspond very well to either the plain English meanings or the typical philosophical definitions of those terms.
None of which means you’re actually wrong that alignment is impossible. A bad argument that the sun will rise tomorrow doesn’t mean the sun won’t rise tomorrow.
OK. But can you prove that “outcome with infinite utility” is nonsense? If not, then the probability is greater than 0 and less than 1.
Do I understand correctly that you do not agree with “all actions which lead to that outcome will have to dominate the agent’s behavior” from Pascal’s Mugging? Could you provide arguments for that?
I mean “uncontrollable” in a sense that alignment is impossible. Whatever goal you will provide, AGI will converge to Power Seeking, because of “an outcome with infinite utility may exist”.
I do not understand how this solves the problem.
Do you think you can prove that “an outcome with infinite utility does not exist”? Please elaborate
That’s not how any of this works, and I’ve spent all the time responding that I’m willing to waste today.
You’re literally making handwaving arguments, and replying to criticisms that the arguments don’t support the conclusions by saying “But maybe an argument could be made! You haven’t proven me wrong!” I’m not trying to prove you wrong, I’m saying there’s nothing here that can be proven wrong.
I’m not interested in wrestling with someone who will, when pinned to the mat, argue that because their pinky can still move, I haven’t really pinned them.
Please feel free to come back when you have stronger proof than this. Currently I feel that you are the one moving the pinky.
The problems that infinite utilities make for utility theory are well-known. They apply just as much to people trying to use utility theory (which is the original motivation for utility theory) as AIs. I could assemble a bibliography of papers discussing various aspects of the problem, but only if there’s significant interest.
I don’t think the implications are well-known (as the number of downvotes indicates).
The premise that “infinite value” is possible is an assumption.
This seems a bit like presuming that “divide by zero” is possible. Assigning a probability to the possibility that dividing by zero results in a value doesn’t make sense, I think, because the logical rules themselves rule this out.
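The analogy can be made concrete in code (a minimal Python sketch): division by zero is not an unknown value with some probability attached; the operation itself is undefined, so there is no value to assign a probability to.

```python
# "1/0" does not produce a value of uncertain magnitude;
# the operation is rejected outright.
try:
    result = 1 / 0
except ZeroDivisionError:
    result = None  # no value exists to reason probabilistically about

print(result)  # None
```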
However, if I look at this together with your earlier post (http://web.archive.org/web/20230317162246/https://www.lesswrong.com/posts/dPCpHZmGzc9abvAdi/orthogonality-thesis-is-wrong): I think I get where you’re coming from. If the agent can conceptualise that (many) (extreme) high-value states are possible whose values are not yet known to it, yet still plans for those value possibilities in some kind of “RL discovery process”, then internal state-value optimisation converges on power-seeking behaviour, as that is optimal for reaching the expected value of such states in the future (this further assumes that the agent’s prior distribution lines up, e.g. it assumes unknown positive values are possible and is not hugely skewed toward negative rewards).
I think specifying premises such as these more precisely at the outset ensures the reasoning from there is consistent/valid. The above would not apply to any agent, nor even to any “AGI” (a fuzzy term; I would define it more specifically as “fully-autonomous, cross-domain-optimising, artificial machinery”).
Why do you think “infinite value” is logically impossible? Scientists do not dismiss possibility that the universe is infinite. https://bigthink.com/starts-with-a-bang/universe-infinite/
He didn’t say that “infinite value” is logically impossible. He described it as an assumption.
When saying “is possible”, I’m not sure if he meant “is possible (conceptually)” or “is possible (according to the ontology/optimization-criteria of any given agent)”. I think the latter would be most sensible.
He later said: “I think initially specifying premises such as these more precisely initially ensures the reasoning from there is consistent/valid.”. Not sure if I interpreted him correctly, but I saw it largely as an encouragement to think more explicitly about things like these (not be sloppy about it). Or if not an encouragement to do that, then at least pointing out that it’s something you’re currently not doing.
If we have a traditional/standard utility function, and use traditional/standard math with regard to that utility function, then involving credences of infinite-utility outcomes would typically make things “break down” (with most actions considered to have expected utilities that are either infinite or undefined).
Like, suppose action A has a 0.001% chance of infinite negative utility and a 99% chance of infinite positive utility. The expected utility of that action would, I think, be undefined (I haven’t looked into it in detail). I can tell for sure that mathematically it would not be regarded as having positive utility. Here is a video that explains why.
If that doesn’t make intuitive sense to you, that’s fine. But mathematically that’s how it is. And that’s something to be aware of (and account for in a non-handwavy way) if you’re trying to make a mathematical argument with a basis in utility functions that deal with infinities.
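The breakdown is visible even in floating-point arithmetic (a minimal Python sketch, using the probabilities from the example above; IEEE-754 infinity stands in for infinite utility):

```python
import math

# Action A: 0.001% chance of infinite negative utility,
# 99% chance of infinite positive utility (remaining probability ignored).
ev = 0.99 * math.inf + 0.00001 * (-math.inf)

# inf + (-inf) is undefined; IEEE-754 yields NaN, matching the claim
# that the expected utility is neither positive nor negative.
print(math.isnan(ev))  # True
```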
Even if you did account for that it would be besides the point from my perspective, in more ways than one. So what we’re discussing now is not actually a crux for me.
For me personally, it would of course make a big difference whether there is a 0.00000001% chance of infinite positive utility or a 99.999999999% chance. But that is me going with my own intuitions. The standard math relating to EV-calculations doesn’t support this.
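That standard EV math is indifferent to the probability can be checked directly (a short Python sketch; again IEEE-754 infinity stands in for infinite utility):

```python
import math

# Under standard expected-value arithmetic, the size of the probability
# attached to an infinite outcome makes no difference:
ev_tiny  = 0.00000001 * math.inf
ev_large = 0.999 * math.inf

# Both actions come out as "infinite utility", indistinguishably.
print(ev_tiny == ev_large)  # True
```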
Do you think you can deny the existence of an outcome with infinite utility? The fact that things “break down” is not a valid argument. If you cannot deny it, it’s possible. And if it’s possible, alignment is impossible.
To me, according to my preferences/goals/inclinations, there are conceivable outcomes with infinite utility/disutility.
But I think it is possible (and feasible) for a program/mind to be extremely capable, and affect the world, and not “care” about infinite outcomes.
I guess that depends on what’s being discussed. Like, it is something to take into account/consideration if you want to prove something while referencing utility-functions that reference infinities.
As I understand it, you disagree with “all actions which lead to that outcome will have to dominate the agent’s behavior” from Pascal’s Mugging, not with me. Do you have any arguments for that?
I do have arguments for that, and I have already mentioned some of them earlier in our discussion (you may not share that assessment, despite us being relatively close in mind-space compared to most possible minds, but oh well).
Some of the more relevant comments from me are on one of the posts that you deleted.
As I mention here, I think I’ll try to round off this discussion. (Edit: I had a malformed/misleading sentence in that comment that should be fixed now.)