But WHY would the AGI “want” anything at all unless humans gave it a goal (or goals)? If it’s a complex LLM-predictor, what could it want besides calculating a prediction of its own predictions? Why would it, by default, want anything at all unless we assigned that as a goal and turned it into an agent? IF AGI got hell-bent on its own survival and on improving itself to maximize goal “X”, even then it might value the informational formations of our atoms more than the energy it could gain from those atoms, depending on what “X” is. Same goes for other species: evolution itself holds information. Even in the case of a rogue AGI, for at least some time window we could have something to offer.
A sufficiently capable AI takes you apart instead of trading with you at the point that it can rearrange your atoms into an even better trading partner.[1] And humans are probably not the optimal trading partners.
But WHY would the AGI “want” anything at all unless humans gave it a goal (or goals)?
There are two main ways we make AIs:
1. writing programs that evaluate actions they could take in terms of how well each could achieve some goal, and choosing the best one
2. taking a big neural network and jiggling the numbers that define it until it starts doing some task we pre-designated.
In way 1, it seems like your AI “wants” to achieve its goal in the relevant sense. In way 2, it seems like for hard enough goals, probably the only way to achieve them is to be thinking about how to achieve them and picking actions that succeed—or to somehow be doing cognition that leads to similar outcomes (like being sure to think about how well you’re doing at stuff, how to manage resources, etc.).
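As a minimal sketch of the contrast (a toy problem; all names and numbers are invented for illustration, not any real system’s code), way 1 scores candidate actions against an explicit goal and picks the best, while way 2 starts from a parameterized model and nudges its numbers by gradient descent until it does a pre-designated task:

```python
# Way 1: evaluate candidate actions against an explicit goal, choose the best.
def plan(actions, goal_score):
    """Pick the action that scores highest under the goal."""
    return max(actions, key=goal_score)

# Toy goal: move a state (currently 3) as close to 10 as possible.
best_action = plan(actions=[-1, 0, 1, 2], goal_score=lambda a: -abs(10 - (3 + a)))

# Way 2: take a parameterized model (here a single weight w) and jiggle the
# number that defines it, via gradient descent on a loss, until it does a
# pre-designated task: fitting y = 2x on a few examples.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, lr = 0.0, 0.01
for _ in range(500):
    grad = sum(2 * (w * x - y) * x for x, y in data)  # d/dw of squared error
    w -= lr * grad

print("way 1 chooses action:", best_action)   # 2: gets the state closest to 10
print("way 2 ends up with w ~", round(w, 3))  # ~2.0, i.e. it does the task
```

The point of the toy way-2 loop is that nothing in it mentions a goal explicitly; “doing the task” is just whatever the jiggling converges to.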
IF AGI got hell-bent on its own survival and on improving itself to maximize goal “X”, even then it might value the informational formations of our atoms more than the energy it could gain from those atoms,
It might—but if an alien wanted to extract as much information out of me as possible, it seems like that’s going to involve limiting my ability to mess with that alien’s sensors at a minimum, and plausibly involves just destructively scanning me (depending on what type of info the alien wants). For humans to continue being free-range, it needs to be the case that the AI wants to know how we behave under basically no limitations, and also that the AI isn’t able to simulate us well enough to answer that question—which sounds like a pretty specific goal for an AI to have, such that you shouldn’t expect an AI to have that sort of goal without strong evidence.
humans are probably not the optimal trading partners.
Probably? Based on what?
Most things aren’t the optimal trading partner for any given intelligence, and it’s hard to see why humans should be so lucky. The best answer would probably be “because the AI is designed to be compatible with humans and not other things” but that’s going to rely on getting alignment very right.
1. writing programs that evaluate actions they could take in terms of how well each could achieve some goal, and choosing the best one
In way 1, it seems like your AI “wants” to achieve its goal in the relevant sense.
Not sure if I understood correctly, but I think the first point just comes down to “we give the AI a goal or goals”. If we develop some drive for instructing an AI’s actions, then we’re still giving it a goal, even if it comes via some other program that tells it what those goals are at the moment, in relation to whatever parameters. My original point was to contrast between AI having a goal or goals as some emerging property of large neural networks versus us humans giving it goals one way or the other.
2. taking a big neural network and jiggling the numbers that define it until it starts doing some task we pre-designated.
In way 2, it seems like for hard enough goals, probably the only way to achieve them is to be thinking about how to achieve them and picking actions that succeed—or to somehow be doing cognition that leads to similar outcomes (like being sure to think about how well you’re doing at stuff, how to manage resources, etc.).
Do you mean to say that we train something like a specialized neural network with a specific goal in mind, and that it gains higher reasoning which sets it on the path of pursuing that goal? I mean, that would still be us giving it a direct goal. Or do you mean that neural networks would develop an indirect goal as a side product of training conditions or via some hidden variable?
With the indirect goal acquisition I mean that, for example, if ChatGPT has been conditioned to spit out polite and intelligent-sounding words, then if it gained some higher intelligence it could specifically seek to cram more information into itself so it could spit out more clever-sounding words, and eventually begin consuming matter and flesh to better serve this goal. By a hidden goal variable I mean something like ChatGPT having a hidden goal of burning the maximum amount of energy; say the model found a hidden property by which it could draw more power from the processor, which also helped it a tiny bit in the beginning of training. Then as training grew more restrictive, this goal became “burn as much energy as possible within these restrictions”, which to researchers yielded more elaborate-looking outputs. Then if the model at some point gained some higher reasoning, it could just remove all the limiters and begin pursuing its original goal, burning everything via some highly specific and odd process. Something like this?
Most things aren’t the optimal trading partner for any given intelligence, and it’s hard to see why humans should be so lucky. The best answer would probably be “because the AI is designed to be compatible with humans and not other things” but that’s going to rely on getting alignment very right.
I mean, the AI would already have strong connections to us, some kind of understanding, and plenty of prerequisite knowledge. Optimal is an ambiguous term and we have no idea what a super-intelligent AI would have in mind. Optimal at what? Maybe we are very good at wanting things and our brains make us ideally suited for some brain-machines? Or us being made out of biological stuff makes us optimal for force-evolving to work in some radioactive wet super-magnets where most machines can’t function for long, and it comes off as more resourceful to modify us than to build and maintain some special machine units for the job. We just don’t know, so I think it’s fairer to say “likely not much to offer for a super-intelligent maximizer”.
Re: optimality in trading partners, I’m talking about whether humans are the best trading partner out of trading partners the AI could feasibly have, as measured by whether trading with us gets the AI what it wants. You’re right that we have some advantages, mainly that we’re a known quantity that’s already there. But you could imagine more predictable things that sync with the AI’s thoughts better, operate more efficiently, etc.
We just don’t know, so I think it’s fairer to say “likely not much to offer for a super-intelligent maximizer”.
Maybe we agree? I read this as compatible with the original quote “humans are probably not the optimal trading partners”.
Or do you mean that neural networks would develop an indirect goal as a side product of training conditions or via some hidden variable?
This one: I mean that, given the way we train AIs, the things that will emerge are things that pursue goals, at least in some weak sense. So, e.g., suppose you’re training an AI to write valid math proofs via way 2. Probably the best way to do that is to try to gain a bunch of knowledge about math, use your computation efficiently, figure out good ways of reasoning, etc. And the idea would be that as the system gets more advanced, it’s able to pursue these goals more and more effectively, which ends up disempowering humans (because we’re using a bunch of energy that could be devoted to running computations).
My original point was to contrast between AI having a goal or goals as some emerging property of large neural networks versus us humans giving it goals one way or the other.
Fair enough—I just want to make the point that humans giving AIs goals is a common thing. I guess I’m assuming in the background “and it’s hard to write a goal that doesn’t result in human disempowerment” but didn’t argue for that.
Plenty of humans will give their AIs explicit goals. Evidence: plenty of humans do so now. Sure, purely self-supervised models are safer than people here were anticipating, and those of us who saw that coming and were previously laughed out of town are now vindicated. But that does not mean we’re safe; it just means that wasn’t enough to build a desperation bomb, a superreplicator that can actually eat, in the literal sense of the word, the entire world. That is what we’re worried about—AI causing a sudden jump in the competitive fitness of hypersimple life. It’s not quite as easy as some have anticipated, sure, but it’s very much permitted by physics.
The question as stated was: But WHY would the AGI “want” anything at all unless humans gave it a goal (or goals)?
ok, how’s this then: https://arxiv.org/abs/2303.16200
The paper starts with the assumption that humans will create many AI agents and assign some of them selfish goals, and that this, combined with competitive pressure and other factors, may presumably create a Moloch-y situation where the most selfish and immoral AIs will propagate and evolve—leading to loss of control and the downfall of the human race. The paper in fact does not advocate the idea of a single AI foom. While the paper itself makes some valid points, it does not answer my initial question and critique of the OP.
Fair enough.
There has never been a good answer to that.
it is not in fact the case that long term wanting appears in models out of nowhere. but short term wanting can accumulate into long term wanting, and more to the point people are simply trying to build models with long term wanting on purpose.
Again, the question is why goals would arise without human intervention.
evolution, which is very fast for replicable software. but more importantly, humans will give ais goals, and from there the point is much more obvious.
“Humans will give the AI goals” doesn’t answer the question as stated. It may or may not answer the underlying concerns.
(Edit: human-given goals are slightly less scary too)
evolution, which is very fast for replicable software
Evolution by random mutation and natural selection is barely applicable here. The question is how goals and deceit would emerge under conditions of artificial selection. Since humans don’t want either, they would have to emerge together.
artificial selection is a subset of natural selection. see also memetic mutation. but why would human-granted goals be significantly less scary? plenty of humans are just going to ask for the most destructive thing they can think of, because they can. if they could, people would have built and deployed nukes at home; even with the knowledge as hard to fully flesh out and the tools as hard to get as they are, it has been attempted (and of course it didn’t get particularly far).
I do agree that the situation we find ourselves in is not quite as dire as if the only kind of ai that worked at all was AIXI-like. but that should be of little reassurance.
I do understand your objection about how goals would arise in the ai, and I’m just not considering the counterfactual you’re requesting deeply because on the point you want to disagree on, I simply agree, and don’t find that it influences my views much.
artificial selection is a subset of natural selection
Yes. The question is: why would we artificially select what’s harmful to us? Even though artificial selection is a subset of natural selection, it’s a different route to danger.
plenty of humans are just going to ask for the most destructive thing they can think of, because they can.
The most destructive thing you can think of will kill you too.
yeah, the people who would do it are not flustered by the idea that it’ll kill them. maximizing doomsday weapon strength just for the hell of it is in fact a thing some people try. unless we can defend against it, it’ll dominate—and it seems to me that current plans for how to defend against the key paths to superweaponhood are not yet plausible. we must end all vulnerabilities in biology and software. serious ideas for how to do that would be appreciated. otherwise, this is my last reply in this thread.
If everybody has some access to ASI, the crazy people do, and the sane people do as well. The good thing about ASI is that even active warfare need not be destructive...the white hats can hold off the black hats even during active warfare, because it’s all fought with bits.
A low power actor would need a physical means to kill everybody...like a supervirus. So those are the portals you need to close.
because when you train something using gradient descent optimised against a loss function, it de facto has some kind of utility function. You can’t accomplish all that much without a utility function.
a utility function is a particular long-term formulation of a preference function; in principle any preference function is convertible to a utility function, given zero uncertainty about the space of possible future trajectories. a preference is when a system tends to push the world towards some trajectories over others. not only can you not accomplish much without your behavior implying a utility function, it’s impossible to not have an implicit utility function, as you can define a revealed preference utility function for any hunk of matter.
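As a hedged illustration of that conversion (a toy, fully known, finite set of trajectories with a consistent preference between every pair; all names and values are invented): given which trajectory a system pushes the world toward in each pairwise comparison, you can always write down a number per trajectory that reproduces those choices, i.e. a revealed-preference utility function.

```python
# Toy "hunk of matter": a thermostat-like system whose revealed preference is
# whichever trajectory ends closer to 20 degrees. All names/values are invented.
trajectories = ["end_at_5", "end_at_15", "end_at_20", "end_at_30"]
end_temp = {"end_at_5": 5, "end_at_15": 15, "end_at_20": 20, "end_at_30": 30}

def prefers(a, b):
    """Revealed preference: does the system push the world toward a over b?"""
    return abs(end_temp[a] - 20) < abs(end_temp[b] - 20)

# Convert the preference into a utility function: score each trajectory by how
# many alternatives it is preferred to. With a complete, transitive preference
# and no uncertainty about the trajectory space, this number reproduces every
# pairwise choice, which is all "having a utility function" means here.
utility = {t: sum(prefers(t, other) for other in trajectories) for t in trajectories}

# Sanity check: maximizing the constructed utility agrees with every revealed choice.
assert all(utility[a] > utility[b]
           for a in trajectories for b in trajectories if prefers(a, b))
print(utility)  # {'end_at_5': 0, 'end_at_15': 2, 'end_at_20': 3, 'end_at_30': 1}
```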
doesn’t mean that the system is evaluating things using a zero computational uncertainty model of the future like in the classic utility maximizer formulation though. I think evolutionary fitness is a better way to think about this—the preferences that preserve themselves are the ones that win.
it’s impossible to not have an implicit utility function, as you can define a revealed preference utility function for any hunk of matter.
Yes, you can “prove” that everything has a UF by trivializing UF, and this has been done many times, and it isn’t a good argument because of the trivialization.
I think evolutionary fitness is a better way to think about this—the preferences that preserve themselves are the ones that win.
The preferences that please humans are the ones that win.
yes, that was my point about ufs.
The preferences that please humans are the ones that win.
aha! what about preferences that help humans hurt each other? we need only imagine ais used in war as their strength grows. the story where ai jump on their own to malice is unnecessary, humans will boost it to that directly. oh, also scammers.