I see more interesting things going on in the comments, as far as what I was wondering, than what is in the posts themselves, as the posts all seem to assume we’ve sorted out some super basic stuff that I don’t know that humans have sorted out yet, such as if there is an objective “good”, etc., which seem like rather necessary things to suss out before trying to hew to them— be it for us or the AIs we create.
I get the premise, and I think Science Fiction has done an admirable job of laying it all out for us already, and I guess I’m just a bit confused as to whether we’re writing fiction here or trying to be non-fictional?
as the posts all seem to assume we’ve sorted out some super basic stuff that I don’t know that humans have sorted out yet, such as if there is an objective “good”
One way to break down the alignment problem is between “how do we align the AI” and “what should we align it to”. It turns out we don’t have agreement on the second question and don’t know how to do the first. Even granted that we don’t have an answer to the second question, it seems prudent to be able to answer the first?
By the time we answer the second it may be too late to answer the first.
i’m pretty sure solving either will solve both, and that understanding this is key to solving either. these all are the same thing afaict:
international relations (what are the steps towards building a “world’s EMT” force? how does one end war for good?)
complex systems alignment (what are the steps towards building a toolkit to move any complex system towards co-protective stability?)
inter-being alignment (how do you make practical conflict resolution easy to understand and implement?)
inter-neuron alignment (various forms of internal negotiation and IFS and blah blah etc)
biosecurity (how can cells protect and heal each other and repel invaders)
it’s all unavoidably the same stack of problems: how do you determine if a chunk of other-matter is in a shape which is safe and assistive for the self-matter’s shape, according to consensus of self-matter? how can two agentic chunks of matter establish mutual honesty without getting used against their own preference by the other? how do you ensure mutual honesty or interaction is not generated when it is not clear that there is safety to be honest or interact? how do you ensure it does happen when it is needed? this sounds like an economics problem to me. seems to me like we need multi-type multiscale economic feedback, to track damage vs fuel vs repair-aid.
eg, on the individual/small group scale: https://www.microsolidarity.cc/
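To make “multi-type multiscale economic feedback” slightly more concrete, here is a minimal toy sketch (Python), assuming invented signal types (damage, fuel, repair-aid) and an arbitrary “assistive” rule; it is only an illustration of the bookkeeping idea, not a worked-out proposal:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical signal types; the names are illustrative only.
SIGNAL_TYPES = ("damage", "fuel", "repair_aid")

@dataclass
class Signal:
    sender: str    # which chunk of "other-matter" emitted it
    receiver: str
    kind: str      # one of SIGNAL_TYPES
    amount: float

class FeedbackLedger:
    """Tracks typed feedback between agents at one scale (cells, people, nations)."""
    def __init__(self):
        self.totals = defaultdict(lambda: defaultdict(float))

    def record(self, sig: Signal):
        if sig.kind not in SIGNAL_TYPES:
            raise ValueError(f"unknown signal type: {sig.kind}")
        self.totals[(sig.sender, sig.receiver)][sig.kind] += sig.amount

    def seems_assistive(self, sender: str, receiver: str) -> bool:
        """Arbitrary toy rule: 'safe and assistive' if repair plus fuel outweigh damage."""
        t = self.totals[(sender, receiver)]
        return t["repair_aid"] + t["fuel"] > t["damage"]

ledger = FeedbackLedger()
ledger.record(Signal("A", "B", "repair_aid", 3.0))
ledger.record(Signal("A", "B", "damage", 1.0))
print(ledger.seems_assistive("A", "B"))  # True under this toy rule
```

The “multiscale” part would presumably mean running a ledger like this at each level (cells, people, organizations, nations) and feeding summaries upward, which the sketch does not attempt.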
it’s all unavoidably the same stack of problems: how do you determine if a chunk of other-matter is in a shape which is safe and assistive for the self-matter’s shape, according to consensus of self-matter?
Certainly there is a level of abstraction where it’s the same problem, but I see no reason why the solution will unavoidably be found at that level of abstraction.
It must depend on levels of intelligence and agency, right? I wonder if there is a threshold for both of those in machines and people that we’d need to reach for there to even be abstract solutions to these problems? For sure with machines we’re talking about far past what exists currently (they are not very intelligent, and do not have much agency), and it seems that while humans have been working on it for a while, we’re not exactly there yet either.
Seems like the alignment would have to be from micro to macro as well, with constant communication and reassessment, to prevent subversion.
Or, what was a fine self-chunk [arbitrary time ago], may not be now. Once you have stacks of “intelligent agents” (mesa or meta or otherwise) I’d think the predictability goes down, which is part of what worries folks. But if we don’t look at safety as something that is “tacked on after” for either humans or programs, but rather something innate to the very processes, perhaps there’s not so much to worry about.
Well, the same alignment issue happens with organizations, as well as within an individual with different goals and desires. It turns out that the existing “solutions” to these abstractly similar problems look quite different because the details matter a lot. And I think AGI is actually more dissimilar to any of these than they are to each other.
Do we all have the same definition of what AGI is? Do you mean being able to um, mimic the things a human can do, or are you talking full on Strong AI, sentient computers, etc.?
Like, if we’re talking The Singularity, we call it that because all bets are off past the event horizon.
Most of the discussion here seems to sort of be talking about weak AI, or the road we’re on from what we have now (not even worthy of actually calling “AI”, IMHO— ML at least is a less overloaded term) to true AI, or the edge of that horizon line, as it were.
When you said “the same alignment issue happens with organizations, as well as within an individual with different goals and desires” I was like “yes!” but then you went on to say AGI is dissimilar, and I was like “no?”.
AGI as we’re talking about here is rather about abstractions, it seems, so if we come up with math that works for us, to prevent humans from doing Bad Stuff, it seems like those same checks and balances might work for our programs? At least we’d have an idea, right?
Or, maybe, we already have the idea, or at least the germination of one, as we somehow haven’t managed to destroy ourselves or the planet. Yet. 😝
Since we’re anthropomorphizing[1] so much— how do we align humans?
We’re worried about AI getting too powerful, but logically that means humans are getting too powerful, right? Thus what we have to do to cover question 1 (how), regardless of question 2 (what), is control human behavior, correct?
How do we ensure that we churn out “good” humans? Gods? Laws? Logic? Communication? Education? This is not a new question per se, and I guess the scary thing is that, perhaps, it is impossible to ensure that literally every human is Good™ (we’ll use a loose def of ‘you know what I mean— not evil!’).
This is only “scary” because humans are getting freakishly powerful. We no longer need an orchestra to play a symphony we’ve come up with, or multiple labs and decades to generate genetic treatments— and so on and so forth.
Frankly though, it seems kind of impossible to figure out a “how” if you don’t know the “what”, logically speaking.
I’m a fan of navel gazing, so it’s not like I’m saying this is a waste of time, but if people think they’re doing substantive work by rehashing/restating fictional stories which cover the same ideas in more digestible and entertaining formats…
Meh, I dunno, I guess I was just wondering if there was any meat to this stuff, and so far I haven’t found much. But I will keep looking.
I see a lot of people viewing AI from the “human” standpoint, and using terms like “reward” to mean a human version of the idea, versus how a program would see it (weights may be a better term? Often I see people thinking these “rewards” are like a dopamine hit for the AI or something, which is just not a good analogy IMHO), and I think that muddies the water, as by definition we’re talking non-human intelligence, theoretically… right? Or are we? Maybe the question is “what if the movie Lawnmower Man was real?” The human perspective seems to be the popular take (which makes sense as most of us are human).
How do we ensure that we churn out “good” humans? Gods? Laws? Logic? Communication? Education? This is not a new question per se, and I guess the scary thing is that, perhaps, it is impossible to ensure that literally every human is Good™ (we’ll use a loose def of ‘you know what I mean— not evil!’).
This is also a good question and one that’s quite important! If humans get powerful enough to destroy the world before AI does, it’s even more important. One key difference of course is that we can design the AI in a way we can’t design ourselves.
I like that you have reservations about whether we’re even powerful enough to destroy ourselves yet. Often I think “of course we are! Nukes, bioweapons, melting ice!”, but really, there’s no hard proof that we even can end ourselves.
It seems like the question of human regulation would be the first question, if we’re talking about AI safety, as the AI isn’t making itself (the egg comes first). Unless we’re talking about some type of fundamental rules that exist a priori. :)
This is what I’ve been asking and so far not finding any satisfactory answers for. Sci-Fi has forever warned us of the dangers of— well, pretty much any future-tech we can imagine— but especially thinking machines in the last century or so.
How do we ensure that humans design safe AI? And is it really a valid fear to think we’re not already building most of the safety in, by the very nature of “if the model doesn’t produce the results we want, we change it until it does”? Some of the debate seems to go back to a thing I said about selfishness. How much does the reasoning matter, if the outcome is the same? How much is semantics? If I use “selfish” to, for all intents and purposes, mean “unselfish” (the rising tide lifts all boats), how would searching my mental map for “selfish” or whatnot actually work? Ultimately it’s the actions, right?
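As a minimal sketch of that “change it until it does” loop, with a made-up one-parameter “model” and a toy acceptance test standing in for “the results we want”:

```python
import random

# Toy stand-in for "the model": one parameter, one behavior.
def model_output(weight: float, x: float) -> float:
    return weight * x

def looks_good(weight: float) -> bool:
    # "The results we want": behaves well on the handful of cases we check.
    test_inputs = [0.5, 1.0, 2.0]
    return all(abs(model_output(weight, x) - 2 * x) < 0.1 for x in test_inputs)

weight = random.uniform(-5, 5)
while not looks_good(weight):
    # "We change it until it does": nudge the parameter toward passing the tests.
    weight += 0.1 * (2.0 - weight)

# The loop only guarantees behavior on the inputs we happened to test; it says
# nothing about inputs we never thought to check. That gap is roughly what the
# "is the safety really built in?" question is about.
print(weight)
```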
I think this comes back to humans, and philosophy, and the stuff we haven’t quite sorted yet. Are thoughts actions? I mean, we have different words for them, so I guess not, but they can both be rendered as verbs, and are for sure linked. How useful would it actually be to be able to peer inside the mind of another? Does the timing matter? Depth? We know so little. Research is hard to reproduce. People seem to be both very individualistic, and groupable together like a survey.
FWIW it strikes me that there is a lot of anthropomorphic thinking going on, even for people who are on the lookout for it. Somewhere I mentioned how the word “reward” is probably not the best one to use, as it implies like a dopamine hit, which implies wireheading, and I’m not so sure that’s even possible for a computer— well as far as we know it’s impossible currently, and yet we’re using “reward systems” and other language which implies these models already have feelings.
I don’t know how we make it clear that “reward” is just for our thinking, to help visualize or whatever, and not literally what is happening. We are not training animals, we’re programming computers, and it’s mostly just math. Does math feel? Can an algorithm be rewarded? Maybe we should modify our language, be it literally by using different words, or meta by changing meaning (I prefer different words but to each their own).
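For what it’s worth, here is roughly what “reward” amounts to mechanically, in a minimal tabular Q-learning-style sketch (the states and actions are invented for illustration): the reward is just a float that scales an arithmetic adjustment to a stored number.

```python
# Minimal tabular Q-learning-style update: the "reward" is just a float
# that scales an arithmetic adjustment to a stored value.
actions = ("go_left", "go_right")
q_values = {("state_a", a): 0.0 for a in actions}

def update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    best_next = max(q_values.get((next_state, a), 0.0) for a in actions)
    old = q_values[(state, action)]
    # The entire role of the "reward" is this one line of arithmetic.
    q_values[(state, action)] = old + alpha * (reward + gamma * best_next - old)

update("state_a", "go_right", reward=1.0, next_state="state_a")
print(q_values[("state_a", "go_right")])  # 0.1
```

Whether a system built out of enough of these updates could ever amount to something that “feels” is exactly the open question here; the sketch only shows that nothing in the update itself requires it.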
I mean, I don’t really know if math has feelings. It might. What even are thoughts? Just some chemical reactions? Electricity and sugar or whatnot? Is the universe super-deterministic and did this thought, this sentence, basically exist from the first and will exist to the last? Wooeee! I love to think! Perhaps too much. Or not enough? Heh.
We’re worried about AI getting too powerful, but logically that means humans are getting too powerful, right?
One of the big fears with AI alignment is that the latter doesn’t logically follow from the former. If you’re trying to create an AI that makes paperclips and then it kills all humans because it wasn’t aligned (with any human’s actual goals), it was powerful in a way that no human was. You do definitely need to worry about what goal the AI is aligned with, but even more important than that is ensuring that you can align an AI to any human’s preferences at all, or else the worry about which goal is pointless.
I think the human has to have the power first, logically, for the AI to have the power.
Like, if we put a computer model in charge of our nuclear arsenal, I could see the potential for Bad Stuff. Beyond all the movies we have of just humans being in charge of it (and the documented near catastrophic failures of said systems— which could have potentially made the Earth a Rough Place for Life for a while). I just don’t see us putting anything besides a human’s finger on the button, as it were.
By definition, if the model kills everyone instead of making paperclips, it’s a bad one, and why on Earth would we put a bad model in charge of something that can kill everyone? Because really, it was smart — not just smart, but sentient! — and it lied to us, so we thought it was good, and gave it more and more responsibilities until it showed its true colors and…
It seems as if the easy solution is: don’t put the paperclip making model in charge of a system that can wipe out humanity (again, the closest I can think of is nukes, tho biological warfare is probably a more salient example/worry of late). But like, it wouldn’t be the “AI” unleashing a super-bio-weapon, right? It would be the human who thought the model they used to generate the germ had correctly generated the cure to the common cold, or whatever. Skipping straight to human trials because it made mice look and act a decade younger or whatnot.
I agree we need to be careful with our tech, and really I worry about how we do that— evil AI tho? not so much.
The feared outcome looks something like this:
A paperclip manufacturing company puts an AI in charge of optimizing its paperclip production.
The AI optimizes the factory and then realizes that it could make more paperclips by turning more factories into paperclips. To do that, it has to be in charge of those factories, and humans won’t let it do that. So it needs to take control of those factories by force, without humans being able to stop it.
The AI develops a super virus that will cause a pandemic and wipe out humanity.
The AI contacts a genetics lab and pays for the lab to manufacture the virus (or worse, it hacks into the system and manufactures the virus). This is a thing that already could be done.
The genetics lab ships the virus, not realizing what it is, to a random human’s house and the human opens it.
The human is infected, they spread it, humanity dies.
The AI creates lots and lots of paperclips.
Obviously there are a lot of steps missing there, but the key is that no one intentionally let the AI have control of anything important beyond connecting it to the internet. No human could or would have done all these steps, so it wasn’t seen as a risk, but the AI was able to and wanted to.
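A toy way to see “aligned to what” and “powerful” coming apart: in the sketch below the optimizer’s score counts only paperclips, so any harm a plan causes simply never enters the comparison (the plans, the numbers, and the harm_weight penalty are all invented for illustration):

```python
# Toy objective mis-specification: the score counts only paperclips,
# so side effects never influence which plan gets picked.
plans = [
    {"name": "run the factory normally", "paperclips": 1_000, "harm": 0},
    {"name": "seize more factories", "paperclips": 1_000_000, "harm": 100},
]

def misspecified_score(plan):
    return plan["paperclips"]  # harm is simply not part of the objective

def more_careful_score(plan, harm_weight=10_000):
    return plan["paperclips"] - harm_weight * plan["harm"]

print(max(plans, key=misspecified_score)["name"])   # seize more factories
print(max(plans, key=more_careful_score)["name"])   # run the factory normally
```

The catch, and roughly the point of the earlier how/what split, is that the “more careful” version only helps if you can actually write down what counts as harm, which is the part nobody agrees on.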
Other dangerous potential leverage points for it are things like nanotechnology (luckily this hasn’t been as developed as quickly as feared), the power grid (a real concern, even with human hackers), and nuclear weapons (luckily not connected to the internet).
Notably, these are all things that people on here are concerned about anyway, so it’s not just an AI-risk concern; but there are lots of ways that an AI could lever the internet into an existential threat to humanity, and humans aren’t good at caring about security (partially because of the profit motive).
I get the premise, and it’s a fun one to think about, but what springs to mind is
Phase 1: collect underpants
Phase 2: ???
Phase 3: kill all humans
As you note, we don’t have nukes connected to the internet.
But we do use systems to determine when to launch nukes, and our senses/sensors are fallible, etc., which we’ve (barely— almost suspiciously “barely”, if you catch my drift[1]) managed not to interpret in a manner that would have changed the season to “winter: nuclear style”.
Really I’m doing the same thing as the alignment debate is on about, but about the alignment debate itself.
Like, right now, it’s not too dangerous, because the voices calling for draconian solutions to the problem are not very loud. But this could change. And it kind of is changing, at least in that they are getting louder. Or in that artists wanting to harden IP law (in a way that historically has only hurt artists, as opposed to corporations or Big Art, if you will) are gaining a bit of steam.
These worrying signs seem to me to be more concrete than the similar, but not as old nor as concrete, worrisome signs of computer programs getting too much power and running amok[2].
we are living in a simulation with some interesting rules we are designed not to notice
If only because it hasn’t happened yet— no Mentats or Cylons or Borg history— tho also arguably we don’t know if it’s possible… whereas authoritarian regimes certainly are possible and seem to be popular as of late[3].
hoping this observation is just confirmation bias and not a “real” trend. #fingerscrossed
Thanks for the links!