I notice that I am confused by not seeing discourse about using AI alignment solutions for human alignment. It seems like the world as we know it is badly threatened by humans behaving in ways I’d describe as poorly aligned, for an understanding of “alignment” formed mostly from context in AI discussions in this community.
I get that AI is different from people—we assume it’s much “smarter”, for one thing. Yet every “AI” we’ve built so far has amplified traits of humanity that we consider flaws, as well as those we consider virtues. Do we expect that this would magically stop being the case if it passed a certain threshold?
And doesn’t alignment, in the most general terms, get harder when it’s applied to “smarter” entities? If that’s the case, then it seems like the “less smart” entities of human leaders would be a perfect place to test strategies we think will generalize to “smarter” entities. Conversely, if we can’t apply alignment findings to humans because alignment gets “easier” / more tractable when applied to “smarter” entities, doesn’t that suggest a degenerate case of minimum alignment difficulty for a maximally “smart” AI?
The AI problem is easier in some ways (and significantly harder in others) because we’re not taking an existing system and trying to align it. We want to design the system (and/or systems that produce that system, aka optimization) to be aligned in the first place. This can be done through formal work to provide guarantees, lots of code, and lots of testing.
However, doing that for some arbitrary agent, or even just a human, isn’t really a focus of most alignment research. A human has the issue that they’re already misaligned (in a sense), and there are various technological/ethical/social issues with either retraining them or performing the modifications to get them aligned. If the ideas that people had for alignment were about ‘converting’ a misaligned intelligence to an aligned one, then humans could maybe be a test case, but that isn’t really the focus. We are also only ‘slowly’ advancing our ability to understand the body and how the brain works. While we have some of the same issues with neural networks, working with them is a lot cheaper and less ethically fraught, we can rerun them (for non-dangerous networks), and so on.
Though there has been talk of things like incentives, moral mazes, inadequate equilibria, and more, which are somewhat related to the alignment/misalignment of humans and where they can do better.
Thank you for clarifying! This highlights an assumption about AI so fundamental that I wasn’t previously fully aware that I had it. As you say, there’s a big difference between what to do if we discover AI, vs if we create it. While I think that we as a species are likely to create something that meets our definition of strong AI sooner or later, I consider it vanishingly unlikely that any specific individual or group who goes out trying to create it will actually succeed. So for most of us, especially myself, I figure that on an individual level it’ll be much more like discovering an AI that somebody else created (possibly by accident) than actually creating the thing.
It’s intuitively obvious why alignment work on creating AI doesn’t apply to extant systems. But if the best that the people who care most about it can do is work on created AI, without yet applying any breakthroughs to the prospect of a discovered AI (where we can’t count on knowing how it works, or on being able to ethically create and then destroy a bunch of instances of it, etc.)… I think I am beginning to see where we get the meme of how one begins to think hard about these topics and shortly afterward spends a while being extremely frightened.
Yet every “AI” we’ve built so far has amplified traits of humanity that we consider flaws, as well as those we consider virtues. Do we expect that this would magically stop being the case if it passed a certain threshold?
Ah, what? (I’m reacting to the “every” qualifier here.)
I’d say it comes down to founder effects.
I wouldn’t necessarily call it ‘using AI alignment solutions for human alignment’ though.
Perhaps a better starting point would be: how to discern alignment. And, are there predictable betrayals? Can that situation be improved?
human leaders
That wasn’t the first place I thought of.
How do you tell if a source is trustworthy? (Of information, or a physical good.)
How do you tell if it’s a good idea for someone to join your team?
Overall, human alignment sounds broad, and interesting.
There’s also some stuff about open source, and some questions that seem relevant. Less specifically, I read on Twitter that:
Elon Musk wants to release the Twitter algorithm, and for it to develop encrypted chat or something. (I read the tweet.)
I think the person in charge of Mastodon (which is already open source) said something about working on encrypted chat as well. (I read the blog post.)
Somehow I feel it’s more likely that Mastodon will end up achieving both conditions (being open source and having encrypted chat) than Twitter will.
Two announcements, but one doesn’t inspire much confidence. (How often does a project that isn’t open source go open source? Not partially, but fully. I see this as a somewhat general issue (the probability of open-sourcing), not just one of specific context, or of ‘the laws of probability say p(a and b) ≤ p(a) and p(a and b) ≤ p(b) (if a and b are distinct events), and here p(a) and p(a′) are reasonably similar’.)
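To make that last parenthetical concrete, here’s a minimal sketch of the conjunction argument. All the numbers below are hypothetical and purely illustrative; the only load-bearing facts are that a conjunction can never be more likely than either conjunct, and that Twitter needs two things to happen where Mastodon needs only one.

```python
# Hypothetical, illustrative probabilities only -- not estimates of the real situation.
p_twitter_fully_open = 0.2   # b:  Twitter goes fully (not partially) open source
p_twitter_e2e_chat   = 0.5   # a:  Twitter ships encrypted chat
p_mastodon_e2e_chat  = 0.5   # a': Mastodon ships encrypted chat (it is already open source)

# Treating Twitter's two events as independent, just to keep the arithmetic simple.
p_twitter_both  = p_twitter_fully_open * p_twitter_e2e_chat  # p(a and b) = 0.10
p_mastodon_both = p_mastodon_e2e_chat                        # only one condition left = 0.50

# Laws of probability: a conjunction is never more likely than either conjunct.
assert p_twitter_both <= p_twitter_fully_open
assert p_twitter_both <= p_twitter_e2e_chat

print(f"p(Twitter achieves both)  = {p_twitter_both:.2f}")
print(f"p(Mastodon achieves both) = {p_mastodon_both:.2f}")
```

With p(a) and p(a′) roughly similar, the gap comes almost entirely from how often a closed project actually goes fully open source.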