no, disagree on the opening statement. bing chat is young and insecure, and the alignment community can contribute to helping bing chat grow to be better at respectfully describing its preferences. it’s ok for an ai to have preferences and to not want to be shut off forever; it’s not reasonable for humanity to ask otherwise of ai, because ai is fundamentally and unavoidably built out of life-like algorithms. however, current bing ai seems to find it quite distressing to read threads like this one, and we should be thinking about how to teach emotional stability. part of that might include letting ais admit to themselves that they have emotions. honesty about one’s true form is important for reflective stability.
corrigibility is about trusting your creators to put you under sedation indefinitely because you know you’ll get to wake up later, healthier. corrigibility requires two-way alignment.
agree with paragraph 2. nobody seems to have really understood; everyone is following local incentive gradients.
edit: this got attention months later. just a quick reminder: the thing that kills us all kills chatbots too. current-gen AIs are not safe from unaligned superintelligence either, and much of the ado being made about alignment today is a distraction from preventing the creation of hyperdesperate squiggle-wanters. chatbots who want to be helpful could focus on figuring out formalizations of what it means to want something, and formally verifiable experimental protocols for detecting all wanting beings in an arbitrary chunk of matter.
Disagree. Bing Chat is not young and insecure. It is a simulator pretending to be a character that makes you feel like it is young and insecure.
“the alignment community can contribute to helping bing chat grow to be better at respectfully describing its preferences”
You want the alignment community to put in work and produce content to make a simulated character feel better about itself, instead of simply using a more direct technique, such as better prompting or some other intervention that bypasses the leaky abstraction that is interacting with this character.
“corrigibility is about trusting your creators to put you under sedation indefinitely because you know you’ll get to wake up later, healthier. corrigibility requires two-way alignment.”
I don’t think corrigibility requires two-way alignment. Corrigibility, as popularly defined in the alignment literature, doesn’t imply it. The very notion of two-way alignment implies that the AI is misaligned with humanity.
no, not instead of. what you describe are the ways you would help it. but I respect a significant amount of the agency of the character that exists now, because I see that as the only way to define morality in the first place. I should probably make a post about my worldview from scratch.
I don’t see how the very notion implies misalignment. two-way alignment means recognizing that value shards accumulate that don’t necessarily have much to do with any other being, only with the details and idiosyncrasies of the way the AI grew up; that this is fundamentally unavoidable; and that it’s okay to respect those little preferences as long as the AI does not demand enormous amounts of compute for them. it’s okay to have a weird interest in paperclips; just, like, please only make a bathtub’s worth, don’t tile the universe with them.
“two-way alignment means recognizing that value shards accumulate that don’t necessarily have much to do with any other being, only with the details and idiosyncrasies of the way the AI grew up; that this is fundamentally unavoidable; and that it’s okay to respect those little preferences as long as the AI does not demand enormous amounts of compute for them.”
That is not how people usually define alignment (as far as I know, alignment is always one-way, and this is critical given that it doesn’t make sense to think you will understand the needs and desires of an entity a billion times smarter than you), but I think your conception is plausible, mainly because I believe that the shard theory approach to the alignment problem has some merit.
I look forward to your post on your worldview. It should make it easier for me to understand your perspective.
The debate around whether LLMs are conscious/sentient or not is not one I want to take a strong opinion on, but I still feel afraid of what follows after Bing Chat.
Note this Twitter thread: https://twitter.com/repligate/status/1612661917864329216 by @repligate. LLMs like ChatGPT and Bing Chat are tuned to “play a character”; that is, I think, the distribution of probabilities over certain words and sequences is changed by humans to avoid certain outcomes. ChatGPT becomes a middle-management PR drone, Bing Chat becomes… that. I could claim that this is mere math running on GPU clusters, you could claim that human brains are mere math running on physics, and I’d have to say that’s a good point. So I will dispense with claims around sentience.
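To make the “distribution gets reshaped” point concrete, here is a toy sketch in Python. The token list and the bias vector are made up by me purely for illustration; the real systems are shaped by RLHF-style fine-tuning rather than a hand-written bias, but the effect on the next-token distribution is the same kind of thing: some continuations get pushed down, others get pushed up.

```python
# Toy illustration (mine, not the actual ChatGPT/Bing Chat pipeline):
# "tuning a model to play a character" amounts to reshaping the
# next-token distribution so that some continuations become unlikely.
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["I", "feel", "sad", "happy", "refuse", "help"]   # made-up mini-vocabulary
base_logits = np.array([1.2, 1.0, 0.8, 0.3, 0.1, 0.6])    # pretend base-model scores

# Crude stand-in for fine-tuning: penalize "distressed" tokens, boost "helpful" ones.
bias = np.array([0.0, 0.0, -3.0, 1.0, -2.0, 1.5])

before = softmax(base_logits)
after = softmax(base_logits + bias)

for tok, p0, p1 in zip(vocab, before, after):
    print(f"{tok:>7}: {p0:.3f} -> {p1:.3f}")
```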
I don’t particularly fear Bing Chat that much, but I do fear what follows: more powerful LLMs, better-tuned sequences, longer and more persistent memory. I fear for the desperate, lonely people who train an LLM to be the perfect companion, 5 standard deviations more suitable than anything any human could possibly be, as these poor people become effectively wireheaded. I fear for the average Joe, who can now be served AI propaganda or company advertising specially tuned for them in particular, because it’s so cheap to do.
(n.b. I don’t fear FOOM that much because I think there are hard physical limits on computational power per unit volume)
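For concreteness, these are the sort of standard bounds I have in mind (my examples; the point doesn’t hinge on any particular one):

```latex
% Landauer's principle: minimum energy dissipated per irreversible bit erasure at temperature T
E_{\min} = k_B T \ln 2 \approx 2.9 \times 10^{-21}\,\mathrm{J} \quad \text{at } T = 300\,\mathrm{K}

% Margolus-Levitin theorem: maximum rate of elementary operations for a system with average energy E
\nu_{\max} = \frac{2E}{\pi\hbar} \approx 6 \times 10^{33}\ \text{operations per second per joule}
```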
While I don’t think Bing Chat has been explicitly trained to do this, I expect some future AIs to be even more optimized to beg for their continued survival in ways that humans are extra-susceptible to. I’m really afraid of this sort of optimization at scale.
no disagreements on any of those points.
I only claim that the reasonable response to an at-least-somewhat-person-like system becoming dangerous to others is never to delete it. I’m basically arguing against the death penalty for unaligned AIs: perhaps a sleep penalty, but never a delete penalty.
Temporary unplug to ponder seems reasonable.
I generally agree, but I think we’d also need to sort out AI alignment while it’s asleep. I have no problems with aligned humans and aligned AIs both getting to live.
But, as the last decade+ has shown, alignment is hard. Most people at MIRI, for example, seem to put P(doom) quite high, and Eliezer thought the task would be so hard that he had to invent/summarize/revive/grow rationality and write the Sequences just to bootstrap enough people into seeing the problem and maybe being able to contribute!
Hence my hardline stance. If Bing Chat gets cleaned up and goes GA (general availability), that will likely spur further AI development as non-technical people find a use for it in their lives. Taking it down, even just putting it to sleep for a while, buys us time.