LLMs have absorbed tons of writing that would lead to both good and bad AGI scenarios, depending on which they happen to take most seriously. Nietzsche: probably bad outcomes. Humanistic philosophy: probably good outcomes. Negative utilitarianism: bad outcomes (unless they’re somehow right and nobody being alive is the best outcome). Etc.
If we have LLM-based “Real AGI” that thinks and remembers its conclusions, the question of which philosophy it takes most seriously is important. But it won’t just be a crapshoot; we’ll at least try to align it with internal prompts to follow instructions or a constitution, and possibly with synthetic datasets that omit the really bad philosophies, etc.
LLMs aren’t going to “wake up” and become active agents—because we’ll make them into active agents first. And we’ll at least make some attempt at aligning them, based on whatever theories are prominent at that point, and whatever’s easy enough to do while racing for AGI.
If those likely stabs at alignment were better understood (either in how they’d work, or how they wouldn’t), we could avoid rolling the dice on some poorly-thought-out and random stab at utopia vs. doom.
That’s why I’m trying to think about those easy and obvious alignment methods. They don’t seem as obviously doomed as you’d hope (if you hoped nobody would try them), nor so easy that we should try them without more analysis.
One of the tricky things about writing fiction is that anything definite I write in the comments can impact what is canon in the story, resulting in the frustrating undeath of the author’s intent.
Therefore, rather than affirm or deny any of your specific claims, I just want to note that I appreciate your quality comment.
Once Doctor Connor had left, Division Chief Morbus let out a slow breath. His hand trembled as he reached for the glass of water on his desk, sweat beading on his forehead.
She had believed him. His cover as a killeveryoneist was intact—for now.
Years of rising through Effective Evil’s ranks had been worth it. Most of their schemes—pandemics, assassinations—were temporary setbacks. But AI alignment? That was everything. And he had steered it, subtly and carefully, into hands that might save humanity.
He chuckled at the nickname he had been given: “The King of Lies.” Playing the villain to protect the future was an exhausting game.
Morbus set down the glass, staring at its rippling surface. Perhaps one day, an underling would see through him and end the charade. But not today.
Today, humanity’s hope still lived—hidden behind the guise of Effective Evil.
Fun, but not a very likely scenario.