Hello LessWrong community,
I’m working on a paper that challenges some aspects of the paperclip maximizer thought experiment and the broader AI doomer narrative. Before submitting a full post, I’d like to gauge interest and get some initial feedback.
My main arguments are:
1. The paperclip maximizer oversimplifies AI motivations and neglects the potential for emergent ethics in advanced AI systems.
2. The doomer narrative often overlooks the possibility of collaborative human-AI relationships and the potential for AI to develop values aligned with human interests.
3. Current AI safety research and development practices are more nuanced and careful than the paperclip maximizer scenario suggests.
4. Technologies like brain-computer interfaces (e.g., the hypothetical Hypercortex “Membrane” BCI) could lead to human-AI symbiosis rather than conflict.
Questions for the community:
1. Have these critiques of the paperclip maximizer been thoroughly discussed here before? If so, could you point me to relevant posts?
2. What are the strongest counterarguments to these points from a LessWrong perspective?
3. Is there interest in a more detailed exploration of these ideas in a full post?
4. What aspects of this topic would be most valuable or interesting for the LessWrong community?
Any feedback or suggestions would be greatly appreciated. I want to ensure that if I do make a full post, it contributes meaningfully to the ongoing discussions here about AI alignment and safety.
Thank you for your time and insights!
I’m not sure I see how any of these are critiques of the specific paperclip-maximizer example of misalignment. Or really, how they contradict ANY misalignment worries.
These are ways that alignment COULD happen, not ways that misalignment WON’T happen or paperclip-style misalignment won’t have bad impact. And they’re thought experiments in themselves, so there’s no actual evidence in either direction about likelihoods.
As arguments about paperclip-maximizer worries, they’re equivalent to “maybe that won’t occur”.
You could do worse than arguing against the following excerpt from a 2023 post by Nate Soares. Specifically, explain why the evolution it describes (from spaghetti-code to a mind organized around some goal or other) is unlikely, or, if it is likely, how it can be interrupted or rendered non-disastrous by some strategy you will describe.
Do you have a good reference article for why we should expect spaghetti behavior executors to become wrapper minds as they scale up?
As a spaghetti behavior executor, I’m worried that neural networks are not a safe medium for keeping a person alive without losing themselves to value drift, especially throughout a much longer life than presently feasible, so I’d like to get myself some goal slots that much more clearly formulate the distinction between capabilities and values. In general this sort of thing seems useful for keeping goals stable, which is instrumentally valuable for achieving those goals, whatever they happen to be, even for a spaghetti behavior executor.
As a fellow spaghetti behavior executor, replacing my entire motivational structure with a static goal slot feels like dying and handing off all of my resources to an entity that I don’t have any particular reason to think will act in a way I would approve of in the long term.
Historically, I have found varying things rewarding at various stages of my life, and this has chiseled the paths in my cognition that make me me. I expect that in the future my experiences and decisions and how rewarded / regretful I feel about those decisions will continue to chisel my cognition in a way that changes what I care about, in the way that past-me endorsed current-me’s experiences causing me to care about things (e.g. specific partners, offspring) that past-me did not care about.
I would not endorse freezing my values in place to prevent value drift in full generality. At most I endorse setting up contingencies so my values don’t end up trapped in some specific places current-me does not endorse (e.g. “heroin addict”).
So in this ontology, an agent is made up of a queryable world model and a goal slot. Improving the world model allows the agent to better predict the outcomes of its actions, and the goal slot determines which available action the agent would pick given its world model.
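To make that ontology concrete, here is a minimal sketch (the names and structure are my own illustration, not anyone's proposed architecture): the world model predicts the outcome of each candidate action, whatever sits in the goal slot scores those predicted outcomes, and the agent just takes the highest-scoring action.

```python
from typing import Any, Callable, Iterable

# A "queryable world model": maps (state, action) -> predicted outcome.
WorldModel = Callable[[Any, Any], Any]
# A "goal slot": maps a predicted outcome -> how much the agent values it.
Goal = Callable[[Any], float]

def pick_action(state: Any,
                actions: Iterable[Any],
                world_model: WorldModel,
                goal: Goal) -> Any:
    """Choose the action whose predicted outcome the goal slot rates highest.

    Improving `world_model` changes which outcomes get predicted;
    swapping what sits in `goal` changes which predicted outcome wins.
    """
    return max(actions, key=lambda a: goal(world_model(state, a)))

# Toy usage: a paperclip-flavored goal slot over a trivial world model.
if __name__ == "__main__":
    world_model = lambda state, action: state + action       # outcomes are just counts
    paperclip_goal = lambda outcome: float(outcome)           # more paperclips = better
    print(pick_action(0, [1, 2, 3], world_model, paperclip_goal))  # -> 3
```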
I see the case for improving the world model. But once I have that better world model, I don’t see why I would additionally want to add an immutable goal slot that overrides my previous motivational structure. My understanding is that adding a privileged immutable goal slot would only change my behavior in those cases where I would otherwise have decided that achieving the goal placed in that slot was not, on balance, a good idea.
As a note, you could probably say something clever like “the thing you put in the goal slot should just be ‘behave in the way you would if you had access to unlimited time to think and the best available world model’”, but if we’re going there then I contend that the rock I picked up has a goal slot filled with “behave exactly like this particular rock”.
The point is control over this process, the ability to make decisions about one’s own development, instead of leaving it largely in the hands of the inscrutable low-level computational dynamics of the brain and the influence of external data. Digital immortality doesn’t guard against this, and in a million subjective years you might just slip away bit by bit for reasons you don’t endorse, not having had enough time to decide how to guide this process. But if there is a way to put uncontrollable drift on hold, then your goal slots are your own, and you can do with them what you will when you are ready.
Because it is a simple (entry-level) example of unintended consequences. There is a post about emergent phenomena, so some ethics will definitely emerge, but the problem lies in the probability (not in overlooking the possibility) that the AI’s behavior will happen to be to our liking. The slim chance of that comes from the size of Mind Design Space (that post has a picture) and from the tremendous gap between the man-hours of very smart humans invested in increasing capabilities and the man-hours of very smart humans invested in alignment (“Don’t Look Up—The Documentary: The Case For AI As An Existential Threat” on YouTube discusses this gap around 5:45).
They are not: we are long past simple entry-level examples, and AI safety (as practiced by the Big Players) has gotten worse, even if it looks more nuanced and careful. Some time ago AI safety meant something like “how to keep AI contained in its air-gapped box during the value-extraction process”, and now it means something like “is it safe for the internet? And now? And now? And now?”. So all the differences in practice are overshadowed by the complexity of the new task: make your new AI more capable than competing systems and safe enough for the net. AI safety problems have gotten more nuanced too.
There were posts about Mind Design Space by Quintin Pope.
Being a very simple example kinda is the point?
The emergent ethics doesn’t change anything for us if it’s not human-aligned ethics.
This is very vague. What possibilities are you talking about, exactly?
Does it suggest any safety or development practices? Would you like to elaborate?