Augmenting humans to do better alignment research seems like a pretty different proposal from building artificial alignment researchers.
The former is about making (presumed-aligned) humans more intelligent, which is a biology problem, while the latter is about making (presumed-intelligent) AIs aligned, which is a computer science problem.
I think my crux is that if we assume humans can be scaled up in intelligence without becoming misaligned, then it becomes much easier to argue that we’d be able to align AI without having to go through that process, for the reason sketched out by jdp:
I think the crux is an epistemological question that goes something like: “How much can we trust complex systems that can’t be statically analyzed in a reductionistic way?” The answer you give in this post is “way less than what’s necessary to trust a superintelligence”. Before we get into any object level about whether that’s right or not, it should be noted that this same answer would apply to actual biological intelligence enhancement and uploading in actual practice. There is no way you would be comfortable with 300+ IQ humans walking around with normal status drives and animal instincts if you’re shivering cold at the idea of machines smarter than people.
I think you have the wrong model of the process, which comes from conflating outcome-alignment and intent-alignment.
Current LLMs are outcome-aligned, i.e., they produce “good” outputs. But in the pessimist model, the internal mechanisms of an LLM that produce “good” outputs have nothing in common with “being nice” or “caring about humans”. They are more like “producing weird text patterns”, and if we make LLMs sufficiently smarter, they will turn the world into text patterns or do something else unpredictable. I.e., it’s not that the control structures of LLMs are nice right now and will stop being nice when we make LLMs smarter; they simply aren’t about “being nice” in the first place.
On the other hand, humans are at least somewhat intent-aligned, and as long as we don’t use really radical rearrangements of brain matter, we can expect them to stay intent-aligned.
Source of the jdp comment quoted above: https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/?commentId=7iBb7aF4ctfjLH6AC