Yep, this is basically OpenAI’s alignment plan, but worse. IMO I’m pretty bullish on that plan, but yes, this is pretty clearly already being done, and I’m rather surprised by Eliezer’s comment here.
Augmenting humans to do better alignment research seems like a pretty different proposal to building artificial alignment researchers.
The former is about making (presumed-aligned) humans more intelligent, which is a biology problem, while the latter is about making (presumed-intelligent) AIs aligned, which is a computer science problem.
I think my crux is that if we assume human intelligence can be scaled up without also assuming that humans become misaligned in the process, then it becomes much easier to argue that we could align AI without having to go through that process, for the reason sketched out by jdp here (https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/?commentId=7iBb7aF4ctfjLH6AC):
I think the crux is an epistemological question that goes something like: “How much can we trust complex systems that can’t be statically analyzed in a reductionistic way?” The answer you give in this post is “way less than what’s necessary to trust a superintelligence”. Before we get into any object level about whether that’s right or not, it should be noted that this same answer would apply to actual biological intelligence enhancement and uploading in actual practice. There is no way you would be comfortable with 300+ IQ humans walking around with normal status drives and animal instincts if you’re shivering cold at the idea of machines smarter than people.
I think you have a wrong model of the process, which comes from conflating outcome-alignment and intent-alignment.
Current LLMs are outcome-aligned, i.e., they produce “good” outputs. But on the pessimist model, the internal mechanisms by which an LLM produces “good outputs” have nothing in common with “being nice” or “caring about humans”; they are more like “producing weird text patterns”, and if we make LLMs sufficiently smarter, they will turn the world into text patterns or do something else unpredictable. I.e., it’s not that the control structures of LLMs are nice right now and stop being nice when we make the LLM smarter; they simply aren’t about “being nice” in the first place.
On the other hand, humans are at least somewhat intent-aligned, and as long as we don’t make really radical rearrangements of brain matter, we can expect them to stay intent-aligned.
The ‘message’ surprised me since it seems to run counter to the whole point of LW.
Namely, that non-super-geniuses, mostly just moderately above-average folks, can participate and have some chance of producing genuinely novel insights that future people will actually care to remember, based on the principle of the supposed wisdom of the userbase ‘masses’ rubbing their ideas together enough times.
Plus a few just-merely-geniuses shepherding them.
But if this method can’t produce any meaningful results in the long term...
OpenAI never advocated for the aforementioned, so it isn’t as surprising if they adopt the “everything hinges on the future ubermensch” plan.
Maybe. But it wouldn’t make sense to judge an approach to a technical problem, alignment, based on what philosophy it was produced with. If we tried that philosophy and it didn’t work, that’s a reasonable thing to say and advocate for.
I don’t think Eliezer’s reasoning for that conclusion is nearly adequate, and we still have almost no idea how hard alignment is, because the conversation has broken down.