I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn’t generalize.
I disagree both with this conclusion and the process that most people use to reach it.
The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, you are not in an epistemic position to make strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.
E.g., there are no birds in the world able to lift even a single ton of weight. Despite this fact, the aerodynamic principles underlying bird flight still ended up allowing for vastly more capable flying machines. Until you understand exactly why (some) humans end up caring about each other and why (some) humans end up caring about animals, you can’t say whether a similar process can be adapted to make AIs care about humans.
The conclusion: Humans vary wildly in their degrees of alignment to each other and to less powerful agents. People often take this as a bad sign, evidence that humans aren’t “good enough” for us to draw useful insights from. I disagree, and think it’s a reason for optimism. If you sample n humans and pick the most “aligned” of the n, what you’ve done is apply log2(n) bits of optimization pressure to the underlying generators of human alignment properties.
The difference in alignment between a median human and a one-in-a-thousand most-aligned human therefore equates to only about 10 bits of optimization pressure towards alignment. If there really were no more room to scale the human alignment generators further, then humans would differ very little in their levels of alignment.
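To make the arithmetic concrete, here is a minimal sketch in Python (the thread itself contains no code) of the selection-as-bits framing. The scalar “alignment score” and its normal distribution are purely illustrative assumptions on my part, not claims made in the discussion:

```python
import math
import random

def selection_bits(n: int) -> float:
    """Bits of optimization pressure applied by picking the best of n samples."""
    return math.log2(n)

print(selection_bits(1000))  # ~9.97, i.e. roughly 10 bits for "best of 1000"

# Toy illustration only: assume a hypothetical scalar "alignment score" that is
# standard-normally distributed across people (an assumption, not a claim above).
random.seed(0)

def expected_best_of(n: int, trials: int = 1000) -> float:
    """Monte Carlo estimate of the expected maximum of n standard-normal draws."""
    return sum(max(random.gauss(0, 1) for _ in range(n)) for _ in range(trials)) / trials

print(expected_best_of(1))     # ~0.0, the median human in this toy model
print(expected_best_of(1000))  # ~3.2, the best-of-1000 human in this toy model
```

The point of the sketch is only that 10 bits is a small amount of selection; the observed spread among humans is what suggests the generators themselves are not already at a hard ceiling.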
We’re not trying to mindlessly copy the alignment properties of the median human into a superintelligent AI. We’re trying to understand the certain-to-exist generators of those alignment properties well enough that we can scale them to whatever level is needed for superintelligence (if doing so is possible).
I don’t think I’ve ever seen a truly mechanistic, play-by-play, and robust explanation of how anything works in human psychology. At least not as I would label things, but maybe you are using the labels differently; can you give an example?
“Humans are nice because they were selected to be nice”—non-mechanistic.
“Humans are nice because their contextually activated heuristics were formed by past reinforcement by reward circuits A, B, C; this convergently occurs during childhood because of experiences D, E, F; credit assignment worked appropriately at that time because their abstraction-learning had been mostly taken care of by self-supervised predictive learning, as evidenced by developmental psychology timelines in G, H, I, and also possibly natural abstractions.”—mechanistic (although I can only fill in parts of this story for now)
Although I’m not a widely-read scholar on the theories people have for human values, most (but not all) of those I have read are more like the first story than the second.
My point was that no one understands human value formation deeply enough to confidently rule out the possibility of adapting a similar process to ASI. It seems you agree with that (or at least with our lack of understanding)? Do you think our current understanding is sufficient to confidently conclude that human-adjacent or human-inspired approaches will not scale beyond human level?
I think it depends on which subprocess you consider. Some subprocesses can be ruled out as viable with less information; others require more.
And yes, without having an enumeration of all the processes, one cannot know that there isn’t some unknown process that scales more easily.