One way to solve this is to ditch the idea that the AI has to somehow guess or learn human values, and return to “do what I said” + “do no damage”.
“Damage” is easier to distill, as bad things are computationally simple: death, bodily harm, and civilization destruction are simpler to describe than the nature of life and the end state of humanity. This should be done by humans, based on existing laws and experiments.
I wrote more on this in “Dangerous value learners” and “AI Alignment Problem: ‘Human Values’ don’t Actually Exist” (the latter was written at the AI Safety Camp in Prague, which was a great experience).
I feel like the weirder things get, the more difficult it will be even for humans to make judgments about what constitutes “death, body harm, or civilization destruction.”
Death: is mind-uploading into a computer and/or a brain-in-a-vat death, or transcendence? What about a person who becomes like a prosthophile character in RimWorld, whose body (and maybe even brain) is more prosthetic enhancement than original human (kind of like Darth Vader, or the Ship of Theseus)? At what point do we say that the original person has “died”? For that matter, what counts as “alive”? Fetuses?
Body harm: even today, people disagree over whether transgender transitioning surgeries count as body harm or body enhancement. Ditto for some of the more ambitious types of plastic surgery, or “height enhancement” that involves procedures like lengthening leg bones. Ditto for synthetic hormones. Is an ASI supposed to listen to progressives or conservatives on these issues?
Civilization destruction: are we already destroying our civilization? We demolished large parts of our streetcar-oriented civilization (including entire neighborhoods, rail lines, etc.) to make way for automobile-centric civilization. Was that a good thing? Was that a net-increase in civilization? Is “wokeism” a net-increase in civilization or a destruction of “Western Civilization”? Which threatens our industrial civilization more: carbon emissions, or regulating carbon emissions?
If we define civilization as just, “we live in cities and states and have division of labor,” then we might be arbitrarily closing off certain appealing possibilities. For example, imagine a future where humans get to live in something resembling their ancestral environment (beautiful, pristine nature), which gives us all of the reward signals of that environment that we are primed to relish, except we also have self-replicating nanobots to make sure that food is always in plentiful supply for hunting/gathering, diseases/insects/animals that are dangerous to humans are either eradicated or kept carefully in check, nanobots repair human cellular machinery so that we live to be 800+ years old on average, etc. That’s a kind of “destruction of civilization” that I might even embrace! (I’d have to think about it for a while because it’s still pretty weird, but I wouldn’t rule it out automatically.)
I don’t think the self-alignment problem depends on the notion of ‘human values’. Also, I don’t think “do what I said” solves it. “Do what I said” is roughly “aligning with the output of the aggregation procedure”, and:
for most non-trivial requests, understanding what I said depends on a fairly complex model of what the words I said mean
often there will be a tension between your words; strictly interpreted, “do no damage” can mean “do nothing”, since basically anything has some risk of some damage; when you tell an LLM to be “harmless” and “helpful”, these requests point in different directions (see the toy sketch after this list)
strong learners will learn what led you to say the words anyway
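To make the tension in the second point concrete, here is a minimal toy sketch (my own illustration, not from the original comment; the actions, probabilities, and weights are invented): a literal zero-risk reading of “do no damage” leaves only the no-op, while a helpful/harmless trade-off picks useful but slightly risky actions.

```python
# Toy sketch (illustration only; actions and numbers are invented):
# a literal "do no damage" constraint versus a helpful/harmless trade-off.

actions = {
    "do_nothing":     {"benefit": 0.0, "p_damage": 0.00},
    "fetch_coffee":   {"benefit": 1.0, "p_damage": 0.01},  # might spill on someone
    "drive_to_store": {"benefit": 3.0, "p_damage": 0.05},  # small accident risk
}

def choose_strict(actions):
    """Hard rule: any nonzero chance of damage is forbidden."""
    allowed = {a: v for a, v in actions.items() if v["p_damage"] == 0.0}
    return max(allowed, key=lambda a: allowed[a]["benefit"])

def choose_tradeoff(actions, damage_weight=10.0):
    """Soft rule: helpfulness and harmlessness traded off by a chosen weight."""
    return max(actions, key=lambda a: actions[a]["benefit"]
                                      - damage_weight * actions[a]["p_damage"])

print(choose_strict(actions))    # -> do_nothing: the zero-risk reading forbids everything useful
print(choose_tradeoff(actions))  # -> drive_to_store: the trade-off allows useful but risky acts
```

The point is only that “do no damage”, read literally, dominates everything else; any workable version has to smuggle the trade-off back in through a weight that someone still has to choose.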
I see the connection between self-alignment and human values as follows: the idea of human values assumes that a human has a stable set of preferences. Stability is an important part of the idea of human values. But the human motivation system is notoriously unstable: I want to drink, I have a drink, and now I don’t want to drink. The idea of “desires” may be a better fit than “human values”, as it is normal for desires to evolve and to contradict each other.
But the human motivational system is more complex than that: I have rules and I have desires, which often contradict each other and are in dynamic balance. For example, I have a rule not to drink alcohol and a desire for a drink.
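As a toy illustration of the difference (entirely my own sketch; the “thirst” variable and the “no_alcohol” rule are hypothetical), a state-dependent desire plus a standing rule behaves quite differently from a single fixed value function:

```python
# Toy sketch (illustration only; "thirst" and the "no_alcohol" rule are hypothetical):
# a desire that depends on internal state, plus a standing rule that can oppose it.

class Person:
    def __init__(self):
        self.thirst = 1.0            # desire strength; changes as the state changes
        self.rules = {"no_alcohol"}  # standing rules; comparatively stable

    def desire_for(self, drink):
        # The rule pushes against the desire rather than simply overriding it.
        penalty = 2.0 if drink == "beer" and "no_alcohol" in self.rules else 0.0
        return self.thirst - penalty

    def drink(self):
        self.thirst = 0.0            # satiation: the desire disappears after acting on it

p = Person()
print(p.desire_for("water"))  # 1.0  -> "I want to drink"
p.drink()
print(p.desire_for("water"))  # 0.0  -> "now I don't want to drink"
print(p.desire_for("beer"))   # -2.0 -> rule vs. desire; here the rule currently wins
```

Trying to read off one static “value of drinking” from this agent’s behavior gives different answers at different moments, which is exactly the instability the comment points at.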
Speaking about your bullet points: everything depends on the situation, and there are two main types of situations: a) researchers start the first-ever AI for the first time; b) a consumer uses a home robot for a task. In the second case, the robot is likely trained on a very large dataset and knows what the good and bad outcomes are for almost all possible situations.
That could work in most cases, but there are some notable exceptions, such as having to use AI to deal damage in order to prevent even bigger damage: “burn all GPUs”, “spy on all humans so they don’t build AGI”, “research biology/AI/nanotech”, etc.