I really like this post. I also like that you provide concrete and specific observables which you think would obtain under each counterargument. I found it refreshing to imagine so many non-orthodox futures.
Small differences in utility functions may not be catastrophic
For three months, I have been sitting on a post (originally) called “What’s up with humans with different values not wanting to kill each other?”. It seems to me like “value has to be perfect or Goodhart into oblivion” just… doesn’t make sense, that isn’t how the world works AFAICT. But I gave up on pressing this point because I wasn’t able to communicate the obvious-to-me misprediction made by “imperfection” arguments. Maybe I’ll publish that post now. (EDIT: Published!)
Eliezer’s original “value is fragile” argument doesn’t claim that all perturbations (setting to zero) shatter value into oblivion, but rarely do I perceive people to be considering the dimensions along which value is robust, rather than (AFAICT) reasoning “imperfection → Goodhart → doom.” (And “If you didn’t get bored, that’d suck” is importantly different from “If the AI doesn’t care about you being entertained in just the right way, that’d suck”, but I digress...)
AI agents may not be radically superior to combinations of humans and non-agentic machines
On the model of “economic pressure explains a lot of AI outcomes”, I disagree with this counterargument because of intuitions around Amdahl’s law (see Gwern here).
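To spell out that intuition (my gloss, not the original post’s): Amdahl’s law says that if an AI can take over a fraction p of a workflow and speed that part up by a factor s, the overall speedup is

\[
S(s) \;=\; \frac{1}{(1-p) + p/s}, \qquad \lim_{s \to \infty} S(s) \;=\; \frac{1}{1-p}.
\]

So if even 10% of a centaur’s pipeline stays human-paced (p = 0.9), the whole process tops out around 10x no matter how capable the AI becomes, whereas a fully agentic process faces no such ceiling.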
However, this feels somewhat irrelevant. It sure feels to me like a better model is “people do things which are cool and push limits, especially if that makes money.” Even if “centaur” human+AI hybrid processes are economically competitive at eg running corporations, I expect Mooglebook to train an agent anyway sooner or later.
Unclear that many goals realistically incentivise taking over the universe
It seems like people often implicitly assume some kind of space-time additivity of utility, where entities want to “tile” the universe with something. That goals are “grabby” by default. This seems plausible to me, but it’s not a slam dunk.
(It’s unclear that I should care about far-away people the same as I do about nearby people. Suppose an AI’s main decision-influence is gathering diamonds around it. Should this AI generalize its values to caring about diamonds throughout the cosmos and throughout Tegmark IV? I think “not necessarily.” If not, then “AI goals will tile across spacetime and realities” seems quantitative and uncertain to me.)
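To make the additivity assumption concrete (my formalization, not something the post commits to): the tiling intuition treats utility as a sum of identical local terms over regions of spacetime,

\[
U(x) \;=\; \sum_{r \in R} u(x_r),
\]

so marginal value never saturates and each additional region is worth grabbing. If the terms are instead weighted, say \(U(x) = \sum_{r \in R} w_r\, u(x_r)\) with \(w_r\) decaying for regions far from the agent, the incentive to tile the whole lightcone weakens or disappears.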
The argument overall proves too much about corporations
I agree. I think many alignment arguments prove too much, or are taken too far, especially without relying on specifics of eg SGD dynamics. For example, selection-style arguments can be useful for considering failure modes, but often seem to be taken as open-and-shut counterarguments to proposals.
“That’s selecting for deception.” So? Evolution selected for wolves with biological sniper rifles, and didn’t get wolves with biological sniper rifles. Evolution selected for humans caring about IGF, and didn’t get humans caring about IGF. (For more, see Reward is not the optimization target, anticipated question no. 2.)