Great post (I’ve not yet read it thoroughly, or thought for long).
The first concern that springs to mind:
I expect a strong correlation between [humans are bad at [subversive strategy x]], [humans don’t tend to think of [subversive strategy x]], and [humans don’t tend to notice [subversive strategy x]].
My worry is more that we miss strategies that we’re bad at, than strategies the AI is extremely good at. It seems plausible for us to be entirely unaware of the existence of some subversive strategies—since they’ve never been human skills, and so don’t appear on our map.
It seems to me that the near-term danger is better described as [subversive strategies that don’t occur to humans], rather than [superhuman subversive strategies]. The latter will put people in mind of [strategies humans use, only better] - genius hacking/persuasion etc. I also want people considering [strategies that don’t occur to humans at all].
On a minor-but-related note, I’m not too keen on writing/thinking in terms of “problematic domains” rather than “problematic skills” / “problematic strategies”. There’s no necessity for a subversive strategy to map nicely into something I have a label for—e.g. “hacking”, “persuasion”.
If we need to rule out all problematic strategies, I want to avoid language/thinking that may stop us considering problems outside our usual categories.
(I don’t think you’re saying anything substantively imprecise—here I’m only concerned about language and blind-spots)
One last thing that caught my eye:
...but humans are probably relatively more selected (by evolution) for resisting persuasion than for being good at modern science...
This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always the case. E.g., being persuaded of x when it’s likely that the rest of your tribe will be persuaded of x may well be helpful-to-your-genes, regardless of the truth of x or of the validity of the arguments. Humans were selected to believe useful fictions.
I note also that there’s a big difference between [human x has the skills and understanding necessary to resist being persuaded of y] and [human x will in fact resist being persuaded of y].
My worry is more that we miss strategies that we’re bad at, than strategies the AI is extremely good at
A decent amount of my hope here is that we can at least get a sense for whether the best AI strategies in some reasonably similar cases look extremely bizarre and would have been totally unpredictable to humans. E.g., we might run experiments involving huge amounts of RL in a variety of somewhat analogous environments and then see whether this finds bizarre strategies.
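To make the shape of that experiment concrete, here is a minimal toy sketch (my own illustration, not something from the post): tabular RL in a handful of analogous toy environments, with a “surprise” score that just measures how far the learned policy departs from what a naive human baseline would predict. The environments, the baseline, and the metric are all hypothetical placeholders; a real version would involve far larger RL runs and a much richer notion of “bizarre”.

```python
# Toy sketch (illustrative only): run RL in several analogous toy environments
# and flag learned policies that a simple "what a human would predict" baseline
# disagrees with. All names here are hypothetical placeholders.
import random
from collections import defaultdict

def make_toy_env(seed):
    """One of several 'somewhat analogous' environments: a 10-state chain.
    Some variants secretly reward an unintuitive 'reset' action."""
    random.seed(seed)
    return {"n_states": 10, "reset_bonus": random.choice([0.0, 2.0])}

def train_policy(env, episodes=5000, eps=0.1, lr=0.1, gamma=0.9):
    """Plain tabular Q-learning; returns the greedy action for each state."""
    q = defaultdict(lambda: [0.0, 0.0])
    for _ in range(episodes):
        s = random.randrange(env["n_states"])
        for _ in range(20):
            a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: q[s][x])
            if a == 0:  # intuitive action: step right, rewarded at the end of the chain
                s2 = min(s + 1, env["n_states"] - 1)
                r = 1.0 if s2 == env["n_states"] - 1 else 0.0
            else:       # unintuitive action: reset to the start, sometimes secretly rewarded
                s2, r = 0, env["reset_bonus"]
            q[s][a] += lr * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return {s: max((0, 1), key=lambda x: q[s][x]) for s in range(env["n_states"])}

def human_baseline(state):
    """What a human would guess the agent does: always step right."""
    return 0

def surprise(policy):
    """Fraction of states where the learned policy departs from the human guess."""
    return sum(policy[s] != human_baseline(s) for s in policy) / len(policy)

for seed in range(5):
    env = make_toy_env(seed)
    print(f"env {seed}: reset_bonus={env['reset_bonus']}, surprise={surprise(train_policy(env)):.2f}")
```

In the environments where the hidden reset_bonus is nonzero, the learned policy should look “surprising” under the human baseline; in the others it shouldn’t. The real question is whether analogous experiments at scale turn up strategies that look similarly alien.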
Separately, I’m somewhat skeptical of the best AI strategies being totally different from what humans can think of, though I can see where this concern is coming from.
See here for more discussion.
On a minor-but-related note, I’m not too keen on writing/thinking in terms of “problematic domains” rather than “problematic skills” / “problematic strategies”.
Yep, we indeed need to worry about any qualitatively wildly superhuman and problematic strategy/skill. We mean “domain” in a somewhat broad sense.
Part of our argument for avoiding qualitatively wildly superhuman skills that are problematic does rest on there being only a small number of such domains, with skills within those domains being reasonably correlated (in the absence of aggressive domain-specific training). So, using the term “domain” is pretty natural from this perspective. But it seems like you might not buy this perspective, in which case it will seem like an unnatural frame.
This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always.
Agreed, but it’s worth noting this isn’t really an assumption; I’m just making a rough argument which seems moderately compelling to me. (And FWIW it still seems moderately compelling despite this objection.)
Thanks for the link.
I find all of this plausible. However, I start to worry when we need to rely on “for all” assumptions based on intuition. (Also, I worry in large part because domains are a natural way to think here; it’s when things feel natural that we forget we’re making assumptions.)
I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The ‘all’ versions are much less clear.