My median outcome is that people solve intent alignment well enough to avoid catastrophe. Amongst the cases where we fail, my median outcome is that people solve enough of alignment that they can avoid the most overt failures, like literally compromising sensors and killing people (at least for a long subjective time), and can build AIs that help defend them from other AIs. That problem seems radically easier—most plausible paths to corrupting sensors involve intermediate stages with hints of corruption that could be recognized by a weaker AI (and hence generate low reward). Eventually this will break down, but it seems quite late.
very confident that no AI company would implement something with this vulnerability?
The story doesn’t depend on “no AI company” implementing something that behaves badly; it depends on people having access to AI that behaves well.
Also “very confident” seems different from “most likely failure scenario.”
Haven’t you yourself written about the failure modes of ‘do things predicted to lead to videos that people rate as acceptable’, where the attack involves surreptitiously reprogramming the camera to get optimal videos (including weird engineered videos designed to optimize on infelicities in the learned objective)?
That’s a description of the problem / the behavior of the unaligned benchmark, not the most likely outcome (since I think the problem is most likely to be solved). We may have a difference in view between a distribution over outcomes that is slanted towards “everything goes well” such that the most realistic failures are the ones that are the closest calls, vs. a distribution slanted towards “everything goes badly” such that the most realistic failures are the complete and total ones where you weren’t even close.
Because it definitely seems that Vox got the impression from it that there is never a robot army takeover in the scenario, not that it’s slightly preceded by camera hacking.
I agree there is a robot takeover shortly later in objective time (mostly because of the singularity). Exactly how long it is mostly depends on how early things go off the rails w.r.t. alignment, perhaps you have O(year).
My own sense is that the intermediate scenarios are unstable: if we have fairly aligned AI, we immediately use it to make more aligned AI and collectively largely reverse things like Facebook click-maximization manipulation. If we have lost the power to reverse things, then they go all the way to near-total loss of control over the future. So I would tend to think we wind up at the extremes.
I could imagine a scenario where there is a close balance among multiple centers of AI+human power, and some but not all of those centers have local AI takeovers before the remainder solve AI alignment; you then get a world that is a patchwork of human-controlled and autonomous states, with both types automated. E.g. the United States and China are taken over by their AI systems (including robot armies), but the Japanese AI assistants and robot army remain under human control, and the future geopolitical system keeps both types of states intact thereafter.
It’d be nice to hear a response from Paul to paragraph 1. My 2 cents:
I tend to agree that we end up at the extremes eventually. You seem to say that, given somewhat aligned systems, we would immediately get to full alignment, so Paul’s first story barely plays out.
Of course, the somewhat aligned systems may aim at the wrong thing if we try to make them solve alignment. So the most plausible way it could work is if they produce solutions that we can check. But if this were the case, human supervision would be relatively easy. That’s plausible but it’s a scenario I care less about.
Additionally, if we could use somewhat aligned systems to make more aligned ones, iterated amplification probably works for alignment (narrowly defined as “trying to do what we want”). The only remaining challenge would be to create one system that’s somewhat smarter than us and somewhat aligned (in our case that’s true by assumption). The rest follows, informally speaking, by induction, as long as the AI+human system can keep improving intelligence as alignment is improved, which seems likely. That’s also plausible, but it rests on a big assumption and may not be the most important scenario / isn’t a ‘tale of doom’.
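To spell out that informal induction, here is a toy sketch in LaTeX; the notation (A_n for the n-th generation system, H for humans, and the informal scores cap and align) is introduced purely for illustration and is not from the discussion above.

```latex
% Toy sketch of the induction argument (illustrative only; "cap" and "align"
% are informal scores assumed for this sketch, not defined in the thread).

% Base case (assumed above): one system that is somewhat smarter than us
% and somewhat aligned.
\[
  \mathrm{cap}(A_0) > \mathrm{cap}(H), \qquad \mathrm{align}(A_0) \ge a .
\]

% Inductive step (the "big assumption"): humans working with A_n can build
% a successor that is no less aligned and strictly more capable.
\[
  \forall n:\quad \mathrm{align}(A_{n+1}) \ge \mathrm{align}(A_n), \qquad
  \mathrm{cap}(A_{n+1}) > \mathrm{cap}(A_n) .
\]

% Conclusion, by induction: alignment never drops below its starting level
% while capability keeps growing past the human level.
\[
  \forall n:\quad \mathrm{align}(A_n) \ge a, \qquad \mathrm{cap}(A_n) > \mathrm{cap}(H) .
\]
```

On this reading, the load-bearing premise is the inductive step: that each round of capability improvement can be carried out without any loss of alignment.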