If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action would cause a major harm (like the death of humans), then the AI should not choose that action. Indeed, since the correct hypothesis H* should itself be among the plausible ones, if H* predicts harm then some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that the action yields harm, and the AI can safely execute it.
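For concreteness, here is a minimal sketch of that decision rule as I understand it; the plausibility threshold, the posterior dictionary, and the harm predicate are all hypothetical stand-ins for illustration, not anything specified in the proposal itself:

```python
# Minimal sketch of the "veto any action that some plausible hypothesis says is
# harmful" rule. Everything here is a stand-in: `posterior` plays the role of
# P(H | D), `predicts_major_harm` plays the role of each hypothesis H's
# prediction, and the plausibility threshold is an arbitrary illustrative number.

PLAUSIBILITY_THRESHOLD = 0.01  # H counts as "plausible" if P(H | D) >= this


def is_action_safe(action, posterior, predicts_major_harm):
    """True only if no plausible hypothesis predicts the action causes major harm.

    posterior           : dict mapping hypothesis -> P(H | D)
    predicts_major_harm : function (hypothesis, action) -> bool
    """
    plausible = [h for h, p in posterior.items() if p >= PLAUSIBILITY_THRESHOLD]
    return not any(predicts_major_harm(h, action) for h in plausible)


def choose_action(candidate_actions, posterior, predicts_major_harm):
    """Return the first action that passes the safety check, or None (i.e. do nothing)."""
    for action in candidate_actions:
        if is_action_safe(action, posterior, predicts_major_harm):
            return action
    return None  # no sufficiently safe action found
```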
This idea seems to ignore the problem that the null action can also entail harm. In a trolley problem, this AI would never be able to pull the lever.
Maybe you could get around this by saying that it compares the overall wellbeing of the world with and without its intervention. But even in that case, if it had any uncertainty about which option led to more harm, it would be systematically biased toward inaction, even when the expected harm of acting was clearly lower.
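To put numbers on that bias, here’s a toy trolley-style sketch (all numbers are made up for illustration):

```python
# Toy trolley-style numbers (entirely made up) illustrating the inaction bias.
# Two plausible hypotheses about what pulling the lever does:
#   H1 (P = 0.9): doing nothing kills 5; pulling diverts the trolley and kills 1.
#   H2 (P = 0.1): the 5 would have survived anyway; pulling still kills 1.
posterior = {"H1": 0.9, "H2": 0.1}
harm = {  # harm[hypothesis][action] = predicted number of deaths
    "H1": {"pull": 1, "do_nothing": 5},
    "H2": {"pull": 1, "do_nothing": 0},
}

actions = ("pull", "do_nothing")

# Expected harm clearly favours pulling the lever:
expected = {a: sum(posterior[h] * harm[h][a] for h in posterior) for a in actions}
print(expected)  # {'pull': 1.0, 'do_nothing': 4.5}

# But under the "no plausible hypothesis may predict major harm" rule, pulling
# is vetoed, because every plausible hypothesis says it kills someone. Doing
# nothing is predicted to be harmful too (under H1), which is exactly the
# null-action problem: if the rule only screens the AI's own interventions,
# the AI defaults to inaction, even though inaction is far worse in expectation.
vetoed = {a: any(harm[h][a] > 0 for h in posterior) for a in actions}
print(vetoed)  # {'pull': True, 'do_nothing': True}
```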
[mostly self-plagiarized from here] If you have a very powerful AI, but it’s designed such that you can’t put it in charge of a burning airplane hurtling towards the ground, that’s … fine, right? I think it’s OK to have first-generation AGIs that can sometimes get “paralyzed by indecision”, and which are thus not suited to solving crises where every second counts. Such an AGI could still do important work like inventing new technology, and in particular designing better and safer second-generation AGIs.
You only really get a problem if your AI finds that there is no sufficiently safe way to act, and so it doesn’t do anything at all. (Or more broadly, if it doesn’t do anything very useful.) Even that’s not dangerous in itself… but the likely next step is that the programmer dials the “conservatism” knob lower and lower, until the AI starts doing useful things. Maybe the programmer says to themselves: “Well, we don’t have a perfect proof, but all the likely hypotheses predict there’s probably no major harm…”
Also, humans tend to treat “a bad thing happened (which had nothing to do with me)” as much less bad than “a bad thing happened (and it’s my fault)”. I think that if it’s possible to make AIs with the same inclination, then it’s probably a good idea to do so, at least until we get up to super-reliable 12th-generation AGIs or whatever. It’s dangerous to make AIs that notice injustice on the other side of the world and are immediately motivated to fix it; that kind of AI would be very difficult to keep under human control, if human control is the plan (as it seems to be here).
Sorry if I’m misunderstanding.
Yup, I think I agree. However, I could see this going wrong in some kind of slow-takeoff world where the AI is already in charge of many things.