Fair point, I should’ve mentioned alignment by default. That said, even the original post introducing it considers it ~10% likely at best.
I think Wentworth is too pessimistic:
Wentworth’s scheme for alignment by default is not the only route to it
We might get partial alignment by default and strengthen it
There are two approaches to solving alignment:
1. Targeting AI systems at values we’d be “happy” (were we fully informed) for powerful systems to optimise for [AKA intent alignment] [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that are not necessarily robustly intent aligned [Corrigibility, impact regularisation, boxing, myopia, non-agentic systems, mild optimisation, etc.]
We might solve alignment by applying the techniques of 2 to a system that is already somewhat aligned. Such an approach becomes more viable if we get partial alignment by default.
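To make the safeguarding idea a bit more concrete, here is a minimal, purely illustrative sketch (my example, not anything from Wentworth or the post above) of one safeguard from the list, mild optimisation via quantilization: instead of taking the argmax action under a possibly mis-specified utility estimate, the system samples from the top q-fraction of candidate actions, which caps how hard the proxy gets optimised.

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    # Rank candidate actions by the (possibly mis-specified) proxy utility.
    ranked = sorted(actions, key=utility, reverse=True)
    # Sample uniformly from the top q-fraction rather than taking the argmax,
    # limiting how far the system pushes into regions where the proxy breaks down.
    top_k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:top_k])

# Toy usage: 100 candidate actions scored by a proxy utility.
actions = list(range(100))
proxy_utility = lambda a: a  # pretend the proxy is only trustworthy for moderate scores
print(quantilize(actions, proxy_utility, q=0.05))
```

The point of the sketch is only that safeguards of this shape do not require the underlying system to be robustly intent aligned; they just limit how hard it optimises.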
More concretely, I currently actually believe (not just pretend to believe) that:
Self-supervised learning on human-generated/curated data will get to AGI first (see the toy sketch below)
Systems trained in such a way may be very powerful while still being reasonably safe from misalignment risks (especially when enhanced with safeguarding techniques), without us mastering intent alignment/being able to target arbitrary AI systems at arbitrary goals
I really do not think this is some edge case, but a way the world can be with significant probability mass.
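To unpack the first point above, here is a toy, purely illustrative sketch of what “self-supervised learning on human-generated data” means: the only training signal is the human-written text itself, with the model learning to predict what comes next. (Real systems do this with large transformers over enormous corpora; this shows the supervision structure, not an implementation.)

```python
from collections import Counter, defaultdict

def train_next_char_model(text):
    # "Self-supervision": the labels are just the next characters of the
    # human-written text itself -- no separate reward signal or annotation.
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    # Predict the most frequent continuation seen for each character.
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}

corpus = "humans wrote this text and the model simply learns to predict what comes next"
model = train_next_char_model(corpus)
print(model["t"])  # the most common character following 't' in the toy corpus
```

Because everything such a model learns comes from imitating human-generated data, the argument above is that its learned objectives may land reasonably close to human-compatible ones by default, which is exactly the kind of partial alignment the safeguarding techniques could then shore up.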