Rubi J. Hudson comments on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson 21 Jul 2024 22:16 UTC
I was using the list of desiderata in Section 2 of the paper, which are slightly more minimal.
However, it seems clear to me that an AI manipulating its programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.
- Max Harms 22 Jul 2024 16:37 UTC
  If I’m hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.
  How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?
  - Rubi J. Hudson 24 Jul 2024 7:58 UTC
    I don’t think we have the right tools to make an AI take actions that are low impact and reversible, but if we can develop them, the plan as I see it would be to implement those properties to avoid manipulation in the short term, and to use that time to go from a corrigible AI to a fully aligned one.