Max Harms comments on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms 22 Jul 2024 16:37 UTC
If I’m hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.
How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?
- Rubi J. Hudson 24 Jul 2024 7:58 UTC
  I don’t think we have the right tools to make an AI take actions that are low impact and reversible, but if we can develop them, the plan as I see it would be to implement those properties to avoid manipulation in the short term, and to use that time to go from a corrigible AI to a fully aligned one.