Here’s our current best guess at how the type signature of subproblems differs from that of, e.g., an outermost objective. You know how, when you say your goal is to “buy some yoghurt”, there’s a bunch of implicit additional objectives like “don’t spend all your savings”, “don’t turn Japan into computronium”, “don’t die”, etc? Those implicit objectives are about respecting modularity; they’re a defining part of a “gap in a partial plan”. An “outermost objective” doesn’t have those implicit extra constraints, and is therefore of a fundamentally different type from subproblems.
Most of the things you think of day-to-day as “problems” are, cognitively, subproblems.
Do you have a starting point for formalizing this? It sounds like subproblems are roughly proxies that could be Goodharted if (common sense) background goals aren’t respected. Maybe a candidate starting point for formalizing subproblems, relative to an outermost objective, is “utility functions that closely match the outermost objective in a narrow domain”?
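To make the proposed formalization concrete, here is a minimal sketch (an assumed toy example, not anything from the discussion above) of a subproblem-as-proxy: a utility function that agrees with the outermost objective on a narrow domain but gets Goodharted when optimized outside it. The function names and numbers are hypothetical.

```python
import numpy as np

def outermost_objective(x):
    # True objective: rewards x but with a quadratic penalty for extremes
    # (a stand-in for "buy yoghurt, but don't spend all your savings").
    return x - 0.1 * x**2

def subproblem_proxy(x):
    # Proxy: closely matches the outermost objective for small x
    # (the "narrow domain"), but keeps rewarding x where the true
    # objective turns downward.
    return x

xs = np.linspace(0, 20, 201)

# Optimizing the proxy hard pushes x to the edge of the domain...
x_proxy = xs[np.argmax(subproblem_proxy(xs))]
# ...while the true optimum is interior (derivative 1 - 0.2x = 0 at x = 5).
x_true = xs[np.argmax(outermost_objective(xs))]

print(x_proxy)  # 20.0 — proxy optimum at the boundary
print(x_true)   # 5.0 — true optimum, where the penalty bites
```

The gap between `x_proxy` and `x_true` is exactly the failure mode the implicit background objectives are supposed to prevent.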
My current starting point would be standard methods for decomposing optimization problems, e.g. the sort covered in this course.
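For a flavor of what such decomposition methods look like, here is a minimal sketch (an assumed textbook-style example, not taken from the course itself) of dual decomposition: two subproblems each optimize their own utility against a shared price, and a coordinator adjusts the price until the shared budget constraint is respected. The price plays roughly the role of the "implicit extra constraints" a subproblem must honor.

```python
# Dual decomposition sketch: maximize log(x1) + 2*log(x2)
# subject to x1 + x2 <= B, by splitting into per-variable subproblems
# coordinated through a price lam on the shared budget.

B = 1.0  # shared budget

def solve_subproblem(a, lam):
    # Subproblem: maximize a*log(x) - lam*x over x > 0, which has
    # closed-form solution x = a / lam.
    return a / lam

lam = 1.0
for _ in range(200):
    x1 = solve_subproblem(1.0, lam)
    x2 = solve_subproblem(2.0, lam)
    # Coordinator (subgradient step on the dual): raise the price if
    # the budget is exceeded, lower it if there is slack.
    lam = max(1e-6, lam + 0.5 * (x1 + x2 - B))

print(round(x1 + x2, 3))  # ≈ 1.0: the subproblem solutions respect the budget
```

The point of the analogy: each subproblem never sees the budget directly, only the price, yet the decomposed solutions jointly satisfy the outer constraint, which is one candidate formalization of "subproblems that respect modularity."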