Holden wants to build Tool-AIs that output summaries of their calculations along with suggested actions. For Google Maps, I guess this would be the distances and driving times, but how does a Tool-AI summarize the more general calculations it might do?
It could give you the expected utility of each option, but it’s hard to see how that helps if we’re worried that its utility function or its EU calculations might be wrong. Or it could give a human-readable description of the predicted consequences of each option, but the process that produces such descriptions from the raw calculations would seem to require a great deal of intelligence in its own right (it might have to describe posthuman worlds in terms we can understand, for example), and that process would not itself be a “safe” Tool-AI, since its summaries would presumably not come with further alternative summaries and meta-summaries of how they were produced.
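To make the two candidate output formats concrete, here is a minimal sketch in Python of what such a per-option report might contain; the class and field names are invented for illustration and are not drawn from Holden's proposal:

```python
# Hypothetical shape of a Tool-AI's output; names are illustrative only.
from dataclasses import dataclass
from typing import List


@dataclass
class OptionReport:
    action: str                  # the suggested action
    expected_utility: float      # output of the (possibly mistaken) EU calculation
    consequence_summary: str     # human-readable description of predicted outcomes


@dataclass
class ToolAIReport:
    options: List[OptionReport]  # ranked alternatives, best first
    calculation_summary: str     # "summary of the calculations" behind the ranking
```

The worry above is about where `consequence_summary` and `calculation_summary` come from: whatever produces them is doing a lot of cognitive work that is not itself summarized.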
(My question might be tangential to your own comment. I just wanted your thoughts on it, and this seems to be the best place to ask.)
My understanding of Holden’s argument was that powerful optimization processes can be run in either tool-mode or agent-mode.
For example, Google Maps run in tool-mode optimizes routes but returns the result, along with alternatives and options for editing, rather than acting on it.
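A minimal sketch of that distinction, using a hypothetical route planner; everything here is illustrative rather than any real API:

```python
# Sketch of tool-mode vs agent-mode around the same optimization process.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Route:
    waypoints: List[str]
    distance_km: float
    drive_time_min: float


def plan_routes(origin: str, destination: str) -> List[Route]:
    """Stand-in for the underlying optimization process."""
    return [
        Route([origin, "A", destination], 12.4, 18.0),
        Route([origin, "B", destination], 13.1, 16.5),
    ]


def tool_mode(origin: str, destination: str) -> List[Route]:
    # Tool-mode: return the ranked alternatives and let a human inspect and choose.
    return sorted(plan_routes(origin, destination), key=lambda r: r.drive_time_min)


def agent_mode(origin: str, destination: str, act: Callable[[Route], None]) -> None:
    # Agent-mode: the same optimizer, but it acts on its top choice directly.
    best = tool_mode(origin, destination)[0]
    act(best)
```

The optimization is identical in both modes; the difference is whether the result is handed back for human review or fed straight into action.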
The point is that we don’t want it to be a black box—we want to be able to get inside its head, so to speak.
(Of course, we can’t do that with humans, and that hasn’t stopped us, but it’s still a nice goal.)