If the tool is sufficiently reflective to recommend improvements to itself, it will recommend that it be modified to just implement its proposed policies instead of printing them.
Perhaps. I noticed a related problem: someone will want to create a self-modifying AI. Let’s say we ask the Oracle AI about this plan. At present (as I understand it) we have no mathematical way to predict the effects of self-modification. (Hence Eliezer’s desire for a new decision theory that can do this.) So how would we have given our non-self-modifying Oracle that ability in the first place? Wouldn’t we need to know the math of getting the right answer in order to write a program that gets the right answer? And if it can’t answer the question:
What will it even do at that point?
If it happens to fail safely, will the humans involved interpret this non-answer to mean we should delay the plan for self-modifying AI?