But can’t we view any meta-level solution as also (trivially) an object-level solution?
Sure, but I’m saying that it seems easier to reach the object-level solution by working on the meta level (except insofar as working on the object level helps with meta-level understanding). As an analogy, if you need to factor a large number, it’s much easier to try to find an algorithm for doing so than to try to factor the number yourself.
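To make the factoring analogy concrete, here is a minimal sketch. It is only an illustration: the function name `factor` is hypothetical, and plain trial division is used for brevity even though it is far too slow for genuinely large numbers. The point is just that one short "meta-level" procedure yields the object-level answer for every input.

```python
def factor(n: int) -> list[int]:
    """Return the prime factors of n, with multiplicity, by trial division."""
    if n < 2:
        raise ValueError("n must be at least 2")
    factors = []
    d = 2
    # Any composite n has a prime factor no larger than sqrt(n).
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    # Whatever remains above 1 is itself prime.
    if n > 1:
        factors.append(n)
    return factors


print(factor(84))  # [2, 2, 3, 7]
```

Once the procedure exists, factoring a new number is a function call rather than a fresh piece of work, which is the sense in which the meta-level route seems easier.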
And if we aren’t able to verify that an AI is free from decision-theoretic flaws, then how can we trust it to self-modify to be free of such flaws?
How do we verify that an AI is free from decision-theoretic flaws, given that we don’t have a formal definition of what counts as a “decision-theoretic flaw”? (If we did have such a formal definition, the definition itself might have a flaw.) It seems like the only way to do that is to spend a lot of researcher hours searching the AI (or the formal definition) for flaws (or for what humans would recognize as flaws), but no matter how much time we spend, it seems like we can’t definitively rule out the possibility that some flaws remain that we haven’t found. (See Philosophy as interminable debate.)
One might ask why the same concern doesn’t apply to metaphilosophy. Well, I’m guessing that “correct metaphilosophy” might have a bigger basin of attraction around itself than “correct decision theory”, so not being able to be sure that we’ve found and fixed all flaws might be less crucial. As evidence for this, it seems clear that humans don’t have (i.e., aren’t already using) a decision theory that is in the basin of attraction around “correct decision theory”, but we do plausibly have a metaphilosophy that is in the basin of attraction around “correct metaphilosophy”.
What’s the best description of what you mean by “metaphilosophy” you can point me to? I think I have a pretty good sense of it, but it seems worthwhile to be as rigorous / formal / descriptive / etc. as possible.
This description from Three Approaches to “Friendliness” perhaps gives the best idea of what I mean by “metaphilosophy”:
Black-Box Metaphilosophical AI: Program the AI to use the minds of one or more human philosophers as a black box to help it solve philosophical problems, without the AI builders understanding what “doing philosophy” actually is.
White-Box Metaphilosophical AI: Understand the nature of philosophy well enough to specify “doing philosophy” as an algorithm and code it into the AI.
(One could also imagine approaches that are somewhere in between these two, where for example AI designers have some partial understanding of what “doing philosophy” is, and program the AI to learn from human philosophers based on this partial understanding.)
For more of my thoughts on this topic, see Some Thoughts on Metaphilosophy and the posts that it links to.