I guess it feels like I don’t know how we could know that we’re in the position of having “solved” meta-philosophy.
What I imagine is reaching a level of understanding of what we’re really doing (or what we should be doing) when we “do philosophy”, on par with our current understanding of what “doing math” or “doing science” consists of, or ideally a better level of understanding than that. (See Apparent Unformalizability of “Actual” Induction for one issue with our current understanding of “doing science”.)
I also don’t think we know how to specify a ground truth reasoning process that we could try to protect and run forever, and which we could be completely confident would come up with the right outcome (where something like HCH is a good candidate, but potentially with bugs/subtleties that need to be worked out).
Here I’m imagining something like putting a group of the best AI researchers, philosophers, etc. in some safe and productive environment (which includes figuring out the right rules of social interaction), where they can choose to delegate further to other reasoning processes, but don’t face any time pressure to do so. Obviously I don’t know how to specify this with all the details worked out, but that does not seem like a hugely difficult problem to solve, so I wonder what you mean/imply by “don’t think we know how”?
It feels like the thing we could do is build a set of better and better models of philosophy and check their results against held-out human reasoning and against each other.
If that’s all we do, it seems like it would be pretty easy to miss some error in the models, because we wouldn’t know that we should test for it. For example, there could be entire classes of philosophical problems that the models will fail on, which we won’t know about because we won’t have realized yet that those classes of problems even exist.
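To make this concrete, here is a minimal, purely illustrative Python sketch of what “check their results against held-out human reasoning and against each other” could amount to. All names and data in it (HELD_OUT_CASES, PHILOSOPHY_MODELS, the toy verdicts) are hypothetical placeholders, not anything proposed in this thread. The limitation described above shows up directly: problem classes missing from the held-out set contribute nothing to either score, so systematic failures on them go undetected.

```python
# Purely illustrative sketch: evaluate hypothetical "models of philosophy"
# against held-out human reasoning and against each other. All names and
# verdicts below are made-up placeholders.

from itertools import combinations

# Hypothetical held-out cases: questions paired with toy placeholder verdicts
# standing in for the conclusions of careful human reasoning.
HELD_OUT_CASES = [
    {"question": "Is a copied-then-deleted person the same person?", "human_verdict": "yes"},
    {"question": "Should we take a 51% bet that doubles total value?", "human_verdict": "no"},
]

# Hypothetical candidate models, each mapping a question to a verdict.
PHILOSOPHY_MODELS = {
    "model_a": lambda q: "yes",
    "model_b": lambda q: "yes" if "person" in q else "no",
}

def agreement_with_humans(model, cases):
    """Fraction of held-out cases where the model matches the human verdict."""
    hits = sum(model(c["question"]) == c["human_verdict"] for c in cases)
    return hits / len(cases)

def cross_model_agreement(models, cases):
    """Pairwise agreement between models on the same held-out questions."""
    scores = {}
    for (name_a, m_a), (name_b, m_b) in combinations(models.items(), 2):
        same = sum(m_a(c["question"]) == m_b(c["question"]) for c in cases)
        scores[(name_a, name_b)] = same / len(cases)
    return scores

if __name__ == "__main__":
    for name, model in PHILOSOPHY_MODELS.items():
        print(name, "agreement with humans:", agreement_with_humans(model, HELD_OUT_CASES))
    print("pairwise agreement:", cross_model_agreement(PHILOSOPHY_MODELS, HELD_OUT_CASES))
    # Blind spot: any class of problems absent from HELD_OUT_CASES contributes
    # nothing to these numbers, so systematic failures on it go undetected.
```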
Do you think this would lead to “good outcomes”? Do you think some version of this approach could be satisfactory for solving the problems in Two Neglected Problems in Human-AI Safety?
It could, but it seems much riskier than either of the approaches I described above.
Do you think there’s a different kind of thing that we would need to do to “solve metaphilosophy”? Or do you think that working on “solving metaphilosophy” roughly cashes out as “work on coming up with better and better models of philosophy in the model I’ve described here”?
Hopefully I answered these sufficiently above. Let me know if there’s anything I can clear up further.