It seems that humans, starting from a philosophically confused state, are liable to find multiple incompatible philosophies highly plausible in a path-dependent way, see for example analytic vs continental philosophy vs non-Western philosophies. I think this means if we train an AI to optimize directly for plausibility, there’s little assurance that we actually end up with philosophical truth.
A better plan is to train the AI in some way that does not optimize directly for plausibility, have some independent reason to think that the AI will be philosophically competent, and then use plausibility only as a test to detect errors in this process. I’ve written in the past that ideally we would first solve metaphilosophy so we that we can design the AI and the training process with a good understanding of the nature of philosophy and philosophical reasoning in mind, but failing that, I think some of the ideas in your list are still better than directly optimizing for plausibility.
You can do something like train it with RL in an environment where doing good philosophy is instrumentally useful and then hope it becomes competent via this mechanism.
This is an interesting idea. If it was otherwise feasible / safe / a good idea, we could perhaps train AI in a variety of RL environments, see which ones produce AIs that end up doing something like philosophy, and then see if we can detect any patterns or otherwise use the results to think about next steps.
It seems that humans, starting from a philosophically confused state, are liable to find multiple incompatible philosophies highly plausible in a path-dependent way, see for example analytic vs continental philosophy vs non-Western philosophies. I think this means if we train an AI to optimize directly for plausibility, there’s little assurance that we actually end up with philosophical truth.
A better plan is to train the AI in some way that does not optimize directly for plausibility, have some independent reason to think that the AI will be philosophically competent, and then use plausibility only as a test to detect errors in this process. I’ve written in the past that ideally we would first solve metaphilosophy so we that we can design the AI and the training process with a good understanding of the nature of philosophy and philosophical reasoning in mind, but failing that, I think some of the ideas in your list are still better than directly optimizing for plausibility.
This is an interesting idea. If it was otherwise feasible / safe / a good idea, we could perhaps train AI in a variety of RL environments, see which ones produce AIs that end up doing something like philosophy, and then see if we can detect any patterns or otherwise use the results to think about next steps.