I’m guessing you’re not being serious, but just in case you are, or in case someone misinterprets you now or in the future, I think we probably do not want to train AIs to give us answers optimized to sound plausible to humans, since that would make it even harder to determine whether or not the AI is actually competent at philosophy. (Not totally sure, as I’m confused about the nature of philosophy and philosophical reasoning, but I think we definitely don’t want to do that in our current epistemic state, i.e., unless we had some really good arguments that say it’s actually a good idea.)
How else will you train your AI?
Here are some other options which IMO reduce to a slight variation on the same thing or are unlikely to work:
Train your AI on predicting/imitating a huge amount of human output and then prompt/finetune the model to imitate human philosophy and hope this works. This is a reasonable baseline, but I expect it to clearly fail to produce sufficiently useful answers without further optimization. I also think it’s de facto optimizing for plausibility to some extent due to properties of the human answer distribution.
Train your AI to give answers which sound extremely plausible (aka extremely likely to be right) in cases where humans are confident in the answers and then hope for generalization.
Train your AIs to give answers which pass various consistency checks. This reduces back to a particular notion of plausibility.
Actually align your AI in some deep and true sense and ensure it has reasonably good introspective access. Then, just ask it questions. This is pretty unlikely to be technically feasible IMO, at least for the first very powerful AIs.
You can do something like train it with RL in an environment where doing good philosophy is instrumentally useful and then hope it becomes competent via this mechanism. This doesn’t solve the elicitation problem, but could in principle ensure the AI is actually capable. Further, I have no idea what such an environment would look like, if any exists. (There are clear blockers to using this sort of approach to evaluate alignment work without some insane level of simulation; I think similar issues apply with philosophy.)
Ultimately, everything is just doing some sort of optimization for something like “how good do you think it is” (aka plausibility). For instance, I do this while thinking of ideas. So, I don’t really think this is avoidable at some level. You might be able to avoid gaps in abilities between the entity optimizing and the entity judging (as is typically the case in my brain) and this solves some of the core challenges TBC.
It seems that humans, starting from a philosophically confused state, are liable to find multiple incompatible philosophies highly plausible in a path-dependent way; see, for example, analytic vs. continental philosophy vs. non-Western philosophies. I think this means that if we train an AI to optimize directly for plausibility, there’s little assurance that we actually end up with philosophical truth.
A better plan is to train the AI in some way that does not optimize directly for plausibility, have some independent reason to think that the AI will be philosophically competent, and then use plausibility only as a test to detect errors in this process. I’ve written in the past that ideally we would first solve metaphilosophy so that we can design the AI and the training process with a good understanding of the nature of philosophy and philosophical reasoning in mind, but failing that, I think some of the ideas in your list are still better than directly optimizing for plausibility.
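“You can do something like train it with RL in an environment where doing good philosophy is instrumentally useful and then hope it becomes competent via this mechanism.”
This is an interesting idea. If it were otherwise feasible / safe / a good idea, we could perhaps train AI in a variety of RL environments, see which ones produce AIs that end up doing something like philosophy, and then see if we can detect any patterns or otherwise use the results to think about next steps.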
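As a toy illustration of the distinction drawn earlier in this thread between optimizing directly for plausibility and using plausibility only as a test, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not anything proposed in the comments above: the function name `simulate`, the Gaussian "true quality" `q`, the noisy judge score `p = q + noise`, and the pass threshold of 0 are all hypothetical. It only shows one narrow thing: how much the judge overrates the answer it ends up accepting under each scheme.

```python
import random


def simulate(n_candidates=50, n_trials=2000, judge_noise=1.0, seed=0):
    """Toy model (all assumptions): each answer has a hidden true quality
    q ~ N(0, 1); a human judge only sees plausibility p = q + N(0, judge_noise).

    Returns the average overrating (p - q) of the accepted answer under
    (a) direct optimization for plausibility and (b) plausibility as a check.
    """
    rng = random.Random(seed)
    optimized_gap = 0.0  # summed (p - q) when picking the most plausible answer
    checked_gap = 0.0    # summed (p - q) when one answer is merely checked
    checked_count = 0

    for _ in range(n_trials):
        answers = []
        for _ in range(n_candidates):
            q = rng.gauss(0.0, 1.0)
            p = q + rng.gauss(0.0, judge_noise)
            answers.append((p, q))

        # (a) Direct optimization: keep whichever answer the judge likes best.
        best_p, best_q = max(answers)
        optimized_gap += best_p - best_q

        # (b) Plausibility as a test only: the answer comes from a process
        # that never looked at p; we just reject it if it looks bad.
        p, q = answers[0]
        if p > 0.0:
            checked_gap += p - q
            checked_count += 1

    return optimized_gap / n_trials, checked_gap / checked_count


if __name__ == "__main__":
    optimized, checked = simulate()
    print(f"mean overrating when optimizing for plausibility: {optimized:.2f}")
    print(f"mean overrating when only checking plausibility:  {checked:.2f}")
```

In this toy setup the first number comes out larger than the second: selecting hard on the judge's score also selects for the judge's errors, while a simple pass/fail check does so much more weakly (though not zero). It says nothing about path dependence, about what the "independent reason" for trusting the AI would be, or about whether real philosophical plausibility behaves anything like additive noise.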