I also think this is interesting, but whenever I see a proposal like this I like to ask: does it work on philosophical topics, where we don’t have a list of true and false statements that we can be very sure about, and we also don’t have a clear understanding of what kinds of arguments or sentences count as good arguments and what kinds count as manipulation? There could be deception tactics specific to philosophy or certain philosophical topics, which can’t be found by training on other topics (and you can’t train directly on philosophy because of the above issues). You don’t talk about this problem directly, but it’s a bit similar to the first “disadvantage” that you mention:
The deception tactics used may be specific to that particular context, and not something that can be learned ahead of time.
I’m also concerned about how we’ll teach AIs to think about philosophical topics (and indeed, how we’re supposed to think about them ourselves). But my intuition is that proposals like this look great from that perspective.
For areas where we don’t have empirical feedback-loops (like many philosophical topics), I imagine that the “baseline solution” for getting help from AIs is to teach them to imitate our reasoning. Either just by literally writing the words that it predicts that we would write (but faster), or by having it generate arguments that we would think look good. (Potentially recursively, cf. amplification, debate, etc.)
(A different direction is to predict what we would think after thinking about it more. That has some advantages, but it doesn’t get around the issue where we’re at best speeding things up.)
One of the few plausible-seeming ways to outperform that baseline is to identify epistemic practices that work well on questions where we do have empirical feedback loops, and then transfer those practices to questions where we lack such feedback loops. (Cf. imitative generalization.) The above proposal is doing that for a specific sub-category of epistemic practices (recognising ways in which you can be misled by an argument).
Worth noting: The broad category of “transfer epistemic practices from feedback-rich questions to questions with little feedback” contains a ton of stuff, and is arguably the root of all our ability to reason about these topics:
Evolution selected human genes for the ability to accomplish stuff in the real world. That made us much better at reasoning about philosophy than our chimp ancestors were.
Cultural evolution seems to have at least partly promoted reasoning practices that do better at deliberation. (Cf. possible benefits from coupling competition and deliberation.)
If someone is optimistic that humans will be better at dealing with philosophy after intelligence-enhancement, I think they’re mostly appealing to stuff like this, since intelligence would typically be measured in areas where you can recognise excellent performance.
For areas where we don’t have empirical feedback-loops (like many philosophical topics), I imagine that the “baseline solution” for getting help from AIs is to teach them to imitate our reasoning. Either just by literally writing the words that it predicts that we would write (but faster), or by having it generate arguments that we would think look good. (Potentially recursively, cf. amplification, debate, etc.)
This seems like the default road that we’re walking down, but can ML learn everything that is important to learn? I questioned this in Some Thoughts on Metaphilosophy.
One of the few plausible-seeming ways to outperform that baseline is to identify epistemic practices that work well on questions where we do have empirical feedback loops, and then transfer those practices to questions where we lack such feedback loops.
It seems plausible to me that this could help, but also plausible that philosophical reasoning (at least partly) involves cognition that’s distinct from empirical reasoning, so that such techniques can’t capture everything that is important to capture. (Some Thoughts on Metaphilosophy contains some conjectures relevant to this, so please read that if you haven’t already.)
BTW, it looks like you’re working/thinking in this direction, which I appreciate, but doesn’t it seem to you that the topic is super neglected (even compared to AI alignment) given that the risks/consequences of failing to correctly solve this problem seem comparable to the risk of AI takeover? (See for example Paul Christiano’s probabilities.) I find it frustrating/depressing how few people even mention it in passing as an important problem to be solved as they talk about AI-related risks (Paul being a rare exception to this).
doesn’t it seem to you that the topic is super neglected (even compared to AI alignment) given that the risks/consequences of failing to correctly solve this problem seem comparable to the risk of AI takeover?
Yes, I’m sympathetic. Among all the issues that will come with AI, I think alignment is relatively tractable (at least it is now) and that it has an unusually clear story for why we shouldn’t count on being able to defer it to smarter AIs (though that might work). So I think it’s probably correct for it to get relatively more attention. But even taking that into account, the non-alignment singularity issues do seem too neglected.
I’m currently trying to figure out what non-alignment stuff seems high-priority and whether I should be tackling any of it.