They are simulators (https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators), not question answerers. Also, I am sure Minerva does pretty well on this task; probably not 100% reliably, but humans are also not 100% reliable if they are required to answer immediately. If you want the ML model to simulate thinking [better], make it solve the task 1000 times and select the most popular answer (which is already quite a popular approach for some models; see the sketch below). I think PaLM would be effectively 100% reliable.
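A minimal sketch of that majority-vote idea (often called "self-consistency"): sample the model many times at nonzero temperature and keep the most frequent answer. `sample_answer` here is a hypothetical stand-in that simulates a model which is right 80% of the time, not a real API call.

```python
import random
from collections import Counter

def sample_answer(prompt: str) -> str:
    """Hypothetical stand-in for one stochastic model call
    (e.g. sampling at temperature > 0). Simulates a model that
    answers correctly 80% of the time."""
    return "correct" if random.random() < 0.8 else "wrong"

def majority_vote(prompt: str, n_samples: int = 1000) -> str:
    """Sample n_samples answers and return the most frequent one."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# With 1000 samples, an 80%-per-sample model virtually always
# lands on the majority (correct) answer.
print(majority_vote("What is 2 + 2?"))
```

The point is that per-sample reliability well below 100% still yields a near-deterministic aggregate answer once you vote over enough samples.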