I recently found myself in a spirited debate with a friend about whether large language models (LLMs) like GPT-4 are mere stochastic parrots or can genuinely engage in deeper reasoning.
We both presented a range of technical arguments and genuinely considered each other’s points. Despite our efforts, we ended up firmly holding onto our initial positions. This led me to ponder: How can I determine if I am right when both of us are convinced of our correctness, yet at least one of us must be wrong?
To address this, I developed a scoring system using measurable metrics to determine who is more likely to be correct. I call it the AmIRight Score.
AmIRight Score
The AmIRight Score assigns points across five categories, for a maximum of 35 points, to help gauge who is more likely to be correct. Here’s how you can calculate your score:
1. Clarity in Falsification Criteria – 10 points
A person who can clearly articulate how their belief could be proven wrong demonstrates the ability to conceptualize alternative truths. If someone cannot envision any scenario that would falsify their belief, it suggests that their belief might be dogmatic.
Example of a good falsification statement: “I would believe AI is capable of deeper reasoning if it can be trained on data containing no information about chess, and then perform as well as a human who is also new to the game, given the same set of instructions.”
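Example of a bad falsification statement: “I would believe AI is capable of deeper reasoning if all the scientists in the world acknowledged they were wrong about reasoning based on new evidence about the brain.”
The first statement names a concrete test that anyone could run and check. The second depends on a unanimous change of opinion that is practically unattainable and specifies no observable result, so it can never really be put to the test.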
2. The Simplified Ideological Turing Test – 10 points
The Ideological Turing Test evaluates how well you can articulate the opposing viewpoint. In the simplified version, both parties write arguments for their own position and the opposite position. A neutral judge then scores how well each argument is presented without knowing who wrote what.
3. Forecasting Accuracy – 5 points
Forecasting accuracy assesses the correctness of your predictions about future events. This metric rewards those whose predictions consistently turn out to be accurate. Both parties should take the same forecasting test, and points are awarded based on performance.
4. Forecasting Calibration – 5 points
Forecasting calibration measures how well your confidence levels match actual outcomes. It’s not just about being right but also about accurately assessing the likelihood of being right. The same forecasting test used for accuracy can measure calibration, with points awarded based on the Brier score of the predictions, where a lower score is better.
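To make the two forecasting metrics concrete, here is a minimal Python sketch. It assumes each forecast is a probability between 0 and 1 for a yes/no question and each outcome is recorded as 1 (happened) or 0 (did not happen). The function names, the 0.5 threshold for counting a forecast as "right", and the sample numbers are illustrative assumptions, not part of the scoring system itself. The Brier score is the mean squared difference between the stated probabilities and the outcomes; it also rewards confident correct forecasts, so it is not a pure measure of calibration.

```python
def accuracy(forecasts, outcomes, threshold=0.5):
    """Fraction of forecasts that land on the correct side of `threshold`.

    The 0.5 threshold is an assumption made for this sketch.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equally sized, non-empty lists")
    hits = sum((p >= threshold) == bool(o) for p, o in zip(forecasts, outcomes))
    return hits / len(forecasts)


def brier_score(forecasts, outcomes):
    """Mean squared difference between probabilities and 0/1 outcomes.

    0.0 is a perfect score; always answering 0.5 yields 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equally sized, non-empty lists")
    squared_errors = ((p - o) ** 2 for p, o in zip(forecasts, outcomes))
    return sum(squared_errors) / len(forecasts)


# Made-up sample: four yes/no questions answered by one debater.
forecasts = [0.9, 0.7, 0.2, 0.6]
outcomes = [1, 1, 0, 0]

print(accuracy(forecasts, outcomes))               # 0.75
print(round(brier_score(forecasts, outcomes), 3))  # 0.125
```

How these raw numbers are converted into the 5 points available for each category is left open here; one simple option would be to award more points to whichever debater has the higher accuracy and the lower Brier score.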
5. Deeper Understanding of the Subject – 5 points
This metric evaluates your comprehension of the subject’s complexities and nuances beyond surface-level knowledge.
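Putting the five categories together, the total can be tallied with a few lines of Python. This is only a sketch: the category maxima come from the list above, while the dictionary keys, the function name, and the sample points are hypothetical names and numbers chosen for illustration.

```python
# Maximum points per category, as listed above (35 in total).
MAX_POINTS = {
    "falsification_clarity": 10,
    "ideological_turing_test": 10,
    "forecasting_accuracy": 5,
    "forecasting_calibration": 5,
    "subject_understanding": 5,
}


def amiright_score(points):
    """Sum the awarded points, checking each against its category maximum.

    Categories missing from `points` count as 0.
    """
    total = 0
    for category, maximum in MAX_POINTS.items():
        awarded = points.get(category, 0)
        if not 0 <= awarded <= maximum:
            raise ValueError(f"{category} must be between 0 and {maximum}")
        total += awarded
    return total


# Made-up scores for two debaters.
me = {
    "falsification_clarity": 8,
    "ideological_turing_test": 6,
    "forecasting_accuracy": 3,
    "forecasting_calibration": 4,
    "subject_understanding": 2,
}
friend = {
    "falsification_clarity": 5,
    "ideological_turing_test": 9,
    "forecasting_accuracy": 4,
    "forecasting_calibration": 3,
    "subject_understanding": 4,
}

print(amiright_score(me), amiright_score(friend))  # 23 25
```

Comparing the two totals gives a rough indication of which debater is more likely to be correct; it does not settle the question itself.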
Final Thoughts
While the AmIRight Score can be a useful tool for assessing who is more likely to be right in a one-on-one debate, its applicability might be limited in areas where there are many brilliant minds on either side of the argument. Nonetheless, it provides a structured approach to critically evaluating our beliefs and arguments.
How do you know you are right when debating? Calculate your AmIRight score.