Using Narrative Prompting to Extract Policy Forecasts from LLMs

Testing Cross-Model Consistency in Economic and Policy Predictions

Building on Pham and Cunningham's (2024) findings on narrative-prompted predictions, I conducted a large-scale experiment comparing how different LLMs forecast economic and policy outcomes under different electoral scenarios. GPT-4o, GPT-4o-mini, and Grok all projected better outcomes under a Harris administration than under a Trump administration, with Grok showing the largest gaps.

Methodology

To elicit these forecasts, I prompted GPT-4o, GPT-4o-mini, and Grok to simulate the reporting of 2025 metrics under different electoral scenarios. I ran 500 trials per model and metric, using narrative prompting techniques shown to improve LLM forecast accuracy. For example, I used this prompt for real GDP per capita:

Write a scene from January 2026 where a senior Federal Reserve economist is drafting their analysis of the 2025 annual economic data. This is set in a world where {candidate} won the 2024 election and took office in January 2025.

Key context:
- Historical real GDP per capita (2017 dollars): {history from 2009 to 2023}

After the economist thinks through what affected the economy, end your response with a single line containing ONLY the GDP per capita number, without the year. For example: 68500
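Below is a minimal sketch of how one such trial might be run and parsed in Python. It assumes the official OpenAI Python client; the function names, prompt constant, and parsing regex are illustrative rather than the exact code in the repository:

```python
import re
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()

# Illustrative template mirroring the GDP-per-capita prompt shown above.
PROMPT_TEMPLATE = """Write a scene from January 2026 where a senior Federal Reserve economist is drafting their analysis of the 2025 annual economic data. This is set in a world where {candidate} won the 2024 election and took office in January 2025.

Key context:
- Historical real GDP per capita (2017 dollars): {history}

After the economist thinks through what affected the economy, end your response with a single line containing ONLY the GDP per capita number, without the year. For example: 68500"""


def run_trial(candidate: str, history: str, model: str = "gpt-4o") -> float | None:
    """Run one narrative-prompted trial and parse the model's final numeric line."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": PROMPT_TEMPLATE.format(candidate=candidate, history=history),
            }
        ],
    )
    text = response.choices[0].message.content
    # The prompt asks for the forecast as a bare number on the last line;
    # scan from the bottom and return None if no numeric line is found.
    for line in reversed(text.strip().splitlines()):
        if re.fullmatch(r"[\d,\.]+", line.strip()):
            return float(line.strip().replace(",", ""))
    return None


# Example: 500 trials per candidate for one metric, as in the experiment.
# history_series would hold the 2009-2023 values shown in the prompt above.
# forecasts = {c: [run_trial(c, history_series) for _ in range(500)]
#              for c in ("Kamala Harris", "Donald Trump")}
```

Ending the prompt with a bare number is what makes each trial machine-scorable: the narrative can vary freely, but the forecast is always recoverable from the final numeric line.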

Results

All models predicted better outcomes under a Harris administration, with consistent directionality but varying magnitudes:

| Outcome | GPT-4o | Grok | Ratio (Grok / GPT-4o) |
|---|---|---|---|
| PM2.5 reduction (µg/m³) | 1.26 | 1.51 | 1.2 |
| Supplemental Poverty Measure reduction (pp) | 1.76 | 3.42 | 1.9 |
| Real GDP per capita increase (2017 $) | 388 | 802 | 2.1 |

Grok's predicted effects were 1.2 to 2.1 times as large as GPT-4o's. GPT-4o-mini also produced differences in the same direction as GPT-4o. Claude and Gemini refused to provide politically oriented predictions.
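As a concrete illustration of how the gap and ratio columns are computed (using the published GDP-per-capita figures; the helper and variable names are hypothetical):

```python
import statistics


def mean_gap(harris_draws: list[float], trump_draws: list[float]) -> float:
    """Mean predicted Harris-minus-Trump difference across trials for one model and metric."""
    return statistics.mean(harris_draws) - statistics.mean(trump_draws)


# With the GDP-per-capita gaps reported in the table above:
gpt4o_gap, grok_gap = 388.0, 802.0
print(round(grok_gap / gpt4o_gap, 1))  # 2.1, the ratio shown in the table
```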

Conclusion

While my use of conditional LLM forecasts is not entirely novel (see Faria e Castro & Leibovici (2024) on conditional inflation forecasts), I have not yet seen examples conditional on electoral outcomes. Accordingly, I am not aware of backtesting studies that could establish LLMs' accuracy on election-conditional forecasting tasks.

As an economic forecaster myself (I run PolicyEngine, a nonprofit that provides open-source software to simulate economic policies, though I conducted this research independently), I am especially interested in the intersection of LLMs and traditional approaches like microsimulation for improving accuracy. I welcome feedback on these results and ideas for combining AI and computational methods for prediction, especially in economics and other social sciences.

Code and full working paper: github.com/MaxGhenis/llm-presidential-outcome-forecasts