Introducing Squiggle AI

We’re releasing Squiggle AI, a tool that generates probabilistic models using the Squiggle language. This can provide early cost-effectiveness models and other kinds of probabilistic programs.

No prior Squiggle knowledge is required to use Squiggle AI. Simply ask for whatever you want to estimate, and the results should be fairly understandable. The Squiggle programming language acts as an adjustable backend, but isn’t mandatory to learn.

Beyond being directly useful, we’re interested in Squiggle AI as an experiment in epistemic reasoning with LLMs. We hope it will help highlight potential strengths, weaknesses, and directions for the field.

Screenshots

The “Playground” view after it finishes a successful workflow. Form on the left, code in the middle, code output on the right.
The “Steps” page. Shows all of the steps that the workflow went through, next to the form on the left. For each, shows a simplified view of recent messages to and from the LLM.

Motivation

Organizations in the effective altruism and rationalist communities regularly rely on cost-effectiveness analyses and fermi estimates to guide their decisions. QURI’s mission is to make these probabilistic tools more accessible and reliable for altruistic causes.

However, our experience with tools like Squiggle and Guesstimate has revealed a significant challenge: even highly skilled domain experts frequently struggle with the basic programming requirements and often make errors in their models. This suggests a need for alternative approaches.

Language models seem particularly well-suited to address these difficulties. Fermi estimates typically follow straightforward patterns and rely on common assumptions, making them potentially ideal candidates for LLM assistance. Previous direct experiments with Claude and ChatGPT alone proved insufficient, but with substantial iteration, we’ve developed a framework that significantly improves the output quality and user experience.

We’re focusing specifically on cost-effectiveness estimates because they represent a strong balance between specificity and broad applicability. These models often contain recurring patterns and sub-variables, yet can be applied to a wide spectrum of human decisions—from different kinds of charitable giving to policy interventions.

Looking ahead, if we succeed in automating cost-effectiveness analyses, we believe this could serve as one foundation for enhanced LLM reasoning capabilities. This work could pave the way for more sophisticated applications of AI in decision-making and impact assessment.

Description

“Squiggle AI” combines LLM capabilities with specialized scaffolding to produce fermi estimates in the Squiggle programming language. The frontend runs on Squiggle Hub. The code is open-source and available on GitHub.

Since LLMs don’t naturally write valid Squiggle code, substantial work has gone into bridging this gap. Our system provides comprehensive documentation to the LLM, implements automatic error correction, and includes steps for model improvement and documentation.

Squiggle AI currently calls Claude 3.5 Sonnet. Each single LLM call costs $0.01-0.04, with full workflows ranging from $0.10-0.35. While we cover these costs currently, users can provide their own API keys. Most workflows complete within 20 seconds to 3 minutes. Resulting models typically are 100-200 lines long.

The tool offers two specific workflows:

  • “Create”: Takes a high-level prompt to generate and enhance new Squiggle code

  • “Improve”: Fixes broken code and implements requested improvements

Example Outputs

Select recent examples:

Is restoring the Notre Dame Cathedral or donating to the AMF more cost-effective, as a charity?
Comparing health impacts against tourism benefits, the model estimates that a $1B donation to AMF would generate 8M QALYs versus 9K QALYs for Notre Dame. The core calculation uses just 12 lines of code.

What is the cost-effectiveness of a 10-person nonprofit subscribing to Slack Pro?
This makes the core assumption that users will save 1 to 4 hours per month from having a searchable chat history. Results suggest an ROI of approximately 10x.

What is the cost-effectiveness of opening a new Bubble Tea store in Berkeley, CA?
Our most-tested example, popular in workshops for being fun and relevant (at least for people around Berkeley). When accounting for failure risk, the model usually shows negative expected value suggesting a risky investment.

Estimate the cost-effectiveness of different Animal Welfare interventions
Inspired by a discussion of genetic modification of chickens
Evaluated five intervention types, with legislative change emerging as most cost-effective, outperforming genetic modification (ranked second).

Estimate the probability that a single vote will decide the US presidential election, in various states
Inspired by some recent work for the 2024 election. Provides state-level probability estimates through two key data tables. Better input data would improve accuracy.

What is the probability that there will be an AGI Manhattan Project in the next decade, and how will it happen?
Projects 24% probability of an “AGI Manhattan Project” within a decade, primarily driven by potential crisis scenarios.

What is the expected benefit of a 30-year old male to get a HPV vaccine?
Analyzes benefits for 30-year-old males. Despite high costs ($500-$1,000 uninsured), estimates suggest $1k-$100k in expected benefits.

Estimate the cost-effectiveness of taboo activities for a teenager
Demonstrates how using AI scaffolding can do reasoning that goes against social norms, in favor of first-principles thinking. If you ask Claude directly to estimate the benefits of smoking cigarettes or of recreational sex for teenagers, it will usually refuse. This estimate by Squiggle AI states that smoking cigarettes would provide $500 to $2k per year of positive social capital. This is clearly overconfident, but it demonstrates the system’s willingness to make unconventional assumptions.

A more extensive collection of examples is available here. While some older examples there were manually edited, the examples listed above were not. Examples were typically selected out of 2-3 generations each.
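To give a feel for what these models compute, here is a minimal Python sketch of the Slack Pro example. It is not the generated Squiggle model. It assumes the "1 to 4 hours" range is read as a 90% interval of a lognormal distribution, which is how Squiggle's `to` operator treats a positive range. The subscription price and the dollar value of a saved hour are illustrative assumptions, not figures from the generated model.

```python
import math
import random
import statistics

Z_95 = 1.6448536269514722  # 95th percentile of the standard normal


def lognormal_from_90ci(low, high, rng):
    """Draw from a lognormal whose 5th and 95th percentiles are low and high."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z_95)
    return rng.lognormvariate(mu, sigma)


def slack_roi_samples(n=20_000, seed=0):
    """Monte Carlo samples of monthly benefit divided by monthly cost."""
    rng = random.Random(seed)
    team_size = 10               # from the prompt: a 10-person nonprofit
    price_per_user_month = 8.75  # ASSUMPTION: illustrative USD price
    samples = []
    for _ in range(n):
        # From the example: 1 to 4 hours saved per user per month.
        hours_saved = lognormal_from_90ci(1, 4, rng)
        # ASSUMPTION: USD value of one saved hour.
        value_per_hour = lognormal_from_90ci(25, 75, rng)
        benefit = team_size * hours_saved * value_per_hour
        cost = team_size * price_per_user_month
        samples.append(benefit / cost)
    return samples


roi = slack_roi_samples()
print(f"median ROI: {statistics.median(roi):.1f}x")
```

Changing either assumed input shifts the ROI directly, which is one reason to treat any single output as a starting point.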

How Good Is It?

We don’t yet have quantitative measures of output quality, partly due to the challenge of establishing ground-truth for cost-effectiveness estimates. However, we do have a variety of qualitative results.

Early Use

As the primary user, I (Ozzie) have seen dramatic improvements in efficiency—model creation time has dropped from 2-3 hours to 10-30 minutes. For quick gut-checks, I often find the raw AI outputs informative enough to use without editing.

Our three Squiggle workshops (around 20 total attendees) have shown encouraging results, with participants strongly preferring Squiggle AI over manual code writing. Early adoption has been modest but promising—in recent months, 30 users outside our team have run 168 workflows total.

Accuracy Considerations

As with most LLM systems, Squiggle AI tends toward overconfidence and may miss crucial factors. We recommend treating its outputs as starting points rather than definitive analyses. The tool works best for quick sanity checks and initial model drafts.

Current Limitations

Several technical constraints affect usage:

  • Code length soft-caps at 200 lines

  • Frequent workflow stalls from rate limits or API balance issues

  • Auto-generated documentation is decent but has gaps, particularly in outputting plots and diagrams

While slower and more expensive than single LLM queries, Squiggle AI provides more comprehensive and structured output, making it valuable for users who want detailed, adjustable, and documentable reasoning behind their estimates.

Alternatives

Direct LLM Estimates

Standard LLMs can perform basic numeric estimates quickly and cheaply. However, they typically lack support for executing programming functions or probabilistic simulations.

LLMs that Call Python

Recent systems with Python interpreters offer more computational power but aren’t optimized for intuitive probabilistic modeling. The results are not as interactive as Squiggle playgrounds are.

Basic Squiggle Integration

It’s possible to ask LLMs to generate Squiggle code for you. While most LLMs don’t know much about Squiggle, you can use our custom prompt or the Squiggle Bot on ChatGPT. These can work in a pinch, but are generally a lot less powerful than Squiggle AI.

Using Squiggle AI

Getting Started

  1. Create a Squiggle Hub account (required for abuse prevention)

  2. Navigate to the “AI” section in the header menu on the right

  3. Choose your workflow type:

    • “Create” to generate new models from prompts

    • “Improve” to refine existing Squiggle code

  4. Fill out the corresponding form. Settings have hover tooltips with extra information.

  5. Click “Start Workflow”

  6. If you’d like to run multiple workflows with the same settings, click “Start Workflow” multiple times.

  7. You should see a new workflow begin. Each LLM query will likely take 5 to 10 seconds, and the full workflow can take up to 3 minutes.

The Workflow form

Using Workflows

When you start a workflow, you’ll see real-time updates as the LLM processes your request. The interface displays:

  • Step-by-step progress through the LLM stages

  • Detailed logs upon completion

  • The most recent working version (or last failing version if errors occur)

Note that workflows occasionally stall—if this happens, simply start a new one.

While all workflow runs are saved in our backend (and will be accessible to you), older workflow logs get cleared out over time.

Saving Your Work

To save and/or share your models:

Keep It Simple

Start with straightforward cost-effectiveness and probabilistic models. Complex requests often fail or require multiple iterations. It’s often fine to bring a bunch of considerations into a cost-benefit analysis; the challenge comes when some of them require difficult functions.

Generate Multiple Models

Generate 2-4 different model variations for each question to capture a range of perspectives. Since Squiggle AI makes parallel model generation straightforward, you can easily launch multiple workflows with identical inputs. Each individual model is fairly inexpensive.
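Once you have several runs, one simple way to compare them is to look at each run's median next to the quantiles of all samples pooled together. The sketch below is only an illustration of that idea, not a feature of Squiggle AI. It assumes you have exported the output samples of each model as plain lists of numbers, and the inputs shown are made up.

```python
import statistics


def quantile(sorted_values, q):
    """Simple floor-index quantile of an already-sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


def summarize_runs(runs):
    """Compare several runs: per-run medians plus pooled 5%/50%/95% quantiles.

    Pooling gives each run equal weight only if every run has the same
    number of samples.
    """
    pooled = sorted(x for run in runs for x in run)
    return {
        "run_medians": [statistics.median(run) for run in runs],
        "pooled_p5": quantile(pooled, 0.05),
        "pooled_p50": quantile(pooled, 0.50),
        "pooled_p95": quantile(pooled, 0.95),
    }


# Hypothetical output samples from two workflows given the same prompt.
summary = summarize_runs([[1, 2, 3], [10, 20, 30]])
print(summary)
```

If the run medians differ by an order of magnitude, as in this toy input, that is a sign the runs made different key assumptions and are worth reading side by side.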

Be Specific with Details

When modeling specific scenarios:

  • Include key input statistics and metrics

  • Outline your intended outputs

  • Consider potential externalities and unusual but important considerations

  • Bring in research from tools like Perplexity

Optimize Your Workflow

  • Start with minimal settings (0 “Numeric” and “Documentation” steps)

  • Start with the “Create” workflow, then later use the “Improve” workflow to enhance your favorite models

  • Break complex problems into smaller components (100-200 lines each)

API Usage

For extensive use (20+ runs), consider using your own Anthropic API key. This helps ensure consistent service and reduces load on our shared token.

Complementary Tools

Research and Data Gathering

While Squiggle AI excels at model generation, it doesn’t perform web searches or data collection. For comprehensive analysis, consider combining it with:

Alternative LLM Integration

OpenAI o1 and other models can complement Squiggle AI’s capabilities. While these tools may outperform Claude 3.5 Sonnet on certain tasks, they typically come with other trade-offs. They can be useful for early ideation, coming up with crucial considerations, or even starting off with a Python model or similar.

Probability Distribution Tools

For estimating specific probability distributions, consider emerging tools like FutureSearch. Squiggle input estimates are often not calibrated.

This ecosystem approach—using each tool for its strengths—often produces the best analyses.

Privacy

Squiggle AI outputs on Squiggle Hub are private. If you want to share them or make them public, you can explicitly do that by creating new models.

Note that our team can access Squiggle AI results for diagnostic purposes. If you require more complete data privacy, please contact us to discuss options.

Technical Details

Performance and Costs

Since Squiggle is a young programming language, LLM-written Squiggle code typically needs more troubleshooting and adjustment than LLM-written code in more established languages.

Squiggle AI currently uses Claude 3.5 Sonnet for all operations. It uses prompt caching to cache a lot of information (around 20k tokens) about the Squiggle language. LLM queries typically cost around $0.02 each to run, and more when large models are being edited.
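A rough back-of-the-envelope calculation shows why prompt caching matters at this scale. The per-token prices below are assumptions based on published list prices for Claude 3.5 Sonnet and may be out of date. The non-cached input and output token counts are illustrative guesses, and the sketch ignores the one-time cost of writing the cache.

```python
# ASSUMPTION: USD per million tokens; check current pricing before relying on these.
PRICE_INPUT = 3.00
PRICE_CACHE_READ = 0.30
PRICE_OUTPUT = 15.00


def query_cost(cached_tokens, fresh_input_tokens, output_tokens, use_cache=True):
    """Estimate the USD cost of one LLM query.

    cached_tokens is the shared documentation prefix (around 20k tokens in
    the text above). With use_cache=False it is billed at the normal input rate.
    """
    prefix_rate = PRICE_CACHE_READ if use_cache else PRICE_INPUT
    total = (
        cached_tokens * prefix_rate
        + fresh_input_tokens * PRICE_INPUT
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# ASSUMPTION: 1,000 fresh input tokens and 800 output tokens per query.
with_cache = query_cost(20_000, 1_000, 800)
without_cache = query_cost(20_000, 1_000, 800, use_cache=False)
print(f"with cache: ${with_cache:.3f}, without: ${without_cache:.3f}")
```

Under these assumptions a cached query lands near the $0.02 figure, while the same query without caching would cost several times more.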

In terms of accuracy, do not place trust in the default distributions or in the key model assumptions. We’ve found that LLM outputs are often highly overconfident (despite some warnings in the prompts), and often leave out important considerations. This is another reason to do runs multiple times (often, different runs reach very different results), and to adjust models heavily with your own views. In the future, there’s clearly more engineering work that can be done in this area, yet it will likely be very difficult to do fully robustly.

State Machine Details

Squiggle AI makes use of a set of discrete steps, each with specific instructions and custom functionality.

  1. Generation: Makes a first attempt at writing Squiggle code, based on a provided prompt. Tries to naively follow our Style Guide, including writing some tests.

  2. Bug Fixing: Attempts to fix parsing or runtime errors. This step is run until there are no further errors.

  3. Update Estimates: Checks the results of the model. If there are broken tests or suspicious outputs, it will make changes.

  4. Document: Makes changes to better match the style guide. This typically means improving variable annotation and model documentation, but it sometimes also means adjusting variables or adding code organization.

Step 1 is run once, step 2 is run until the code successfully executes, and steps 3 and 4 are run as many times as the user requests. In informal testing, it seems like more runs of steps 3 and 4 improve performance up to some point.

A simplified state machine, showing a typical path. Note that the stages “Update Estimates” and “Document” can be optionally run 0 or more times, based on user request. Also, the “Fix” state is used whenever there is broken code, even if it comes from “Update Estimates” or “Document.”
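The control flow described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual Squiggle AI implementation. The step functions are hypothetical stand-ins for LLM calls, `check` stands in for running the code and returning an error message (or `None`), and the `max_fixes` cap is an added assumption to keep the sketch from looping forever.

```python
from typing import Callable, List, Optional, Tuple


def run_workflow(
    prompt: str,
    generate: Callable[[str], str],
    fix: Callable[[str, str], str],
    update_estimates: Callable[[str], str],
    document: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    n_update: int = 1,
    n_document: int = 1,
    max_fixes: int = 5,
) -> Tuple[str, List[str]]:
    """Return the final code and the list of states visited."""
    trace: List[str] = []

    def repair(code: str) -> str:
        # "Fix" runs whenever the code is broken, whichever step produced it.
        attempts = 0
        while (error := check(code)) is not None:
            if attempts >= max_fixes:  # ASSUMPTION: safety cap, not from the source
                raise RuntimeError(f"still failing after {max_fixes} fixes: {error}")
            trace.append("Fix")
            code = fix(code, error)
            attempts += 1
        return code

    trace.append("Generation")
    code = repair(generate(prompt))
    for _ in range(n_update):
        trace.append("Update Estimates")
        code = repair(update_estimates(code))
    for _ in range(n_document):
        trace.append("Document")
        code = repair(document(code))
    return code, trace


# Stub steps so the sketch runs without an LLM.
final_code, trace = run_workflow(
    prompt="estimate x",
    generate=lambda p: "BUG x = 1 to 4",
    fix=lambda code, err: code.replace("BUG ", "", 1),
    update_estimates=lambda code: code + "\ny = x * 2",
    document=lambda code: "// Model for x\n" + code,
    check=lambda code: "parse error" if "BUG" in code else None,
)
print(trace)
```

Setting `n_update` and `n_document` to 0 corresponds to the minimal settings suggested earlier: generation followed only by bug fixing.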

Conclusion

We think Squiggle AI demonstrates the potential for combining LLMs with specialized programming frameworks to make probabilistic modeling more accessible. While the tool has limitations and should be used thoughtfully, early results suggest it can significantly reduce the barrier to entry for cost-effectiveness analysis and fermi estimation. We invite the community to heavily experiment with the tool and provide takes and feedback.


Appendix: Lessons Learned During Development

LLM cost-effectiveness estimates seem tractable

We spent a few months on this scaffolding, and have found the final tool both useful and promising. There’s clearly a lot of useful work to be done in this area, even without fundamental advances in LLMs.

Given the current performance and limitations of Squiggle AI, it’s not clear how much we would expect it to be used now, especially without significant promotion and advertising. But it does seem clear that at the very least, future related tools have significant potential.

Cost-effectiveness estimation can fit into a larger category of “strategic decision-making.” So far, there’s been significantly more work to use LLMs to solve narrow tasks like coding problems, than there’s been to use LLMs to make high-level and general-purpose strategic decisions. While the latter might be difficult, we think it is tractable.

LLMs are willing to estimate more than you might expect, with some prompting

From Ozzie:

In my use with Squiggle AI, I’ve been surprised at how much the LLMs were willing to go with. Claude is known for being overly restrictive against potentially controversial topics, but in my experience, it was often willing to make controversial estimates, with the right prompting. Many humans complain about cost-benefit analyses about sensitive topics, but LLMs are often much more willing.

There have been a few times where I’d ask Claude to tell me which of a few decisions seemed the best, and it would be very reluctant to answer. Then I’d ask it to estimate the impact of each answer and rank them, and it would do this right away.

Specialized Steps Improve Results

LLMs with very long prompts tend to forget or neglect large prompt portions. This makes it difficult to rely on a large prompt or two. Instead, it often seemed better to have hand-crafted prompts for different situations. We did this for the four specific steps listed above so far, but we imagine that there’s a lot of further expansion to do here.

Claude 3.5 Sonnet often only returns a limited amount of text (up to around 200 lines). This means that any one step can only write so much code or suggest so many changes. It therefore can take multiple prompts for more complex things.

It’s difficult to make complex models understandable

As with other kinds of code, there can be a lot going on with estimation using programming. On one hand, all the key numbers and assumptions are organized in code. On the other hand, there’s a lot of information to represent.

Many viewers might not understand programming well, so providing an overview to these people is a challenge. In addition to the code, variables need substantial detailing. Many quantitative estimates can be highly sensitive to specific assumptions and definitions, so these should be carefully specified. Not only should the programs and assumptions be presented—outputs should be expressed as well. These are often multidimensional and might do best with custom visualizations.

Quick understandability isn’t particularly important for model authors, who already have a deep understanding of their work. But in cases where an AI can very quickly write models, it’s more important for users to be able to quickly understand the model inputs, assumptions, and outputs.

We’ve spent a lot of time on the Squiggle Playground and with a specific Style Guide to help address it, but this remains a significant challenge.

Prompt engineering can be a lot of work

A lot of building this tool has been making a big list of common errors that LLMs make when writing Squiggle code, along with their solutions. Most of these were addressed at the prompt level, though we also added a bunch of regular expressions and formal code checks to detect common errors.
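As an illustration of what a regex-level check might look like, here is a small Python sketch. The patterns are hypothetical examples of habits an LLM could carry over from other languages. They are not taken from the Squiggle AI codebase, and a real checker would need a much longer and more carefully tested list.

```python
import re

# HYPOTHETICAL patterns: constructs from other languages that are assumed
# not to be valid Squiggle. Illustrative only.
CHECKS = [
    (re.compile(r"^\s*def\s+\w+\s*\("), "Python-style function definition"),
    (re.compile(r"^\s*return\b"), "explicit return statement"),
    (re.compile(r"^\s*(for|while)\b"), "imperative loop"),
]


def lint(code):
    """Return (line_number, message) pairs for lines matching a known pattern."""
    findings = []
    for number, line in enumerate(code.splitlines(), start=1):
        for pattern, message in CHECKS:
            if pattern.search(line):
                findings.append((number, message))
    return findings


llm_output = "def f(x):\n  return x + 1\nx = 1 to 4"
for number, message in lint(llm_output):
    print(f"line {number}: {message}")
```

Checks like this are cheap to run before executing the code, and their messages could plausibly be passed back to the LLM alongside parser errors.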

We haven’t set up formal evaluation systems yet, in large part because these can be complex and expensive to run. If we had additional time and budget, we’d likely do more work here.

Crossposted from EA Forum (70 points, 11 comments)