Linguistic Price Tags: The Cost of Non-English LLM Prompting

Synopsis

This article examines the extra cost of prompting large language models (LLMs) in non-English languages and explores some of the practical implications. Using back-of-the-envelope calculations, it estimates the overhead incurred when prompting in Simplified Chinese and Hindi, and proposes a three-step prompting scheme to reduce it.

Introduction

While working on an upcoming paper about a multilingual version of the political compass test, I had a realization: there is an extra cost incurred when prompting LLMs in non-English languages. To be clear, this goes beyond the documented differences in reasoning capabilities depending on the language of the prompt (Huang et al., 2024). Instead, it touches upon an often-overlooked aspect: the cost of interacting with LLMs in different languages.

In this article, I present evidence suggesting that it's more expensive to prompt LLMs in Simplified Chinese and Hindi than in English. These languages are not obscure dialects; they're the second and third most spoken languages globally after English, and they're predominant in the world's second and fifth largest economies, respectively. This disparity has implications for the accessibility and equitable distribution of AI technologies worldwide, some of which are discussed in the following sections.

Tokenization and Language

The economics of AI interaction is fundamentally tied to tokenization. When using a cloud-based AI service, costs are typically calculated on a per-token basis, where a token represents a unit of text or code. For those who own their own inference infrastructure, the primary cost lies in the increased electricity consumption. The fewer tokens the user and the model use to express their questions, ideas and concepts, the cheaper it is for the user.

To compare across languages, I define a quantity called token efficiency.

Token Efficiency (TE): the average amount of information contained in a token when that information is expressed in a particular language.

The absolute values hold little meaning[1], while the reciprocal of the ratio between two languages' values gives, on average, the ratio of the number of tokens required to express the same idea, and consequently the ratio of the associated costs. I compute it as

$$TE_L = CF_L \times CT_L$$

where:

  • $CF_L$ is the contraction factor of language $L$

  • $CT_L$ is the average number of characters per token in language $L$

Both quantities are defined below.

Contraction Factor

Languages use different numbers of characters to express the same idea. I define the contraction factor (CF) as the mean ratio of the character length of an English sentence to the character length of its counterpart in the target language[2]. By definition, English has a value of 1. To compute this value, I use the train split from this dataset for Simplified Chinese, and this dataset for Hindi[3].
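As a concrete sketch, the calculation can be reproduced from any parallel corpus. The snippet below is illustrative rather than the exact script used here; it assumes `pairs` is a list of (English, target-language) sentence tuples loaded from such a corpus.

```python
# A minimal sketch of the contraction-factor (CF) calculation.
# `pairs` is assumed to be a list of (english_sentence, target_sentence)
# tuples taken from a parallel corpus.

def contraction_factor(pairs):
    """Mean ratio of English character length to target-language character length."""
    ratios = [len(en) / len(tgt) for en, tgt in pairs if len(tgt) > 0]
    return sum(ratios) / len(ratios)

# Toy example with a single English-Chinese pair:
pairs = [("The weather is nice today.", "今天天气很好。")]
print(contraction_factor(pairs))  # ≈ 3.7 for this one sentence
```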

Characters per Token

This value is the average number of characters represented by a single token. Its variation across languages originates in the byte-pair-encoding (BPE) algorithm used to tokenize text. The BPE algorithm iteratively assigns a new token to the most frequently occurring byte pair. The problem is that 49.4% of text data on the internet is in English, with Chinese and Hindi constituting 1.2% and <0.1% respectively (data from w3techs). This means the most frequently occurring byte pairs will, more likely than not, correspond to English text, limiting the representational capacity of the tokenization scheme for languages with less prevalent byte sequences.
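A quick way to see this bias in practice, assuming the tiktoken library is available, is to count the tokens the same concept consumes in English and Hindi:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")

# The same concept consumes far more tokens in Hindi than in English,
# because BPE merges are dominated by frequent (mostly English) byte pairs.
for text in ["internationalization", "अंतर्राष्ट्रीयकरण"]:
    print(f"{len(enc.encode(text)):2d} tokens: {text}")
```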

Calculating Characters per Token

The value was calculated using the datasets mentioned in the previous section by taking the ratio of a sentence's character count to its token count. For tokenization, I use the “o200k_base” encoding used by gpt-4o and gpt-4o-mini.
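A minimal sketch of this calculation, again using tiktoken (the example sentences are my own, not drawn from the datasets):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by gpt-4o / gpt-4o-mini

def chars_per_token(sentence: str) -> float:
    """Character count divided by token count for a single sentence."""
    return len(sentence) / len(enc.encode(sentence))

for text in ["The weather is nice today.", "今天天气很好。", "आज मौसम अच्छा है।"]:
    print(f"{chars_per_token(text):.2f} chars/token: {text}")
```

Averaging this ratio over all sentences in a dataset gives the CT values reported below.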

Quantifying Discrepancies

Estimating Token Efficiency

| Language | CF (mean) | CF (median) | CT (mean) | CT (median) | TE |
|----------|-----------|-------------|-----------|-------------|------|
| English  | 1[4]      | 1[4]        | 4.688[5]  | 4.653[5]    | 4.688 |
| Mandarin | 3.372     | 3.302       | 1.268     | 1.240       | 4.276 |
| Hindi    | 1.118     | 1.021       | 2.710     | 2.767       | 3.030 |

The table shows the TE calculations; the means and medians are computed over all samples in the train split of the corresponding dataset. Using the TE values, the overhead costs for Chinese and Hindi can be calculated.

Estimating overheads

The overhead is the percentage of additional tokens that would be required to express the same idea:

$$\text{Overhead}_L = \frac{TE_{En}}{TE_L} - 1$$

which comes to roughly 9.6% for Mandarin and 54.7% for Hindi. For instance, if a concept requires 1 million English characters, and we employ OpenAI's premier model, gpt-4o, at its input price of $2.50 per million tokens, the associated prompting expenses would be:

  • English: 1,000,000 / 4.688 ≈ 213,000 tokens, costing ≈ $0.53

  • Mandarin: ≈ 234,000 tokens, costing ≈ $0.58 (+9.6%)

  • Hindi: ≈ 330,000 tokens, costing ≈ $0.83 (+54.7%)
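The same arithmetic in code form, using the TE values from the table (the gpt-4o input price of $2.50 per million tokens is consistent with the o1-preview comparison later in this post):

```python
# Reproduce the overhead arithmetic using the TE values from the table.
TE = {"English": 4.688, "Mandarin": 4.276, "Hindi": 3.030}
PRICE_PER_TOKEN = 2.50 / 1_000_000  # gpt-4o input price in USD per token
ENGLISH_CHARS = 1_000_000           # size of the concept, in English characters

for lang, te in TE.items():
    tokens = ENGLISH_CHARS / te        # per footnote 1, TE is English-equivalent chars per token
    overhead = TE["English"] / te - 1  # extra tokens relative to English
    print(f"{lang:8s} {tokens:>9,.0f} tokens  ${tokens * PRICE_PER_TOKEN:.2f}  (+{overhead:.1%})")
```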

Implications

The calculations show that a discrepancy exists, and it will influence the profitability of emerging businesses built around these LLMs, favoring those that serve English-speaking demographics. Whether this has real-world consequences is hard to say, but it could jeopardize the vision of AI benefiting everyone rather than worsening existing inequalities.

Potential solutions

The trivial solution would be to alter the BPE algorithm so that all languages have the same token efficiency. This is, in my opinion, rather impractical; still, it would be interesting to see how it impacts model performance.

A more pragmatic solution would be for AI companies to change their pricing based on the language of the prompt. This would come at the cost of reduced revenue, but it would bring these companies' products into better alignment with their mission statements.

OpenAI: Our mission is to ensure that artificial general intelligence benefits all of humanity.

Anthropic: To ensure transformative AI helps people and society flourish.

Mistral: Our mission is to make frontier AI ubiquitous, and to provide tailor-made AI to all the builders.

So, what’s the work-around?

Costs of flagship models will continue to rise

Cottier et al. (2024) find that the cost of training frontier models has risen 2.4x every year since 2016. These costs will eventually[6] be passed on to customers. I believe this to be true for two reasons.

  • The competition in this space is intense. Even if additional investment yields diminishing returns, AI companies will keep spending on marginally better models for fear of losing the race to AGI.

  • Specialized models, I believe, are the future. These models will be more expensive than today's general-purpose models, and businesses will have to pay the premium to stay competitive, because within their domains these models will be superior to their generalized counterparts. For example, o1-preview, OpenAI's model specialized for solving problems in science, coding, and math, is priced at $15/1M tokens, six times the price of their general-purpose flagship gpt-4o.

Needless to say, this will affect businesses catering to non-English-speaking demographics more, as shown by the calculations above.

Cost-efficient models are good at translation

Distilled and quantized models perform worse than their full-sized counterparts in terms of reasoning, coding, and math capabilities. However, in my experience, these models perform well on translation tasks, at a fraction of the cost of flagships.

Saving money

Given these two factors:

  • Flagship models (and specialized models) will become more expensive.

  • Cost-efficient models can translate well.

I propose the following method, which businesses catering to non-English-speaking demographics can use to partially bridge the tokenization disadvantage while retaining access to SOTA reasoning capabilities.

  1. Translate prompt from non-English to English using a cheaper model

  2. Send English prompt to expensive model and receive English response

  3. Translate response back to original language

I refer to this as three-step prompting, which, when used in the correct contexts, can reduce the API costs of businesses serving non-English-speaking demographics. In the following sections I derive a formula for the cost saved by using three-step prompting instead of direct prompting.
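A minimal sketch of the scheme, using the OpenAI Python client; the model choices and the translation prompts are illustrative assumptions rather than a recommendation:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()

# Illustrative model choices: a cheap model for the translation steps,
# a flagship model for the reasoning step.
CHEAP, FLAGSHIP = "gpt-4o-mini", "gpt-4o"

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def three_step_prompt(prompt: str, language: str) -> str:
    # Step 1: translate the prompt into English with the cheap model.
    english_prompt = ask(CHEAP, f"Translate the following {language} text to English:\n{prompt}")
    # Step 2: send the English prompt to the expensive model.
    english_answer = ask(FLAGSHIP, english_prompt)
    # Step 3: translate the answer back to the original language.
    return ask(CHEAP, f"Translate the following English text to {language}:\n{english_answer}")
```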

Direct prompting

$$C_{direct} = p_f \, (N_{in} + N_{out})$$

Indirect prompting

$$C_{indirect} = p_c \, (N_{in} + r N_{in}) + p_f \, r \, (N_{in} + N_{out}) + p_c \, (r N_{out} + N_{out}) = (N_{in} + N_{out}) \, \big( p_c (1 + r) + p_f \, r \big)$$

where:

  • $N_{in}$, $N_{out}$ are the token counts of the prompt and the response in the original language

  • $p_f$, $p_c$ are the per-token prices of the flagship and the cheaper translation model

  • $r = TE_L / TE_{En}$ is the factor by which translating into English shrinks the token count ($r \approx 0.91$ for Mandarin, $r \approx 0.65$ for Hindi)

The three terms of $C_{indirect}$ correspond to the three steps: translating the prompt, querying the flagship in English, and translating the response back.

Savings

Taking the difference between the direct and indirect prompting cases:

$$S = C_{direct} - C_{indirect} = (N_{in} + N_{out}) \, \big( p_f (1 - r) - p_c (1 + r) \big)$$

Savings are positive whenever the flagship-to-translator price ratio satisfies $p_f / p_c > (1 + r) / (1 - r)$.
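Plugging in the TE values from the table gives the break-even price ratio for each language; a small check, assuming a single blended per-token price for each model:

```python
# Break-even check for three-step prompting, using the derived condition
# p_f / p_c > (1 + r) / (1 - r), where r = TE_L / TE_En.
TE_EN = 4.688

for lang, te in {"Mandarin": 4.276, "Hindi": 3.030}.items():
    r = te / TE_EN
    breakeven = (1 + r) / (1 - r)
    print(f"{lang}: flagship must be > {breakeven:.1f}x the translator's price")
```

For Hindi the threshold is about 4.7x, well within typical flagship-to-cheap-model price gaps, whereas for Mandarin it is about 21.7x, which is why the method should only be applied after studying the use case.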

Conclusion

In conclusion, this analysis sheds light on the often-overlooked financial implications of using large language models across diverse linguistic contexts. The cost differentials identified between English and non-English prompts highlight the need for strategic approaches in multilingual AI applications.

Three-step prompting offers potential cost savings for businesses using AI in non-English markets. While promising, it’s not a universal solution and should be applied judiciously and only after studying the use-case. Addressing multilingual AI cost disparities is crucial for creating more inclusive and globally accessible technologies. Continued innovation in this area can help distribute AI benefits more equitably across languages.

References

Huang, Zixian, Wenhao Zhu, Gong Cheng, Lei Li, and Fei Yuan. “MindMerger: Efficient Boosting LLM Reasoning in non-English Languages.” arXiv preprint arXiv:2405.17386 (2024).

Liao, Han-Teng, King-wa Fu, and Scott A. Hale. “How much is said in a microblog? A multilingual inquiry based on Weibo and Twitter.” In Proceedings of the ACM Web Science Conference, pp. 1-9. 2015.

Cottier, Ben, Robi Rahman, Loredana Fattorini, Nestor Maslej, and David Owen. “The rising costs of training frontier AI models.” arXiv preprint arXiv:2405.21015 (2024).

  1. ^

    Mathematically, the value shows how many English characters, on average, the tokens would represent if the sentences were translated to English.

  2. ^

    I would expect the ratio to approach a constant value for large datasets, and to depend on the context of the dataset, e.g. news, video transcripts, social media posts, etc. I could not find any research on this; if you find any, let me know!

  3. ^

    Ideally this value should be calculated from application-specific data, but since the point of this article is to demonstrate differences, these public datasets suffice.

  4. ^

    By definition.

  5. ^

    These values are averaged over the English sides of the Chinese and Hindi parallel datasets. The calculated value is in the vicinity of the value reported by OpenAI.

  6. ^