My tentative conclusion: by quantitative metrics, DeepL is in the same league as Google Translate, and might be better by some metrics. That is still an impressive achievement by DeepL, considering that they have orders of magnitude less data and compute, and far fewer researchers, than Google.
Do they though? Google is a large company, certainly, but they might not actually give Google Translate researchers a lot of funding. Google gets revenue from translation by offering it as a cloud service, but I found this thread from 2018 where someone said,
Google Translate and Cloud Translation API are two different products doing some similar functions. It is safe to assume they have different algorithms and differences in translations are not only common but expected.
From this, it appears that there is little incentive for Google to improve the algorithms behind Google Translate itself.
We could try to guesstimate how much money Google spends on R&D for Google Translate.
Judging by these data, Google Translate has 500 million daily users, or 11% of the total Internet population worldwide.
I'm not sure how much it costs Google to run a service of such scale, but I would guesstimate that it's on the order of ~$1 bln / year.
If they spend 90% of the total Translate budget on keeping the service online and 10% on R&D, that leaves ~$100 mln / year for R&D, which is likely much more than the total income of DeepL.
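For concreteness, here is that back-of-envelope arithmetic as a small Python sketch; the ~$1 bln running cost and the 90/10 split are the guesses from above, not known Google figures:

```python
# Back-of-envelope estimate of Google Translate's R&D budget.
# All inputs are guesses from the discussion above, not actual Google figures.

daily_users = 500_000_000          # reported daily users
total_budget_per_year = 1e9        # guessed total cost of running the service, $/year
rnd_share = 0.10                   # guessed fraction of the budget that goes to R&D

rnd_budget = total_budget_per_year * rnd_share
print(f"Implied R&D budget: ~${rnd_budget / 1e6:.0f} mln / year")                 # ~$100 mln / year
print(f"Implied spend per daily user: ~${total_budget_per_year / daily_users:.2f} / user / year")  # ~$2
```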
It is unlikely that Google is spending ~$100 mln / year on something without a very good reason.
One of the possible reasons is training data. Judging by the same source, Google Translate translates 100 billion words per day, which means Google gets roughly that same massive amount of new user-generated training data every day.
The amount is truly massive: thanks to Google Translate, Google gets roughly ten Libraries of Congress worth of new text per year.
And some of the new data is hard to get by other means (e.g. users trying to translate their personal messages from a low-resource language like Basque or Azerbaijani).
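To sanity-check the "Library of Congress times ten" claim, here is a rough Python calculation; the Library of Congress figures (roughly 40 million print items, ~90,000 words per book) are my own loose assumptions for illustration, not numbers from the source:

```python
# Rough sanity check of the "Library of Congress x10 per year" claim.
# The Library of Congress figures below are loose assumptions for illustration only.

words_per_day = 100e9                          # words translated per day (from the cited source)
words_per_year = words_per_day * 365           # ~3.65e13 words / year

loc_print_items = 40e6                         # assumed number of print items in the Library of Congress
words_per_book = 90_000                        # assumed average words per book
loc_words = loc_print_items * words_per_book   # ~3.6e12 words

print(f"Words translated per year: {words_per_year:.2e}")
print(f"Library of Congress, in words (assumed): {loc_words:.2e}")
print(f"Ratio: ~{words_per_year / loc_words:.0f}x")   # roughly 10x
```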
I think your estimate for the operating costs (mostly compute) of running a service with 500 million daily users (probably well under 10 requests per user on average, and pretty short average input) is too high by 4 orders of magnitude, maybe even 5. My main uncertainty is how exactly they're servicing each request (i.e. what % of requests are novel enough to actually need to be run through a more expensive ML model instead of just returning a cached result).
5 billion requests per day translates to ~58k requests per second. You might have trouble standing up a service that handles that level of traffic on a $10k / year compute budget if you buy compute on a public cloud (e.g. AWS/GCP), but those have pretty fat margins, so the direct cost to Google themselves would be a lot lower.
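The arithmetic behind those figures, as a quick Python sketch; the request volume (500 million users at ~10 requests each) is the commenter's assumption, not a measured number:

```python
# Quick check of the requests-per-second and orders-of-magnitude claims above.
# The request volume is an assumption from the comment (500M users x ~10 requests), not a measured number.

daily_users = 500_000_000
requests_per_user = 10                                 # assumed upper bound from the comment
requests_per_day = daily_users * requests_per_user     # 5 billion
requests_per_second = requests_per_day / 86_400        # seconds in a day

print(f"Requests per second: ~{requests_per_second / 1e3:.0f}k")   # ~58k

# "Too high by 4-5 orders of magnitude" relative to the ~$1 bln / year guess:
guessed_cost = 1e9
for orders in (4, 5):
    print(f"{orders} orders of magnitude lower: ~${guessed_cost / 10**orders:,.0f} / year")
    # 4 -> ~$100,000 / year, 5 -> ~$10,000 / year
```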