If only 90% can solve the captcha within one minute, it does not follow that the other 10% are completely unable to solve it and faced with “yet another barrier to living in our modern society”.
It could be that the other 10% just need a longer time period to solve it (which might still be relatively trivial, like needing 2 or 3 minutes) or they may need multiple tries.
If we are talking about someone at the extreme low end of the captcha proficiency distribution, such that the person cannot solve even in half an hour something that 90% of the population can answer in 60 seconds, then I would expect that person to already need assistance with setting up an email account, completing government forms online, etc., so whoever is helping them with that would also help with the captcha.
(I am also assuming that this post is only for vision-based captchas, and blind people would still take a hearing-based alternative.)
If a system like this were widely deployed online using US currency, people outside the US would need to familiarize themselves with US currency if they are not already familiar with it. But they would only need to do this once and then it should be easy to remember for subsequent instances. There are only 6 denominations of US coins in circulation - $0.01, $0.05, $0.10, $0.25, $0.50, and $1.00 - and although there are variations for some of them, they mostly follow a very similar pattern. They also frequently have words on them like “ONE CENT” ($0.01) or “QUARTER DOLLAR” ($0.25) indicating the value, so it should be possible for non-US people to become familiar with those.
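To make the arithmetic concrete, here is a minimal sketch of how a server might check a typed answer against the labeled total for an image of coins. The names `US_COIN_CENTS`, `total_cents`, and `check_answer` are hypothetical, and the sketch assumes each image already comes with a human-made label of how many of each coin it contains; nothing here is part of the original proposal.

```python
from decimal import Decimal, InvalidOperation

# Face values in cents for the six circulating US coin denominations listed above.
# The key names are illustrative assumptions, not an established labeling scheme.
US_COIN_CENTS = {
    "penny": 1,
    "nickel": 5,
    "dime": 10,
    "quarter": 25,
    "half_dollar": 50,
    "dollar": 100,
}


def total_cents(coin_counts):
    """Sum the value, in cents, of a {coin_name: count} mapping."""
    return sum(US_COIN_CENTS[name] * count for name, count in coin_counts.items())


def check_answer(coin_counts, user_answer):
    """Return True if the typed dollar amount equals the image's labeled total."""
    text = user_answer.strip().lstrip("$")
    try:
        # Decimal avoids floating-point rounding surprises with amounts like 0.10.
        answered_cents = Decimal(text) * 100
    except InvalidOperation:
        return False  # Not a number at all, so the attempt fails.
    return answered_cents == total_cents(coin_counts)


# Example label for one image: one quarter, one dime, one nickel, one penny.
example_label = {"quarter": 1, "dime": 1, "nickel": 1, "penny": 1}
print(check_answer(example_label, "$0.41"))
```

The checking step is trivial by design; the difficulty for an attacker is meant to lie in reading the image, not in the sum.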
Alternatively, an easier option could be country-specific captchas, which show a picture like this except with the currency of whatever country the internet user is in. This would only require extra work for VPN users who seek to conceal their location by having the VPN make it look like they are in some other country.
The idea was that the tokens would only be similar to real coins in broad shape and color, but would be different enough from actual legal tender coins that I would expect a human to easily tell the two apart.
Some examples would be:
https://barcade.com/wp-content/uploads/2021/07/BarcadeToken_OPT.png
https://www.pinterest.com/pin/64105994675283502/
I agree that the difficulty of generating a lot of these is the main disadvantage, since you would probably have to take a huge number of real pictures like this, which would be very time-consuming. It is not clear to me that Dall-E or other AI image generators could produce such pictures with enough realism and detail that human users could determine how much money is supposed to be in the fake image (and have many humans all converge on the same answer). You also might get weird artifacts using Dall-E for this, like two corners of the same bill showing different numbers for the bill’s denomination.
But I maintain that, once a large set of such images exists, training a custom machine vision system to solve these would be very difficult. It would require much more work than simply fine-tuning an off-the-shelf vision system to answer the binary question of “Does this image contain a bus?”.
Suppose that, say, a few hundred people worked for several months to create 1,000,000 of these in total and then started deploying them. If you are a malicious AI developer trying to crack this, the mere tasks of compiling a properly labeled data set (or multiple data sets) and deciding how many sub-models to train and how they should cooperate (if you use more than one) are already non-trivial problems that you have to solve just to get started. So I think it would take more than a few days.