One type of question that would be straightforward for humans to answer, but difficult to train a machine learning model to answer reliably, would be to ask “How much money is visible in this picture?” for images like this:
If you have pictures with bills, coins, and non-money objects in random configurations—with many items overlapping and partly occluding each other—it is still fairly easy for humans to pick out what is what from the image.
But to get an AI to do this would be more difficult than a normal image classification problem where you can just fine-tune a vision model with a bunch of task-relevant training cases. It would probably require multiple denomination-specific vision models working together, as well as some robust way for the model to determine where one object ends and another begins.
I would also expect such an AI to be more confounded by any adversarial factors—such as the inclusion of non-money arcade tokens or drawings of coins or colored-in circles—added to the image.
Now, maybe to solve this in under one minute some people would need to start the timer when they already have a calculator in hand (or the captcha screen would need to include an on-screen calculator). But in general, as long as there is not a huge number of coins and bills, I don’t think this type of captcha would take the average person more than, say, 3-4 times longer than it takes them to complete the “select all squares with traffic lights” type captchas in use now. (Though some may want to familiarize themselves with the various $1.00 and $0.50 coins that exist and some of the variations of the tails sides of quarters if this becomes the new prove-you-are-a-human method.)
I can see the numbers on the notes and infer that they denote United States Dollars, but have zero idea of what the coins are worth. I would expect that anyone outside the United States would have to look up every coin type and so take very much more than 3-4 times longer than clicking images with boats, especially if the coins have multiple variations.
If the image additionally included coin-like tokens, it would be a nontrivial research project (on the order of an hour) to verify that each such object is in fact not any form of legal tender, past or present, in the United States.
Even if all the above were solved, you still need such images to be easily generated in a manner that any human can solve them fairly quickly but a machine vision system custom trained to solve this type of problem, based on at least thousands of different examples, can’t. This is much harder than it sounds.
If a system like this were widely deployed online using US currency, people outside the US would need to familiarize themselves with US currency if they are not already familiar with it. But they would only need to do this once and then it should be easy to remember for subsequent instances. There are only 6 denominations of US coins in circulation - $0.01, $0.05, $0.10, $0.25, $0.50, and $1.00 - and although there are variations for some of them, they mostly follow a very similar pattern. They also frequently have words on them like “ONE CENT” ($0.01) or “QUARTER DOLLAR” ($0.25) indicating the value, so it should be possible for non-US people to become familiar with those.
Alternatively, an easier option could be using country-specific captchas which show a picture like this except with the currency of whatever country the internet user is in. This would only require extra work for VPN users who seek to conceal their location by having the VPN make it look like they are in some other country.
The idea was that the tokens would only be similar in broad shape and color, but would be different enough from actual legal tender coins that I would expect a human to easily tell the two apart.
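As a rough illustration of the arithmetic a user (or an answer checker) would do with the six US coin denominations listed above, here is a minimal Python sketch. The item names, the small set of bills, and the helper functions `total_cents` and `answer_matches` are all made up for this example; it assumes each captcha image comes with a list of the items that are actually in it.

```python
from decimal import Decimal, InvalidOperation

# Values in cents for the six circulating US coin denominations named above.
COIN_CENTS = {
    "penny": 1, "nickel": 5, "dime": 10,
    "quarter": 25, "half_dollar": 50, "dollar_coin": 100,
}
# Hypothetical subset of bills, also in cents (an assumption for this sketch).
BILL_CENTS = {"bill_1": 100, "bill_5": 500, "bill_10": 1000, "bill_20": 2000}
VALUES = {**COIN_CENTS, **BILL_CENTS}


def total_cents(items):
    """Sum the value of every item; unknown items (tokens, drawings) count as 0."""
    return sum(VALUES.get(name, 0) for name in items)


def answer_matches(user_text, items):
    """Compare a typed answer such as '$5.60' against the image's true total."""
    try:
        claimed = Decimal(user_text.strip().lstrip("$"))
    except InvalidOperation:
        return False
    return claimed * 100 == total_cents(items)
```

Working in integer cents avoids floating-point rounding when many coins are added up, and treating unknown names as worth zero is one simple way to handle arcade tokens and other decoys.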
Some examples would be:
https://barcade.com/wp-content/uploads/2021/07/BarcadeToken_OPT.png
https://www.pinterest.com/pin/64105994675283502/
I agree that the difficulty of generating a lot of these is the main disadvantage, as you would probably have to just take a huge number of real pictures like this, which would be very time-consuming. It is not clear to me that Dall-E or other AI image generators could produce such pictures with enough realism and detail that it would be possible for human users to determine how much money is supposed to be in the fake image (and have many humans all converge to the same answer). You also might get weird things using Dall-E for this, like two corners of the same bill having different numbers indicating the bill’s denomination.
But I maintain that, once a large set of such images exists, training a custom machine vision system to solve these would be very difficult. It would require much more work than simply fine-tuning an off-the-shelf vision system to answer the binary question of “Does this image contain a bus?”.
Suppose that, say, a few hundred people worked for several months to create 1,000,000 of these in total and then started deploying them. If you are a malicious AI developer trying to crack this, the mere tasks of compiling a properly labeled data set (or multiple data sets) and deciding how many sub-models to train and how they should cooperate (if you use more than one) are already non-trivial problems that you have to solve just to get started. So I think it would take more than a few days.
I think this idea is really brilliant, and it seems quite promising that it could work. It requires the image AI to understand the entire image, since it is hard to divide the image up into one frame per bill or coin. And it can’t easily use the intelligence of LLMs.
To aid the user, there could be a clear picture of each coin and its worth on the side. That way we could even include made-up coins, which could further trick the AI.
All this could be combined with traditional image obfuscation techniques (like distorting the images).
I’m not entirely sure how to generate images of money efficiently. Dall-E couldn’t really do it well in the test I ran. Stable Diffusion would probably do better though.
If we create a few thousand real-world images of money though, it might be possible to combine them, obfuscate them, and delete parts of them in order to make several million different images. For example, one bill could be taken from one image, and then a bill from another image could be placed on top of it, and so on.
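The combining step could be sketched roughly as follows. This is only a layout planner under stated assumptions, not an image pipeline: it treats each cut-out bill or coin as a rectangle, stacks them in order, and rejects placements that would hide too much of an item underneath, so that a human can still count everything. The names `Placed`, `visible_fraction`, `compose`, and the 40% visibility threshold are all invented for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Placed:
    name: str
    cents: int
    x: int
    y: int
    w: int
    h: int


def covers(p, px, py):
    """True if the point (px, py) lies inside the rectangle of p."""
    return p.x <= px < p.x + p.w and p.y <= py < p.y + p.h


def visible_fraction(layout, index, step=4):
    """Estimate, on a coarse grid, how much of an item is not hidden by later items."""
    item, above = layout[index], layout[index + 1:]
    total = seen = 0
    for px in range(item.x, item.x + item.w, step):
        for py in range(item.y, item.y + item.h, step):
            total += 1
            seen += not any(covers(a, px, py) for a in above)
    return seen / total


def compose(crops, canvas=(640, 480), min_visible=0.4, rng=None, max_tries=50):
    """Place (name, cents, width, height) crops; return the layout and its total in cents."""
    rng = rng or random.Random()
    layout = []
    for name, cents, w, h in crops:
        for _ in range(max_tries):
            cand = Placed(name, cents, rng.randint(0, canvas[0] - w),
                          rng.randint(0, canvas[1] - h), w, h)
            trial = layout + [cand]
            # The new item sits on top, so only the items beneath it need rechecking.
            if all(visible_fraction(trial, i) >= min_visible
                   for i in range(len(trial) - 1)):
                layout = trial
                break
    return layout, sum(p.cents for p in layout)
```

An item that cannot be placed within `max_tries` attempts is simply dropped, and the returned total only counts items that made it into the layout, so the label always matches what would be pasted into the final image.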
A user aid showing clear pictures of all available legal tender coins is a very good idea. It avoids problems with more obscure coins which may have been issued in only a single year, so the user is not sitting there thinking “wait a second, did they actually issue a Ulysses S. Grant coin at some point or is that just there to fool the bots?”.
I agree that efficient generation of these types of images is the main difficulty and probable bottleneck to deploying something like this if websites try to do so. Taking a large number of such pictures in real life would be time consuming. If you could speed up the process by automated image generation or automated creation of synthetic images by copying and pasting bills or notes between real images, that would be very useful. But doing that while preserving photo-realism and clarity to human users of how much money is in the image would be tricky.
Perhaps an advanced game engine could be used to create lots of simulations of piles of money. Suppose 100 3D objects are created: 5 coins and 3 bills with 10 variations each (folded and so on), plus some fake money and other objects. These could then be randomly arranged into configurations. Further, it would then be possible to make videos instead of pictures, which would make it even harder for AIs to classify. For example, imagine the camera changing its angle on a table, with a minimum of two angles needed to see all the bills.
I don’t think the photos/videos need to be super realistic. We can add different types of distortions to make it harder for the AI to find patterns.
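To make the game-engine idea a bit more concrete, here is a minimal Python sketch of the scene-sampling step only, with no rendering. It assumes a catalogue of 100 models as described above (8 money types with 10 variations each, plus 20 decoys); the specific denominations, the field names, and the functions `build_catalogue` and `sample_scene` are made up for this example, and a real engine would consume the resulting description in whatever format it requires.

```python
import random

# Hypothetical choice of 5 coins and 3 bills, with values in cents.
MONEY = {
    "penny": 1, "nickel": 5, "dime": 10, "quarter": 25, "half_dollar": 50,
    "bill_1": 100, "bill_5": 500, "bill_10": 1000,
}


def build_catalogue(variations=10, decoys=20):
    """8 money types x 10 variations = 80 models, plus 20 worthless decoys = 100."""
    catalogue = [(f"{name}_v{v}", cents)
                 for name, cents in MONEY.items() for v in range(variations)]
    catalogue += [(f"decoy_{i}", 0) for i in range(decoys)]
    return catalogue


def sample_scene(catalogue, n_objects, rng):
    """Pick random models, scatter them on a table, and choose two camera angles."""
    picks = [rng.choice(catalogue) for _ in range(n_objects)]
    objects = [{"model": model, "cents": cents,
                "x": rng.uniform(-0.5, 0.5), "y": rng.uniform(-0.5, 0.5),
                "yaw_deg": rng.uniform(0, 360)}
               for model, cents in picks]
    # Two distinct viewpoints, matching the idea that one angle is not enough.
    cameras = [{"azimuth_deg": a, "elevation_deg": rng.uniform(30, 70)}
               for a in rng.sample(range(0, 360, 30), 2)]
    return {"objects": objects, "cameras": cameras,
            "answer_cents": sum(cents for _, cents in picks)}
```

Because the answer is computed from the sampled objects rather than read off the rendered frames, every generated scene comes with a correct label for free, which is the main advantage over photographing real piles of money.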