Speaking as a human engineer who has done applied classical (i.e. non-AI, you write the algorithms yourself) computer vision: that’s not a good lower bound.
Image processing was a thing before computers were fast. Here’s a 1985 paper talking about tomato sorting. On hardware like that, anything involving a kernel applied over the entire image is way too slow, so all the algorithms are pixel level.
Note that this is a fairly easy problem, if only because once you know what you’re looking for, it’s pretty easy to find: the court is not too noisy.
An O(N) algorithm is iffy at these speeds. Applying a 3x3 kernel to the image won’t work.
So let’s cut down on the amount of work to do. Look at only 1 out of every 16 pixels to start with. Here’s an 80x60 pixel image formed by sampling one pixel in every 4x4 square of the original.
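To make the subsampling concrete, here is a minimal Python sketch. It assumes the frame is a plain list of rows of grayscale values and that the original is 320x240 (inferred from 80x60 times 4 in each direction). The name `subsample` is made up for illustration.

```python
def subsample(image, step=4):
    """Keep one pixel (the top-left one) from every step x step square."""
    return [row[::step] for row in image[::step]]


# Assumed 320x240 frame, filled with dummy values for the demo.
WIDTH, HEIGHT = 320, 240
frame = [[(x + y) % 256 for x in range(WIDTH)] for y in range(HEIGHT)]

small = subsample(frame)
full_pixels = WIDTH * HEIGHT
small_pixels = len(small) * len(small[0])
print(len(small[0]), len(small))    # 80 60
print(full_pixels // small_pixels)  # 16
```

Nothing is averaged or filtered here. The other 15 pixels in each square are simply never read, which is the whole point on a slow CPU.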
The closer player is easy to identify. Remember that we still have all the original image pixels. If there’s a potentially interesting feature (like the player further away), we can look at some of the pixels we’re ignoring to double check.
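The double check could look something like the sketch below. It is an illustration only: the brightness threshold, the `min_hits` value, and the function names are all assumptions, and a real system would use whatever "interesting" test fits the scene.

```python
def coarse_candidates(image, step, is_interesting):
    """Scan one pixel per step x step square; yield full-res coordinates of hits."""
    for y in range(0, len(image), step):
        for x in range(0, len(image[0]), step):
            if is_interesting(image[y][x]):
                yield x, y


def confirm(image, x, y, step, is_interesting, min_hits):
    """Double check a hit by reading the skipped pixels in its step x step square."""
    hits = 0
    for yy in range(y, min(y + step, len(image))):
        for xx in range(x, min(x + step, len(image[0]))):
            if is_interesting(image[yy][xx]):
                hits += 1
    return hits >= min_hits


def bright(value):
    return value > 128  # arbitrary threshold for the demo


# 16x16 dark image with one noisy pixel and one real 4x4 blob.
image = [[0] * 16 for _ in range(16)]
image[0][0] = 255
for yy in range(4, 8):
    for xx in range(8, 12):
        image[yy][xx] = 255

candidates = list(coarse_candidates(image, 4, bright))
confirmed = [(x, y) for x, y in candidates
             if confirm(image, x, y, 4, bright, min_hits=8)]
print(candidates)  # [(0, 0), (8, 4)]
print(confirmed)   # [(8, 4)]
```

The extra pixels are only read around the handful of coarse hits, so the cost stays close to that of the coarse scan.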
We have 3 images, and if we can’t do some type of clever reduction after the first one, we’ll have to spend 1.1 seconds on each of the others as well.
Cropping is very simple: once you find the player that’s serving, focus on that rectangle in later images. I’ve done exactly this to take CV code from 8 FPS at 100% CPU to 30 FPS at 5% CPU. Once you know where a thing is, tracking it from frame to frame is much easier.
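A toy version of that crop-and-track idea, as a sketch rather than the code I actually used. "Brightest pixel" stands in for a real detector, `margin` is an arbitrary number, and it assumes the target never moves more than `margin` pixels between frames.

```python
def find_brightest(image, x0, y0, x1, y1):
    """Return (x, y) of the brightest pixel inside the half-open rectangle."""
    best = (x0, y0)
    for y in range(y0, y1):
        for x in range(x0, x1):
            if image[y][x] > image[best[1]][best[0]]:
                best = (x, y)
    return best


def track(frames, margin=8):
    """Full scan on the first frame, then search only near the last position."""
    height, width = len(frames[0]), len(frames[0][0])
    positions, pixels_read = [], []
    x0, y0, x1, y1 = 0, 0, width, height
    for frame in frames:
        x, y = find_brightest(frame, x0, y0, x1, y1)
        positions.append((x, y))
        pixels_read.append((x1 - x0) * (y1 - y0))
        # Crop the next search to a window around where the target was just seen.
        x0, x1 = max(0, x - margin), min(width, x + margin + 1)
        y0, y1 = max(0, y - margin), min(height, y + margin + 1)
    return positions, pixels_read


def make_frame(bx, by, width=80, height=60):
    """Dark frame with a single bright 'ball' pixel."""
    frame = [[0] * width for _ in range(height)]
    frame[by][bx] = 255
    return frame


frames = [make_frame(10, 20), make_frame(13, 22), make_frame(16, 24)]
positions, pixels_read = track(frames)
print(positions)    # [(10, 20), (13, 22), (16, 24)]
print(pixels_read)  # [4800, 289, 289]
```

After the first frame the search touches 289 pixels instead of 4800. If the target is lost, fall back to a full scan.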
Concretely, the computer needs to:
1. locate the player serving and their hands/ball (requires looking at whole image)
2. track the player’s arm/hand movements pre-serve
3. track the ball and racket during toss into the air
4. track the ball after impact with the racket
5. continue ball tracking
Only step 1 requires looking at the whole image. And there, only to get an idea of what’s around you. Once the player is identified, crop to them and maintain focus. If the camera/robot is mobile, also glance at fixed landmarks (court lines, net posts/net/fences) to do position tracking.
Assume the 286 is interfacing with a modern high-resolution image sensor which can do downscaling (i.e. you can ask it to average 2x2, 4x4, 8x8 etc. blocks of pixels) and windowing (you can ask for a rectangular chunk of the image to be read out). This gets you closer to what the brain is working with (a small high-resolution patch in the center of the visual field plus low-res peripheral vision, on a moveable eyeball).
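Here is a software emulation of those two sensor features, to show how few pixels the CPU ends up handling. On a real sensor this happens in hardware before readout. The function names, the 64x48 sensor size, and the window position are all made up for the example.

```python
def bin_average(image, block):
    """Average each block x block square, like on-sensor downscaling."""
    out = []
    for y in range(0, len(image) - block + 1, block):
        row = []
        for x in range(0, len(image[0]) - block + 1, block):
            total = sum(image[yy][xx]
                        for yy in range(y, y + block)
                        for xx in range(x, x + block))
            row.append(total // (block * block))
        out.append(row)
    return out


def window(image, x, y, width, height):
    """Read out only a rectangular chunk, like sensor windowing."""
    return [row[x:x + width] for row in image[y:y + height]]


# Hypothetical 64x48 sensor where each pixel's value equals its x coordinate.
sensor = [[x for x in range(64)] for _ in range(48)]

peripheral = bin_average(sensor, 8)      # low-res view of everything
fovea = window(sensor, 24, 16, 16, 16)   # full-res patch where we are "looking"

pixels_handled = (len(peripheral) * len(peripheral[0])
                  + len(fovea) * len(fovea[0]))
print(len(peripheral[0]), len(peripheral))  # 8 6
print(pixels_handled, 64 * 48)              # 304 3072
```

The CPU sees 304 pixels instead of 3072, and moving the window is the equivalent of moving the eyeball.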
Conditional computation is still common in low-end computer vision systems. Face detection is a good example (see “OpenCV Face Detection: Visualized”). You can imagine that once you know where the face is in one frame, tracking it to the next frame will be much easier.
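The general shape of conditional computation is "run the cheap test first, and only spend more on what survives". The sketch below is a generic early-exit cascade, not OpenCV's implementation. The three stages and their thresholds are invented for the demo.

```python
def run_cascade(windows, stages):
    """Run cheap stages first; stop on a window as soon as one stage rejects it."""
    accepted, stage_calls = [], 0
    for w in windows:
        for stage in stages:
            stage_calls += 1
            if not stage(w):
                break
        else:
            accepted.append(w)
    return accepted, stage_calls


# Invented stages, ordered cheapest and most selective first.
def bright_enough(w):
    return sum(w) / len(w) > 50


def has_contrast(w):
    return max(w) - min(w) > 100


def has_pattern(w):
    return w[0] < w[1]


dark = [0] * 16        # rejected by the first stage
flat = [200] * 16      # rejected by the second stage
target = [0, 255] * 8  # passes all three

windows = [dark] * 90 + [flat] * 8 + [target] * 2
stages = [bright_enough, has_contrast, has_pattern]

accepted, stage_calls = run_cascade(windows, stages)
print(len(accepted))                         # 2
print(stage_calls, len(windows) * len(stages))  # 112 300
```

Most windows are boring and get thrown out after one cheap check, so the total work is far below running every stage everywhere.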
Now maybe you’re thinking: “That’s on me, I set the bar too low.”
Well, human vision is pretty terrible. Resolution of the fovea is good, but that’s about a 1 degree circle in your field of vision. Move past 5° and that’s peripheral vision, which is crap. Humans don’t really see their full environment.
A current practical application of this is cutting down on graphics quality in VR headsets using eye tracking. More accurate and faster tracking allows more aggressive cuts to total pixels rendered.
You’ve probably seen this guy? Most people don’t see him the first time because they focus on the ball.
This is also why Where’s Waldo is hard for humans.