“you’re jumping to the conclusion that you can reliably differentiate between...”
I think you absolutely can, and the idea was already described earlier.
You pay attention to regularities in the data. In most non-random images, pixels near each other are similar. In an MxN image, the pixel below a[i] is a[i+M], whereas in an NxM image, it’s a[i+N]. If, across the whole image, the difference between a[i] and a[i+M] is smaller than the difference between a[i] and a[i+N], it’s more likely an MxN image. I expect you could find the resolution by searching all possible resolutions from 1x<length> to <length>x1, and finding which minimizes the average distance between “adjacent” pixels.
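A minimal sketch of that search, assuming the data is already known to be one grayscale byte per pixel with no header (the function name here is made up for illustration):

```python
# Sketch: guess an image's width by minimizing the average difference
# between vertically adjacent pixels. Assumes `data` is a flat, row-major
# sequence of one-byte grayscale values with no header or padding.

def guess_width(data):
    n = len(data)
    best_width, best_score = None, float("inf")
    for width in range(1, n):
        if n % width != 0:
            continue  # only widths that tile the data into whole rows
        # Average |pixel - pixel directly below| over the whole image.
        diffs = [abs(data[i] - data[i + width]) for i in range(n - width)]
        score = sum(diffs) / len(diffs)
        if score < best_score:
            best_width, best_score = width, score
    return best_width
```

On a 100x75 photo this should prefer width 100 over width 75, because the wrong width pairs up pixels that are not actually vertical neighbours.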
Similarly (though you’d likely do this first), you can tell the difference between RGB and RGBA. If you have (255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0), this is probably 4 red pixels in RGB, and not a fully opaque red pixel, followed by a fully transparent green pixel, followed by a fully transparent blue pixel in RGBA. It could be 2 pixels that are mostly red and slightly green in 16 bit RGB, though. Not sure how you could piece that out.
Aliens would probably do a different encoding. We don’t know what the rods and cones in their eye-equivalents are, and maybe they respond to different colors. Maybe it’s not Red Green Blue, but instead Purple Chartreuse Infrared. I’m not sure this matters. It just means your eggplants look red.
I think, even if it had 5 (or 6, or 20) channels, this regularity would be borne out, with the difference between value i and value i+5 being smaller than the difference between value i and values i+1 through i+4.
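A sketch of that stride idea, assuming one byte per channel (names are again made up, and this is a heuristic, not a guarantee):

```python
# Sketch: guess how many one-byte channels each pixel has by finding the
# byte stride k at which values are most similar. With k channels, byte i
# and byte i+k are the same channel of horizontally adjacent pixels, so
# their average difference should be smaller than at other strides.

def guess_channel_count(data, max_channels=8):
    best_k, best_score = None, float("inf")
    for k in range(1, max_channels + 1):
        diffs = [abs(data[i] - data[i + k]) for i in range(len(data) - k)]
        score = sum(diffs) / len(diffs)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

On the (255, 0, 0, 255, 0, 0, …) example above, stride 3 gives an average difference of zero, so the heuristic prefers RGB over RGBA; it cannot, by itself, rule out the 16-bit-per-channel reading mentioned above, since that repeats with stride 6 and also scores zero.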
Now, there’s still a lot that this doesn’t get you yet. But given that there are ways to figure those out, I kinda think I should have a decent expectation that there are ways to figure out other things, too, even if I don’t know them.
I do also think it’s important to zoom out to the original point. Eliezer posed this as an idea about AGI. We currently sometimes feed images to our AIs, and when we do, we feed them as raw RGB data, not in a compressed encoding, because we know that encoding would make it harder for the AI to figure out. I think it would be very weird, if we were trying to train an AI, to send it compressed video, and much more likely that we do, in fact, send it raw RGB values frame by frame.
I will also say that the original claim (by Eliezer, not the top of this thread), was not physics from one frame, but physics from like, 3 frames, so you get motion, and acceleration. 4 frames gets you to third derivatives, which, in our world, don’t matter that much. Having multiple frames also aids in ideas like the 3d → 2d projection, since motion and occlusion are hints at that.
And I think the whole question is “does this image look reasonable”, which, you’re right, is not a rigorous mathematical concept. But “‘looks reasonable’ is not a rigorous concept” doesn’t get followed by “and therefore it is impossible.” Above are some of the mathematical descriptions of what ‘reasonable’ means in certain contexts. Rendering a 100x75 image as 75x100 will not “look reasonable”. But it’s not beyond thinking and math to determine what you mean by that.
The core problem remains computational complexity.
Statements like “does this image look reasonable” or saying “you pay attention to regularities in the data”, or “find the resolution by searching all possible resolutions” are all hiding high computational costs behind short English descriptions.
Let’s consider the case of a 1280x720 pixel image. That’s the same as 921600 pixels.
How many bytes is that?
It depends. How many bytes per pixel?[1] In my post, I explained there could be 1-byte-per-pixel grayscale, or perhaps 3-bytes-per-pixel RGB using [0, 255] values for each color channel, or maybe 6-bytes-per-pixel with [0, 65535] values for each color channel, or maybe something like 4-bytes-per-pixel because we have 1-byte RGB channels and a 1-byte alpha channel.
Let’s assume that a reasonable cutoff for how many bytes per pixel an encoding could be using is, say, 8 bytes per pixel, or a hypothetical 64-bit color depth.
How many ways can we divide this between channels?
If we assume 3 channels, it’s 1953. If we assume 4 channels, it’s 39711. Also if it turns out to be 5 channels, it’s 595665.
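For reference, those counts correspond to the number of ways to split 64 bits into k ordered channels of at least one bit each, i.e. C(63, k−1); a quick check:

```python
from math import comb

# Ways to split a 64-bit pixel into k channels of at least 1 bit each:
# compositions of 64 into k positive parts = C(63, k - 1).
for k in (3, 4, 5):
    print(k, comb(63, k - 1))
# prints: 3 1953, 4 39711, 5 595665
```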
This is a pretty fast growing function. [Plot not reproduced: the red line is O(2^N), and the black line barely visible at the bottom is O(N^2).] N^2 is a notorious runtime complexity because it’s right on the threshold of what is generally unacceptable performance.[2]
Let’s hope that this file isn’t actually a frame buffer from a graphics card with 32 bits per channel, or 128 bits per pixel (16 bytes per pixel).
Unfortunately, we still need to repeat this calculation for all of the possibilities for how many bits per pixel this image could be. We need to add in the possibility that it is 63 bits per pixel, or 62 bits per pixel, or 61 bits per pixel.
In case anyone wants to claim this is unreasonable, it’s not impossible to have image formats that have RGBA data, but only 1 bit associated with the alpha data for each pixel.[3]
And for each of these scenarios, we need to question how many channels of color data there are.
1? Grayscale.
2? Grayscale, with an alpha channel maybe?
3? RGB, probably, or something like HSV.
4? RGBA, or maybe it’s the RGBG layout I described for a RAW encoding of a Bayer filter, or maybe it’s CMYK for printing.
5? This is getting weird, but it’s not impossible. We could be encoding additional metadata into each pixel, e.g. distance from the camera.
6? Actually, this question of how many channels there are is very important, given the fast growing function above.
7? This one question, if we don’t know the right answer, is sufficient to make this algorithm pretty much impossible to run.
8? When we say we can try all of the options, that’s not actually possible.
9? What I think people mean is that we can use heuristics to pick the likely options first and try them, and then fall back to more esoteric options if the initial results don’t make sense.
10? That’s the difference between average run-time and worst case run-time.
11? The point that I am trying to make is that the worst case run-time for decoding an arbitrary binary file is pretty much unbounded, because there’s a ridiculous amount of choice possible.
12? Some examples of “image” formats that have large numbers of channels per “pixel” are things like RADAR / LIDAR sensors, e.g. it’s possible to have 5 channels per pixel for defining 3D coordinates (relative to the sensor), range, and intensity.
You actually ran into this problem yourself.
Similarly (though you’d likely do this first), you can tell the difference between RGB and RGBA. If you have (255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0), this is probably 4 red pixels in RGB, and not a fully opaque red pixel, followed by a fully transparent green pixel, followed by a fully transparent blue pixel in RGBA. It could be 2 pixels that are mostly red and slightly green in 16 bit RGB, though. Not sure how you could piece that out.
Summing up all of the possibilities above is left as an exercise for the reader, and we’ll call that sum K.
Without loss of generality, let’s say our image was encoded as 3 bytes per pixel divided between 3 RGB color channels of 1 byte each.
Our 1280x720 image is actually 2764800 bytes as a binary file.
But since we’re decoding it from the other side, and we don’t know it’s 1280x720, when we’re staring at this pile of 2764800 bytes, we need to first assume how many bytes per pixel it is, so that we can divide the total bytes by the bytes per pixel to calculate the number of pixels.
Then, we need to test each possible resolution, as you’ve suggested.
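A sketch of enumerating those candidate resolutions for a given pixel count (assuming, for the moment, that the whole file is pixel data):

```python
# Sketch: every candidate resolution of an image with num_pixels pixels is a
# divisor pair (width, height) with width * height == num_pixels.

def candidate_resolutions(num_pixels):
    out = []
    for w in range(1, int(num_pixels ** 0.5) + 1):
        if num_pixels % w == 0:
            h = num_pixels // w
            out.append((w, h))
            if h != w:
                out.append((h, w))
    return out

# A 1280x720 frame has 921600 = 2^12 * 3^2 * 5^2 pixels, which has
# 13 * 3 * 3 = 117 divisors, so 117 candidate resolutions for this one
# choice of pixel format.
```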
The number of possible resolutions is the same as the number of divisors of the number of pixels. An upper bound is roughly exp(log(N)/log(log(N)))[4], but the average number of divisors is approximately log(N).
Oops, no it isn’t!
Files have headers! How large is the header? For a bitmap, it’s anywhere between 26 and 138 bytes. The JPEG header is at least 2 bytes. PNG uses 8 bytes. GIF uses at least 14 bytes.
Now we need to make the following choices:
Guess at how many bytes per pixel the data is.
Guess at the length of the header. (maybe it’s 0, there is no header!)
Calculate the factorization of the remaining bytes N for the different possible resolutions.
Hope that there isn’t a footer, checksum, or any type of other metadata hanging out in the sea of bytes. This is common too!
Once we’ve made our choices above, then we multiply that by log(N) for the number of resolutions to test, and then we’ll apply the suggested metric. Remember that when considering the different pixel formats and ways the color channel data could be represented, the number was K, and that’s what we’re multiplying by log(N).
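Putting those choices together, a hedged sketch of the outer search (all names hypothetical; `candidate_resolutions` is the helper sketched earlier, and every yielded tuple would still have to be scored under each of the K pixel formats):

```python
# Sketch: every combination of header length, bytes per pixel, and candidate
# resolution is a separate decoding hypothesis. The K pixel-format choices
# multiply the count again.

def enumerate_hypotheses(blob, max_header=256, max_bytes_per_pixel=8):
    for header_len in range(0, max_header + 1):
        body = len(blob) - header_len
        if body <= 0:
            break
        for bpp in range(1, max_bytes_per_pixel + 1):
            if body % bpp != 0:
                continue
            for width, height in candidate_resolutions(body // bpp):
                yield (header_len, bpp, width, height)
```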
In most non-random images, pixels near each other are similar. In an MxN image, the pixel below a[i] is a[i+M], whereas in an NxM image, it’s a[i+N]. If, across the whole image, the difference between a[i] and a[i+M] is smaller than the difference between a[i] and a[i+N], it’s more likely an MxN image. I expect you could find the resolution by searching all possible resolutions from 1x<length> to <length>x1, and finding which minimizes the average distance between “adjacent” pixels.
What you’re describing here is actually similar to a common metric used in algorithms for automatically focusing cameras by calculating the contrast of an image, except for focusing you want to maximize contrast instead of minimize it.
The interesting problem with this metric is that it’s basically a one-way function. For a given image, you can compute this metric. However, minimizing this metric is not the same as knowing that you’ve decoded the image correctly. It says you’ve found a decoding that minimizes the metric; it does not mean that decoding is the correct one.
A trivial proof:
Consider an image and the reversal of that image along the horizontal axis.
These have the same metric.
So the same metric can yield two different images.
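A toy check of that claim, using mean absolute difference between vertically adjacent pixels as the metric:

```python
# Toy check: flipping an image preserves the adjacent-pixel-difference
# metric, so the metric alone cannot tell the two decodings apart.
img = [[10, 12, 250],
       [11, 13, 251]]       # a 3-wide, 2-tall grayscale image
flipped = img[::-1]         # mirrored across the horizontal axis

def metric(rows):
    diffs = [abs(rows[r][c] - rows[r + 1][c])
             for r in range(len(rows) - 1)
             for c in range(len(rows[0]))]
    return sum(diffs) / len(diffs)

assert metric(img) == metric(flipped)   # same score, different images
```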
A slightly less trivial proof:
For a given “image” of N bytes of image data, there are 2^(N*8) possible bit patterns.
Assuming the metric is calculated as an 8-byte IEEE 754 double, there are only 2^(8*8) possible bit patterns.
When N > 8, there are more bit patterns than values allowed in a double, so multiple images need to map to the same metric.
The difference between our 2^(2764800*8) image space and the 2^64 metric is, uhhh, 10^(10^6.8). Imagine 10^(10^6.8) pigeons. What a mess.[5]
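The arithmetic behind that comparison (only the exponent matters at this scale):

```python
from math import log10

image_bits = 2764800 * 8       # 22,118,400 bits of possible image data
metric_bits = 64               # one IEEE 754 double
# log10 of the ratio 2^image_bits / 2^metric_bits:
exponent = (image_bits - metric_bits) * log10(2)
print(round(exponent))            # about 6.66 million, i.e. roughly 10^(10^6.8)
print(round(log10(exponent), 2))  # about 6.82
```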
The metric cannot work as described. There will be various arbitrary interpretations of the data possible to minimize this metric, and almost all of those will result in images that are definitely not the image that was actually encoded, but did minimize the metric. There is no reliable way to do this because it isn’t possible. When you have a pile of data, and you want to reverse meaning from it, there is not one “correct” message that you can divine from it.[6] See also: numerology, for an example that doesn’t involve binary file encodings.
Even pretending that this metric did work, what’s the time complexity of it? We have to check each pixel, so it’s O(N). There’s a constant factor for each pixel computation. How large is that constant? Let’s pretend it’s small and ignore it.
So now we’ve got K*O(N*log(N)) which is the time complexity of lots of useful algorithms, but we’ve got that awkward constant K in the front. Remember that the constant K reflects the number of choices for different bits per pixel, bits per channel, and the number of channels of data per pixel. Unfortunately, that constant is the one that was growing at a rate best described as “absurd”. That constant is the actual definition of what it means to have no priors. When I said “you can generate arbitrarily many hypotheses, but if you don’t control what data you receive, and there’s no interaction possible, then you can’t rule out hypotheses”, what I’m describing is this constant.
I think it would be very weird, if we were trying to train an AI, to send it compressed video, and much more likely that we do, in fact, send it raw RGB values frame by frame.
What I care about is the difference between:
1. Things that are computable.
2. Things that are computable efficiently.
These sets are not the same. Capabilities of a superintelligent AGI lie only in the second set, not the first.
It is important to understand that a superintelligent AGI is not brute forcing this in the way that has been repeatedly described in this thread. Instead, the superintelligent AGI is going to use a bunch of heuristics or knowledge about the provenance of the binary file, combined with access to the internet so that it can just look up the various headers and features of common image formats, and it’ll go through and check all of those; and then, if it isn’t any of the usual suspects, it’ll throw up metaphorical hands and concede defeat. Or, to quote the title of this thread, intelligence isn’t magic.
[1] This is often phrased as bits per pixel, because a variety of color depth formats use less than 8 bits per channel, or other non-byte divisions.
[2] Refer to https://accidentallyquadratic.tumblr.com/ for examples.
[3] A fun question to consider here becomes: where are the alpha bits stored? E.g. if we assume 3 bytes for RGB data, and then we have the 1 alpha bit, is each pixel taking up 9 bits, or are the pixels stored in runs of 8 pixels followed by a single “alpha” pixel with 8 bits describing the alpha channels of the previous 8 pixels?
[4] https://terrytao.wordpress.com/2008/09/23/the-divisor-bound/
[5] https://en.wikipedia.org/wiki/Pigeonhole_principle
[6] The way this works for real reverse engineering is that we already have expectations of what the data should look like, and we are tweaking inputs and outputs until we get the data we expected. An example would be figuring out a camera’s RAW format by taking pictures of carefully chosen targets like an all white wall, or a checkerboard wall, or an all red wall, and using the knowledge of those targets to find patterns in the data that we can decode.
So, I want to note a few things. The original Eliezer post was intended to argue against this line of reasoning:
I occasionally run into people who say something like, “There’s a theoretical limit on how much you can deduce about the outside world, given a finite amount of sensory data.”
He didn’t worry about compute, because that’s not a barrier on the theoretical limit. And in his story, the entire human civilization had decades to work on this problem.
But you’re right, in a practical world, compute is important.
I feel like you’re trying to make this take as much compute as possible.
Since you talked about headers, I feel like I need to reiterate that, when we are talking to a neural network, we do not add the extra data. The goal is to communicate with the neural network, so we intentionally put it in easier to understand formats.
In the practical cases for this to come up (e.g. a nascent superintelligence figuring out physics faster than we expect), we probably will also be inputting data in an easy to understand format.
Similarly, I expect you don’t need to check every possible esoteric format. The likelihood of the image using an encoding like 61 bits per pixel, with 2 for red, 54 for green, and 5 for blue, is just very low a priori. I do admit I’m not sure whether only allowing “reasonable” formats would cut the possibilities down into the computable realm (it obviously depends on the definition of reasonable, though part of me feels like you could, with a lot of work, actually build an objective likelihood score for various encodings). But it’s certainly a lot harder to argue that it isn’t than to just say “f(x) = (63 pick x) grows very fast.”
Though, since I don’t have a good sense for whether the “reasonable” ones would be a more computable number, I should update in your direction. (I tried to look into something sort of analogous, and the most common 200 passwords cover a little over 4% of all used passwords, which isn’t large enough for me to feel comfortable expecting that the most “likely” 1,000 formats would cover a significant quantity of the probability space, or anything.)
(Also potentially important: modern neural nets don’t really receive things as a string of bits, but instead as a string of numbers, nicely split up into separate nodes. Yes, those numbers are made of bits, but they’re floating point numbers, and the way neural nets interact with them is through all the floating point operations, so I don’t think the neural net actually touches the bit representation of the number in any meaningful way.)