“Pure Occam’s Razor” is roughly how I would describe it too. I suspect the difference in our mental models is one of “how far can Occam’s Razor take you?”. My suspicion is that “how well you can predict the next bit in a stream of data” and “how well you understand a stream of data” are, in most¹ cases, the same thing.
In terms of concrete predictions, I’d expect that if we
Had someone² generate a description of a physical scene in a universe that runs on different laws than ours, recorded by sensors that use a different modality than we use
Had someone code up a simulation of what sensor data with a small amount of noise would look like in that scenario and dump it out to a file
Created a substantial prize structured mostly³ like the Hutter Prize for compressed file + decompressor
we would see that the winning programs would look more like “generate a model and use that model and a similar rendering process to what was used to original file, plus an error correction table” and less like a general-purpose compressor⁴.
¹ But I probably wouldn’t go so far as to say “all”—if you handed me a stream of bits corresponding to concat(sha1(concat(secret, "1")), sha1(concat(secret, "2")), sha1(concat(secret, "3")), ...sha1(concat(secret, "999999"))), and I knew everything about the process of how you generated that stream of bits except what the value of secret was, I would say that I have a nonzero amount of understanding of that stream of bits despite having zero ability to predict the next bit in the sequence given all previous bits.
² Maybe Greg Egan, he seems to be pretty good at the “what would it be like to live in a universe with different underlying physics” thing
³ Ideally minus the bit about the compression and decompression algos having to work with such limited resources
⁴ Though I will note that, counter to my expectation here, as far as I know modern lossless photo compression does not look like “build a simulation of some objects in space, fine tune, and apply a correction layer” despite this being what my model of the world predicting humanity would create in a world where computation is cheap and bandwidth is expensive.
If I gave you a random binary file with no file extension and no metadata, how confident are you that you can tell me the entire provenance of that information?
Extremely confident you could construct a file such that I could not even start giving you the provenance of that information. Trivially, you could do something like “aes256 encrypt a 2^20 bytes of zeros with the passphrase ec738958c3e7c2eeb3b4”.
My contention is not so much “intelligence is extremely powerful” as it is “an image taken by a modern camera contains a surprisingly large amount of easy-to-decode information about the world, in the information-theoretic sense of the word information”.
If you were to give me a binary file with no extension and no metadata that is
Above 1,000,000 bytes in size
Able to be compressed to under 50% of its uncompressed size with some simple tool like gzip (to ensure that there is actually some discoverable structure)
Not able to be compressed under 10% of its uncompressed size by any well-known existing tools (to ensure that there is actually a meaningful amount of information in the file)
Not generated by some tricky gotcha process (e.g. a file that is 250,000 bytes from /dev/random followed by 750,000 bytes from /dev/zero)
then I’d expect that
It would be possible for me, given some time to examine the data, create a decompressor and a payload such that running the decompressor on the payload yields the original file, and the decompressor program + the payload have a total size of less than the original gzipped file
The decompressor would legibly contain a substantial amount of information about the structure of the data.
I would not expect that I would be able to tell you the entire provenance of the information, even if it were possible in principle to deduce from the file, since I am not a superintelligence or even particularly smart by human standards.
I notice that this is actually something that could be empirically tested—if you wanted we could actually run this experiment at some point (I anticipate being pretty busy for the next few weeks, but on the weekend of July 16⁄17 I will be tied to my computer for the duration of the weekend (on-call for work), so if you have significantly different expectations about what would happen, and want to actually run this experiment, that would be a good time to do it for me.
P.S. A lot of my expectations here come from the Hutter Prize, a substantial cash prize for generating good compressions of the first gigabyte of Wikipedia. The hypothesis behind the prize is that, in order to compress data, it is helpful to understand that data. The current highest-performing algorithm does not have source code available, but one of the best runner-ups that actually gives source code works using a process that looks like
The transform done by paq8hp1 through paq8hp5 is based on WRT by Przemyslaw Skibinski, which first appeared in PAsQDa and paqar, and later in paq8g and xml-wrt. The steps are as follows:
The input is parsed into seqences of all uppercase letters or all lowercase letters, or one uppercase letter followed by lowercase letters, e.g. “THE”, “the”, or “The”.
All uppercase words are prefixed by a special symbol (0E hex in paq8hp3, paq8hp4, paq8hp5). If a lowercase letter follows with no intervening characters (e.g. “THEre”, then a special symbol (0C hex) marks the end. (e.g. 0E “the” 0C “re”).
Capitalized words are prefixed with 7F hex (paq8hp3) or 40 hex (paq8hp4, paq8hp5) (e.g. “The” → 40 “the”).
All letters are converted to lower case.
Words are looked up in the dictionary. The first 80 words in the dictionary are coded with 1 byte: 80, 81, … CF (hex).
The next 2560 words (paq8hp1-4) or 3840 words (paq8hp5) are coded with 2 bytes: D080, D081, … EFCF (paq8hp1-4), or D080, … FFCF (paq8hp5).
The last 40960 words are coded with 3 bytes: F0D080, F0D081, … FFEFCF.
If a word does not match, then the longest matching prefix with length at least 6 is coded and the rest of the word is spelled.
If there is no matching prefix, then the longest matching suffix with length at least 6 is coded after spelling the preceding letters.
If no matching word, prefix, or suffix is found, the word is spelled. Capitalization coding occurs regardless.
Any input bytes with special meaning are escaped by prefixing with 06: 06, 0C, 0E, 40 or 7F, 80-FF.
WRT has additional capabilities depending on input, such as skipping encoding if little or no text is detected. The dictionary format is one word per line (linefeed only) with a 13 line header.
If there was a future entry into the Hutter Prize which substantially cut down the compressed size, and that future entry contained fewer rather than more rules about how the English language worked, I would be very surprised and consider that to go a long way towards invalidating my model of what it means to “understand” something.
contains actual, real information (it is not just random noise)
it’ll compress easily (there’s structure to it)
it’s totally useless to you (you can derive structure, but not meaning)
It’s 1 megabyte of telemetry captured from a real-time hardware/software system using a binary encoding for the message frames. The telemetry is, in other words, not self-describing.
Some facts:
A trivial compressor, e.g. zip, can compress this telemetry to a much smaller size.
It can also decompress the telemetry.
The size of the compressed file + the size of the compressor is less than the uncompressed telemetry.
However:
You cannot make any conclusions about the meaning of this data, even after you derive some type of structure from it.
Knowing that a 4 byte sequence starting at data[N * 20] for each integer N can be delta-encoded such that on average the delta-encoding of data[(N-1) * 20] saves more space than including both data[(N-1) * 20] and data[N * 20] does not tell you what those 4 bytes mean.
You can argue that maybe it’s a 4 byte value of some type. But you don’t know if it’s stored as Big Endian or Little Endian in this format.
You also don’t know if it is signed or unsigned, or if it is even a 2′s complement integer. It could be an IEEE 754 32-bit floating point value.
It could be a contiguous run of 2 separate 2 byte values that for some reason are correlated in such a way that they can be delta-encoded together efficiently, e.g. the sensor readings of redundant sensors measuring the same physical process, like a temperature.
Let’s say that for some reason you’re certain it is representing an unsigned 4-byte integer. What does it mean?
Is it the value of some sensor, as a rounded integer, like a distance measured in meters where we drop the fractional part?
Or are we actually using some type of encoding where we record the sensor value as a integer, but we have a calibration table we apply when post-processing this table to calculate a real physical value, like the look-up table used for thermocouples when interpreting the raw count measured on an ADC?
If it’s the latter and you don’t know what look-up table to use (what type of thermocouple is it? Type-K? Type-J?), good luck. Ditto for encoding where the actual physical value is calculated using coefficients for some linear or exponential transform, including an offset.
The same arguments apply if we try to assume it is a floating point value.
If you think this is being unfair because the original question was about image frames, the encoding that has been suggested in the comments for what image data looks like is describing a bitmap—a very simple, trivial way to encode image data, and incredibly inefficient in size for that reason. Even in the case of a bitmap, you’re jumping to the conclusion that you can reliably differentiate between an MxN image of 8-bit-per-channel RGB data vs an NxM image of 8-bit-per-channel RGB data, when the alternative might be that is some different size of image using say 16-bit-per-channel RGB data for higher resolution in colors, or maybe it is actually 8-bit-per-channel RGBA data because we’re including the alpha component for each pixel, or maybe the data is actually stored as HSL channels, or maybe it’s stored as BGR instead of RGB, or maybe the data is actually an MxN image of a 8-bit-per-channel grayscale data, or perhaps 24-bit-per-channel grayscale data. In the latter cases, there are not 3 values per pixel, because it’s grayscale! Not to be confused with storing a grayscale image in RGB channels, e.g. by repeating the same value for all 3 channels.
And all of the above is ignoring the elephant in the room: Video codecs are not a series of bitmaps. They are compressed and almost always use inter-frame compression meaning that individual video frames are compressed using knowledge of the previous video frames. Therefore, the actual contents of a video does not look like “a series of frames where each frame is a grid of cells where each cell is 3 values”. Likewise, most images on the internet are JPGs and PNGs, which are also compressed, and do not look like a grid of cells of three values.
Ok, but what about cameras?
“an image taken by a modern camera contains a surprisingly large amount of easy-to-decode information about the world, in the information-theoretic sense of the word information”
Cameras used by professional photographers often dump data in a RAW format which may or may not be compressed but is specific to the manufacturer of the camera because it’s tied to the actual camera hardware in the same way that my hypothetical telemetry is tied to the hardware/software system.
Providing a detailed and concise description of the content of raw files is highly problematic. There is no single raw format; formats can be similar or radically different. Different manufacturers use their own proprietary and typically undocumented formats, which are collectively known as raw format. Often they also change the format from one camera model to the next. Several major camera manufacturers, including Nikon, Canon and Sony, encrypt portions of the file in an attempt to prevent third-party tools from accessing them.
Is that “easy to decode”?
There is an argument here that it is still easy because all you need to do is “just” run through all of the various permutations I’ve described above until at the end of the process there is an image that “looks like” a reasonable image. I mean if you dump the color data and it looks like blue and red are swapped, maybe it was stored BGR instead of RGB, and there’s no harm there, right? And now we’ve buried the whole argument with a question of what does it mean for a result to look “reasonable”? When you’re given a totally unknown encoding and you want to decode it, and you have to make assumptions about what the parsed data is going to look like just to evaluate the strength of your decoding, is that very solid ground? Are you certain that the algorithm being described here is “easy”, in the sense that is computationally efficient?
“you’re jumping to the conclusion that you can reliably differentiate between...”
I think you absolutely can, and the idea was already described earlier.
You pay attention to regularities in the data. In most non-random images, pixels near to each other are similar. In an MxN image, the pixel below is a[i+M], whereas in an NxM image, it’s a[i+N]. If, across the whole image, the difference between a[i+M] is less than the difference between a[i+N], it’s more likely an MxN image. I expect you could find the resolution by searching all possible resolutions from 1x<length> to <length>x1, and finding which minimizes average distance of “adjacent” pixels.
Similarly (though you’d likely do this first), you can tell the difference between RGB and RGBA. If you have (255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0), this is probably 4 red pixels in RGB, and not a fully opaque red pixel, followed by a fully transparent green pixel, followed by a fully transparent blue pixel in RGBA. It could be 2 pixels that are mostly red and slightly green in 16 bit RGB, though. Not sure how you could piece that out.
Aliens would probably do a different encoding. We don’t know what the rods and cones in their eye-equivalents are, and maybe they respond to different colors. Maybe it’s not Red Green Blue, but instead Purple Chartreuse Infrared. I’m not sure this matters. It just means your eggplants look red.
I think, even if it had 5 (or 6, or 20) channels, this regularity would be born out, between bit i and bit i+5 being less than bit i and i+1, 2, 3, or 4.
Now, there’s still a lot that that doesn’t get you yet. But given that there are ways to figure out those, I kinda think I should have decent expectations there’s ways to figure out other things, too, even if I don’t know them.
I do also think it’s important to zoom out to the original point. Eliezer posed this as an idea about AGI. We currently sometimes feed images to our AIs, and when we do, we feed them as raw RGB data, not encoded, because we know that would make it harder for the AI to figure out. I think it would be very weird, if we were trying to train an AI, to send it compressed video, and much more likely that we do, in fact, send it raw RGB values frame by frame.
I will also say that the original claim (by Eliezer, not the top of this thread), was not physics from one frame, but physics from like, 3 frames, so you get motion, and acceleration. 4 frames gets you to third derivatives, which, in our world, don’t matter that much. Having multiple frames also aids in ideas like the 3d → 2d projection, since motion and occlusion are hints at that.
And I think the whole question is “does this image look reasonable”, which you’re right, is not a rigorous mathematical concept. But “‘looks reasonable’ is not a rigorous concept” doesn’t get followed by “and therefore is impossible” Above are some of the mathematical descriptions of what ‘reasonable’ means in certain contexts. Rendering a 100x75 image as 75x100 will not “look reasonable”. But it’s not beyond thinking and math to determine what you mean by that.
The core problem remains computational complexity.
Statements like “does this image look reasonable” or saying “you pay attention to regularities in the data”, or “find the resolution by searching all possible resolutions” are all hiding high computational costs behind short English descriptions.
Let’s consider the case of a 1280x720 pixel image. That’s the same as 921600 pixels.
How many bytes is that?
It depends. How many bytes per pixel?[1] In my post, I explained there could be 1-byte-per-pixel grayscale, or perhaps 3-bytes-per-pixel RGB using [0, 255] values for each color channel, or maybe 6-bytes-per-pixel with [0, 65535] values for each color channel, or maybe something like 4-bytes-per-pixel because we have 1-byte RGB channels and a 1-byte alpha channel.
Let’s assume that a reasonable cutoff for how many bytes per pixel an encoding could be using is say 8 bytes per pixel, or a hypothetical 64-bit color depth.
How many ways can we divide this between channels?
If we assume 3 channels, it’s 1953. If we assume 4 channels, it’s 39711. Also if it turns out to be 5 channels, it’s 595665.
This is a pretty fast growing function. The following is a plot.
Note that the red line is O(2^N) and the black line barely visible at the bottom is O(N^2). N^2 is a notorious runtime complexity because it’s right on the threshold of what is generally unacceptable performance.[2]
Let’s hope that this file isn’t actually a frame buffer from a graphics card with 32 bits per channel or a 128 bit per pixel / 16 byte per pixel.
Unfortunately, we still need to repeat this calculation for all of the possibilities for how many bits per pixel this image could be. We need to add in the possibility that it is 63 bits per pixel, or 62 bits per pixel, or 61 bits per pixel.
In case anyone wants to claim this is unreasonable, it’s not impossible to have image formats that have RGBA data, but only 1 bit associated with the alpha data for each pixel. [3]
And for each of these scenarios, we need to question how many channels of color data there are.
1? Grayscale.
2? Grayscale, with an alpha channel maybe?
3? RGB, probably, or something like HSV.
4? RGBA, or maybe it’s the RGBG layout I described for a RAW encoding of a Bayer filter, or maybe it’s CMYK for printing.
5? This is getting weird, but it’s not impossible. We could be encoding additional metadata into each pixel, e.g. distance from the camera.
6? Actually, this question how how many channels there are is very important, given the fast growing function above.
7? This one question, if we don’t know the right answer, is sufficient to make this algorithm pretty much impossible to run.
8? When we say we can try all of options, that’s not actually possible.
9? What I think people mean is that we can use heuristics to pick the likely options first and try them, and then fall back to more esoteric options if the initial results don’t make sense.
10? That’s the difference between average run-time and worst case run-time.
11? The point that I am trying to make is that the worst case run-time for decoding an arbitrary binary file is pretty much unbounded, because there’s a ridiculous amount of choice possible.
12? Some examples of “image” formats that have large numbers of channels per “pixel” are things like RADAR / LIDAR sensors, e.g. it’s possible to have 5 channels per pixel for defining 3D coordinates (relative to the sensor), range, and intensity.
You actually ran into this problem yourself.
Similarly (though you’d likely do this first), you can tell the difference between RGB and RGBA. If you have (255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0), this is probably 4 red pixels in RGB, and not a fully opaque red pixel, followed by a fully transparent green pixel, followed by a fully transparent blue pixel in RGBA. It could be 2 pixels that are mostly red and slightly green in 16 bit RGB, though. Not sure how you could piece that out.
Summing up all of the possibilities above is left as an exercise for the reader, and we’ll call that sum K.
Without loss of generality, let’s say our image was encoded as 3 bytes per pixel divided between 3 RGB color channels of 1 byte each.
Our 1280x720 image is actually 2764800 bytes as a binary file.
But since we’re decoding it from the other side, and we don’t know it’s 1280x720, when we’re staring at this pile of 2764800 bytes, we need to first assume how many bytes per pixel it is, so that we can divide the total bytes by the bytes per pixel to calculate the number of pixels.
Then, we need to test each possible resolutions as you’ve suggested.
The number of possible resolutions is the same as the number of divisors of the number of pixels. The equation for providing an upper bound is exp(log(N)/log(log(N)))[4], but the average number of divisors is approximately log(N).
Oops, no it isn’t!
Files have headers! How large is the header? For a bitmap, it’s anywhere between 26 and 138 bytes. The JPEG header is at least 2 bytes. PNG uses 8 bytes. GIF uses at least 14 bytes.
Now we need to make the following choices:
Guess at how many bytes per pixel the data is.
Guess at the length of the header. (maybe it’s 0, there is no header!)
Calculate the factorization of the remaining bytes N for the different possible resolutions.
Hope that there isn’t a footer, checksum, or any type of other metadata hanging out in the sea of bytes. This is common too!
Once we’ve made our choices above, then we multiply that by log(N) for the number of resolutions to test, and then we’ll apply the suggested metric. Remember that when considering the different pixel formats and ways the color channel data could be represented, the number was K, and that’s what we’re multiplying by log(N).
In most non-random images, pixels near to each other are similar. In an MxN image, the pixel below is a[i+M], whereas in an NxM image, it’s a[i+N]. If, across the whole image, the difference between a[i+M] is less than the difference between a[i+N], it’s more likely an MxN image. I expect you could find the resolution by searching all possible resolutions from 1x<length> to <length>x1, and finding which minimizes average distance of “adjacent” pixels.
What you’re describing here is actually similar to a common metric used in algorithms for automatically focusing cameras by calculating the contrast of an image, except for focusing you want to maximize contrast instead of minimize it.
The interesting problem with this metric is that it’s basically a one-way function. For a given image, you can compute this metric. However, minimizing this metric is not the same as knowing that you’ve decoded the image correctly. It says you’ve found a decoding, which did minimize the metric. It does not mean that is the correct decoding.
A trivial proof:
Consider an image and the reversal of that image along the horizontal axis.
These have the same metric.
So the same metric can yield two different images.
A slightly less trivial proof:
For a given “image” of N bytes of image data, there are 2^(N*8) possible bit patterns.
Assuming the metric is calculated as an 8-byte IEEE 754 double, there are only 2^(8*8) possible bit patterns.
When N > 8, there are more bit patterns than values allowed in a double, so multiple images need to map to the same metric.
The difference between our 2^(2764800*8) image space and the 2^64 metric is, uhhh, 10^(10^6.8). Imagine 10^(10^6.8) pigeons. What a mess.[5]
The metric cannot work as described. There will be various arbitrary interpretations of the data possible to minimize this metric, and almost all of those will result in images that are definitely not the image that was actually encoded, but did minimize the metric. There is no reliable way to do this because it isn’t possible. When you have a pile of data, and you want to reverse meaning from it, there is not one “correct” message that you can divine from it.[6] See also: numerology, for an example that doesn’t involve binary file encodings.
Even pretending that this metric did work, what’s the time complexity of it? We have to check each pixel, so it’s O(N). There’s a constant factor for each pixel computation. How large is that constant? Let’s pretend it’s small and ignore it.
So now we’ve got K*O(N*log(N)) which is the time complexity of lots of useful algorithms, but we’ve got that awkward constant K in the front. Remember that the constant K reflects the number of choices for different bits per pixel, bits per channel, and the number of channels of data per pixel. Unfortunately, that constant is the one that was growing a rate best described as “absurd”. That constant is the actual definition of what it means to have no priors. When I said “you can generate arbitrarily many hypotheses, but if you don’t control what data you receive, and there’s no interaction possible, then you can’t rule out hypotheses”, what I’m describing is this constant.
I think it would be very weird, if we were trying to train an AI, to send it compressed video, and much more likely that we do, in fact, send it raw RGB values frame by frame.
What I care about is the difference between:
1. Things that are computable. 2. Things that are computable efficiently.
These sets are not the same. Capabilities of a superintelligent AGI lie only in the second set, not the first.
It is important to understand that a superintelligent AGI is not brute forcing this in the way that has been repeatedly described in this thread. Instead the superintelligent AGI is going to use a bunch of heuristics or knowledge about the provenance of the binary file, combined with access to the internet so that it can just lookup the various headers and features of common image formats, and it’ll go through and check all of those, and then if it isn’t any of the usual suspects, it’ll throw up metaphorical hands, and concede defeat. Or, to quote the title of this thread, intelligence isn’t magic.
A fun question to consider here becomes: where are the alpha bits stored? E.g. if we assume 3 bytes for RGB data, and then we have the 1 alpha bit, is each pixel taking up 9 bits, or are the pixels stored in runs of 8 pixels followed by a single “alpha” pixel with 8 bits describing the alpha channels of the previous 8 pixels?
The way this works for real reverse engineering is that we already have expectations of what the data should look like, and we are tweaking inputs and outputs until we get the data we expected. An example would be figuring out a camera’s RAW format by taking pictures of carefully chosen targets like an all white wall, or a checkerboard wall, or an all red wall, and using the knowledge of those targets to find patterns in the data that we can decode.
So, I want to note a few things. The original Eliezer post was intended to argue against this line of reasoning:
I occasionally run into people who say something like, “There’s a theoretical limit on how much you can deduce about the outside world, given a finite amount of sensory data.”
He didn’t worry about compute, because that’s not a barrier on the theoretical limit. And in his story, the entire human civilization had decades to work on this problem.
But you’re right, in a practical world, compute is important.
I feel like you’re trying to make this take as much compute as possible.
Since you talked about headers, I feel like I need to reiterate that, when we are talking to a neural network, we do not add the extra data. The goal is to communicate with the neural network, so we intentionally put it in easier to understand formats.
In the practical cases for this to come up (e.g. a nascent superintelligence figuring out physics faster than we expect), we probably will also be inputting data in an easy to understand format.
Similarly, I expect you don’t need to check every possible esoteric format. The likelihood of the image using an encoding like 61 bits per pixel, with 2 for red, 54 for green and 5 for blue is just, very low, a priori. I do admit I’m not sure if only using “reasonable” formats would cut down the possibilities into the computable realm (obviously depends on definitions of reasonable, though part of me feels like you could (with a lot of work) actually have an objective likeliness score of various encodings). But certainly it’s a lot harder to say that it isn’t than just saying “f(x) = (63 pick x), grows very fast.”
Though, since I don’t have a good sense for whether “reasonable” ones would be a more computable number, I should update in your direction. (I tried to look into something sort of analogous, and the most common 200 passwords cover a little over 4% of all used passwords, which, isn’t large enough for me to feel comfortable expecting that the most “likely” 1,000 formats would cover a significant quantity of the probability space, or anything.)
(Also potentially important. Modern neural nets don’t really receive things as a string of bits, but instead as a string of numbers, nicely split up into separate nodes. (yes, those numbers are made of bits, but they’re floating point numbers, and the way neural nets interact with them is through all the floating point operations, so I don’t think the neural net actually touches the bit representation of the number in any meaningful way.)
I don’t see the file, but again if you want to run this the mentioned dates from the other thread work best.
I am not exactly sure what you mean by the phrase “derive structure but not meaning”—a couple possibilities come to mind.
Let’s say I have a file that is composed of 1,209,600 bytes of 0s. That file was generated by taking pressure readings, in kPa, once per second on the Hubble Space Telescope for a two week period.
If I said “this file is a sequence of 2^11 x 3^3 x 5^2 x 7 zeros”, would that be a minimal example of “deriving the structure but not the meaning”? If so I expect that situation to be pretty common.
If not, some further questions:
Is it coherent to “understand the meaning but not the structure”?
Would it be possible to go from “understands the structure but not the meaning” to “understands both” purely through receiving more data?
If not , what about through interaction
If not, what differences would you expect to observe between a world where I understood the structure but not the meaning of something and a world where I understood both?
I was describing a file that would fit your criteria but not be useful. I was explaining in bullet points all of the reasons why that file can’t be decoded without external knowledge.
I think that you understood the point though, with your example of data from the Hubble Space Telescope. One caveat: I want to be clear that the file does not have to be all zeroes. All zeroes would violate your criteria that the data cannot be compressed to less than 10% of it’s uncompressed size, since all zeroes can be trivially run-length-encoded.
But let’s look at this anyway.
You said the file is all zeroes, and it’s 1209600 bytes. You also said it’s pressure readings in kPa, taken once per second. You then said it’s 2^11 x 3^3 x 5^2 x 7 zeroes—I’m a little bit confused on where this number came from? That number is 9676800, which is larger than the file size in bytes. If I divide by 8, then I get the stated file size, so maybe you’re referring to the binary sequence of bits being either 0 or 1, and then on this hardware a byte is 8-bits, and that’s how those numbers connect.
In a trivial sense, yes, that is “deriving the structure but not the meaning”.
What I really meant was that we would struggle to distinguish between:
The file is 1209600 separate measurements, each 1-byte, taken by a single pressure sensor.
The file is 604800 measurements of 1 byte each from 2 redundant pressure sensors.
The file is 302400 measurements, each 4-bytes, taken by a single pressure sensor.
The file is 241920 measurements, each a 4-byte timestamp field and a 1-byte pressure sensor value.
Or considering some number of values, with some N-byte width:
The pressure sensor value is in kPa.
The pressure sensor value is in psi.
The pressure sensor value is in atmospheres.
The pressure sensor value is in lbf.
The pressure sensor value is in raw counts because it’s a direct ADC measurement, so it needs to be converted to the actual pressure via a linear transform.
We don’t know the ADC’s reference voltage.
Or the data sheet for this pressure sensor.
Your questions are good.
1. Is it coherent to “understand the meaning but not the structure”?
Probably not? I guess there’s like a weird “gotcha” answer to this question where I could describe what a format tells you, in words, but not show you the format itself, and maybe we could quibble that in such a scenario you’d understand “the meaning but not the structure”.
EDIT: I think I’ve changed my mind on this answer since posting—yes, there are scenarios where you would understand the meaning of something, but not necessarily the structure of it. A trivial example would be something like a video game save file. You know that some file represents your latest game, and that it allows you to continue and resume from where you left off. You know how that file was created (you pressed “Save Game”), and you know how to use it (press “Load Game”, select save game name), but without some amount of reverse-engineering, you don’t know the structure of it (assuming that the saved game is not stored as plain text). For non-video-game examples, something like a calibration file produced by some system where the system can both 1.) produce the calibration via some type of self-test or other procedure, and 2.) receive that calibration file. Or some type of system that can be configured by the user, and then you can save that configuration file to disc, so that you can upload it back to the system. Maybe you understand exactly what the configuration file will do, but you never bothered to learn the format of the file itself.
2. Would it be possible to go from “understands the structure but not the meaning” to “understands both” purely through receiving more data?
In general, no. The problem is that you can generate arbitrarily many hypotheses, but if you don’t control what data you receive, and there’s no interaction possible, then you can’t rule out hypotheses. You’d have to just get exceedingly lucky and repeatedly be given more data that is basically designed to be interpreted correctly, i.e. the data, even though it is in a binary format, is self-describing. These formats do exist by the way. It’s common for binary formats to include things like length prefixes telling you how many bytes follow some header. That’s the type of thing you wouldn’t notice with a single piece of data, but you would notice if you had a bunch of examples of data all sharing the same unknown schema.
3. If not , what about through interaction
Yes, this is how we actually reverse-engineer unknown binary formats. Almost always we have some proprietary software that can either produce the format, or read the format, and usually both. We don’t have the schema, and for the sake of argument let’s say we don’t want decompile the software. An example: video game save files that are stored using some ad-hoc binary schema.
What we generally do is start poking known values into the software and seeing what the output file looks like—like a specific date time, a naming something a certain field, or making sure some value or checkbox is entered in a specific way. Then we permute that entry and see how the file changes. The worse thing would be if almost the entire file changes, which tells us the file is either encrypted OR it’s a direct dump of RAM to disk from some data structure with undefined ordering, like a hash map, and the structure is determined by keys which we haven’t figured out yet.
Likewise, we do the reverse. We change the format and we plug it back into the software (or an external API, if we’re trying to understand how some website works). What we’re hoping for is an error message that gives us additional information—like we change some bytes and now it complains a length is invalid, that’s probably related to length. Or if we change some bytes and it says the time cannot be negative, then that might be a time related. Or we change any byte and it rejects the data as being invalid, whoops, probably encrypted again.
The key here is that we have the same problem as question 2 -- we can generate arbitrarily many hypotheses—but we have a solution now. We can design experiments and iteratively rule out hypotheses, over and over, until we figure out the actual meaning of the format—not just the structure of it, but what values actually represent.
Again, there are limits. For instance, there are reverse-engineered binary formats where the best we know is that some byte needs to be some constant value for some reason. Maybe it’s an internal software version? Who knows! We figured out the structure of that value—the byte at location 0x806 shall be 203 -- but we don’t know the meaning of it.
4. If not, what differences would you expect to observe between a world where I understood the structure but not the meaning of something and a world where I understood both?
I don’t think this algorithm could decode arbitrary data in a reasonable amount of time. I think it could decode some particularly structured types of data, and I think “fairly unprocessed sensor data from a very large number of nearly identical sensors” is one of those types of data.
My whole point is that “unprocessed sensor data” can be arbitrarily tied to hardware in ways that make it impossible to decode without knowledge of that particular hardware, e.g. ADC reference voltages, datasheets, or calibration tables.
RAW I’d expect is better than a bitmap, assuming no encryption step, on the hypothesis that more data is better.
The opposite. Bitmaps are much easier than RAW formats. A random RAW format, assuming no interaction with the camera hardware or software that can read/write that format, might as well be impossible. E.g. consider the description of a RAW format here. Would you have known that the way the camera hardware works involves pixels that are actually 4 separate color sensors arranged as (in this example) an RGBG grid (called a Bayer filter), and that to calculate the pixel colors for a bitmap you need to interpolate between 4 of those raw color sensors for each color channel on each pixel in the bitmap, and the resulting file size is going to be 3x larger? Or that the size of the output image is not the size of the sensor, so there’s dead pixels that need to be ignored during the conversion? Again, this is just describing some specific RAW format—other RAW formats work differently, because cameras don’t all use the same type of color sensor.
The point as I see it is more about whether it’s possible with a halfway reasonable amount of compute or whether That Alien Message was completely off-base.
BTW I see that someone did a very minimal version (with an array of floating point numbers generated by a pretty simple process) of this test, but I’m still up for doing a fuller test—my schedule now is a lot more clear than it was last month.
Is it possible to decode a file that was deliberately constructed to be decoded, without a priori knowledge? This is vaguely what That Alien Message is about, at least in the first part of the post where aliens are sending a message to humanity.
Is it possible to decode a file that has an arbitrary binary schema, without a priori knowledge? This is the discussion point that I’ve been arguing over with regard to stuff like decoding CAMERA raw formats, or sensor data from a hardware/software system. This is also the area where I disagree with That Alien Message—I don’t think that one-shot examples allow robust generalization.
I don’t think (1) is a particularly interesting question, because last weekend I convinced myself that the answer is yes, you can transfer data in a way that it can be decoded, with very few assumptions on the part of the receiver. I do have a file I created for this purpose. If you want, I’ll send you it.
I started creating a file for (2), but I’m not really sure how to gauge what is “fair” vs “deliberately obfuscated” in terms of encoding. I am conflicted. Even if I stick to encoding techniques I’ve seen in the real world, I feel like I can make choices on this file encoding that make the likelihood of others decoding it very low. That’s exactly what we’re arguing about on (2). However, I don’t think it will be particularly interesting or fun for people trying to decode it. Maybe that’s ok?
I’m not sure either one quite captures exactly what I mean, but I think (1) is probably closer than (2), with the caveat that I don’t think the file necessarily has to be deliberately constructed to be decoded without a-priori knowledge, but it should be constructed to have as close as possible to a 1:1 mapping between the structure of the process used to capture the data and the structure of the underlying data stream.
I notice I am somewhat confused by the inclusion of camera raw formats in (2) rather than in (1) though—I would expect that moving from a file in camera raw format to a jpeg would move you substantially in the direction from (1) to (2).
It sounds like maybe you have something resembling “some sensor data of something unusual in an unconventional but not intentionally obfuscated format”? If so, that sounds pretty much exactly like what I’m looking for.
However, I don’t think it will be particularly interesting or fun for people trying to decode it. Maybe that’s ok?
I think it’s fine if it’s not interesting or fun to decode because nobody can get a handle on the structure—if that’s the case, it will be interesting to see why we are not able to do that, and especially interesting if the file ends up looking like one of the things we would have predicted ahead of time would be decodable.
⁴ Though I will note that, counter to my expectation here, as far as I know modern lossless photo compression does not look like “build a simulation of some objects in space, fine tune, and apply a correction layer” despite this being what my model of the world predicting humanity would create in a world where computation is cheap and bandwidth is expensive.
Because computing isn’t that cheap, and bandwidth isn’t that expensive, and no one wants to get their GPU cluster running to save 500 bits of bandwidth.
“Pure Occam’s Razor” is roughly how I would describe it too. I suspect the difference in our mental models is one of “how far can Occam’s Razor take you?”. My suspicion is that “how well you can predict the next bit in a stream of data” and “how well you understand a stream of data” are, in most¹ cases, the same thing.
In terms of concrete predictions, I’d expect that if we
Had someone² generate a description of a physical scene in a universe that runs on different laws than ours, recorded by sensors that use a different modality than we use
Had someone code up a simulation of what sensor data with a small amount of noise would look like in that scenario and dump it out to a file
Created a substantial prize structured mostly³ like the Hutter Prize for compressed file + decompressor
we would see that the winning programs would look more like “generate a model and use that model and a similar rendering process to what was used to original file, plus an error correction table” and less like a general-purpose compressor⁴.
¹ But I probably wouldn’t go so far as to say “all”—if you handed me a stream of bits corresponding to
concat(sha1(concat(secret, "1")), sha1(concat(secret, "2")), sha1(concat(secret, "3")), ...sha1(concat(secret, "999999")))
, and I knew everything about the process of how you generated that stream of bits except what the value ofsecret
was, I would say that I have a nonzero amount of understanding of that stream of bits despite having zero ability to predict the next bit in the sequence given all previous bits.² Maybe Greg Egan, he seems to be pretty good at the “what would it be like to live in a universe with different underlying physics” thing
³ Ideally minus the bit about the compression and decompression algos having to work with such limited resources
⁴ Though I will note that, counter to my expectation here, as far as I know modern lossless photo compression does not look like “build a simulation of some objects in space, fine tune, and apply a correction layer” despite this being what my model of the world predicting humanity would create in a world where computation is cheap and bandwidth is expensive.
Edit: formatting
If I gave you a random binary file with no file extension and no metadata, how confident are you that you can tell me the entire provenance of that information?
Extremely confident you could construct a file such that I could not even start giving you the provenance of that information. Trivially, you could do something like “aes256 encrypt a 2^20 bytes of zeros with the passphrase ec738958c3e7c2eeb3b4”.
My contention is not so much “intelligence is extremely powerful” as it is “an image taken by a modern camera contains a surprisingly large amount of easy-to-decode information about the world, in the information-theoretic sense of the word information”.
If you were to give me a binary file with no extension and no metadata that is
Above 1,000,000 bytes in size
Able to be compressed to under 50% of its uncompressed size with some simple tool like
gzip
(to ensure that there is actually some discoverable structure)Not able to be compressed under 10% of its uncompressed size by any well-known existing tools (to ensure that there is actually a meaningful amount of information in the file)
Not generated by some tricky gotcha process (e.g. a file that is 250,000 bytes from
/dev/random
followed by 750,000 bytes from/dev/zero
)then I’d expect that
It would be possible for me, given some time to examine the data, create a decompressor and a payload such that running the decompressor on the payload yields the original file, and the decompressor program + the payload have a total size of less than the original gzipped file
The decompressor would legibly contain a substantial amount of information about the structure of the data.
I would not expect that I would be able to tell you the entire provenance of the information, even if it were possible in principle to deduce from the file, since I am not a superintelligence or even particularly smart by human standards.
I notice that this is actually something that could be empirically tested—if you wanted we could actually run this experiment at some point (I anticipate being pretty busy for the next few weeks, but on the weekend of July 16⁄17 I will be tied to my computer for the duration of the weekend (on-call for work), so if you have significantly different expectations about what would happen, and want to actually run this experiment, that would be a good time to do it for me.
P.S. A lot of my expectations here come from the Hutter Prize, a substantial cash prize for generating good compressions of the first gigabyte of Wikipedia. The hypothesis behind the prize is that, in order to compress data, it is helpful to understand that data. The current highest-performing algorithm does not have source code available, but one of the best runner-ups that actually gives source code works using a process that looks like
If there was a future entry into the Hutter Prize which substantially cut down the compressed size, and that future entry contained fewer rather than more rules about how the English language worked, I would be very surprised and consider that to go a long way towards invalidating my model of what it means to “understand” something.
Here is an example of a large file that:
contains actual, real information (it is not just random noise)
it’ll compress easily (there’s structure to it)
it’s totally useless to you (you can derive structure, but not meaning)
It’s 1 megabyte of telemetry captured from a real-time hardware/software system using a binary encoding for the message frames. The telemetry is, in other words, not self-describing.
Some facts:
A trivial compressor, e.g. zip, can compress this telemetry to a much smaller size.
It can also decompress the telemetry.
The size of the compressed file + the size of the compressor is less than the uncompressed telemetry.
However:
You cannot make any conclusions about the meaning of this data, even after you derive some type of structure from it.
Knowing that a 4 byte sequence starting at
data[N * 20]
for each integer N can be delta-encoded such that on average the delta-encoding ofdata[(N-1) * 20]
saves more space than including bothdata[(N-1) * 20]
anddata[N * 20]
does not tell you what those 4 bytes mean.You can argue that maybe it’s a 4 byte value of some type.
But you don’t know if it’s stored as Big Endian or Little Endian in this format.
You also don’t know if it is signed or unsigned, or if it is even a 2′s complement integer.
It could be an IEEE 754 32-bit floating point value.
It could be a contiguous run of 2 separate 2 byte values that for some reason are correlated in such a way that they can be delta-encoded together efficiently, e.g. the sensor readings of redundant sensors measuring the same physical process, like a temperature.
Let’s say that for some reason you’re certain it is representing an unsigned 4-byte integer.
What does it mean?
Is it the value of some sensor, as a rounded integer, like a distance measured in meters where we drop the fractional part?
Or are we actually using some type of encoding where we record the sensor value as a integer, but we have a calibration table we apply when post-processing this table to calculate a real physical value, like the look-up table used for thermocouples when interpreting the raw count measured on an ADC?
If it’s the latter and you don’t know what look-up table to use (what type of thermocouple is it? Type-K? Type-J?), good luck. Ditto for encoding where the actual physical value is calculated using coefficients for some linear or exponential transform, including an offset.
The same arguments apply if we try to assume it is a floating point value.
If you think this is being unfair because the original question was about image frames, the encoding that has been suggested in the comments for what image data looks like is describing a bitmap—a very simple, trivial way to encode image data, and incredibly inefficient in size for that reason. Even in the case of a bitmap, you’re jumping to the conclusion that you can reliably differentiate between an MxN image of 8-bit-per-channel RGB data vs an NxM image of 8-bit-per-channel RGB data, when the alternative might be that is some different size of image using say 16-bit-per-channel RGB data for higher resolution in colors, or maybe it is actually 8-bit-per-channel RGBA data because we’re including the alpha component for each pixel, or maybe the data is actually stored as HSL channels, or maybe it’s stored as BGR instead of RGB, or maybe the data is actually an MxN image of a 8-bit-per-channel grayscale data, or perhaps 24-bit-per-channel grayscale data. In the latter cases, there are not 3 values per pixel, because it’s grayscale! Not to be confused with storing a grayscale image in RGB channels, e.g. by repeating the same value for all 3 channels.
And all of the above is ignoring the elephant in the room: Video codecs are not a series of bitmaps. They are compressed and almost always use inter-frame compression meaning that individual video frames are compressed using knowledge of the previous video frames. Therefore, the actual contents of a video does not look like “a series of frames where each frame is a grid of cells where each cell is 3 values”. Likewise, most images on the internet are JPGs and PNGs, which are also compressed, and do not look like a grid of cells of three values.
Ok, but what about cameras?
Cameras used by professional photographers often dump data in a RAW format which may or may not be compressed but is specific to the manufacturer of the camera because it’s tied to the actual camera hardware in the same way that my hypothetical telemetry is tied to the hardware/software system.
Here’s the Wikipedia list of RAW formats:
Is that “easy to decode”?
There is an argument here that it is still easy because all you need to do is “just” run through all of the various permutations I’ve described above until at the end of the process there is an image that “looks like” a reasonable image. I mean if you dump the color data and it looks like blue and red are swapped, maybe it was stored BGR instead of RGB, and there’s no harm there, right? And now we’ve buried the whole argument with a question of what does it mean for a result to look “reasonable”? When you’re given a totally unknown encoding and you want to decode it, and you have to make assumptions about what the parsed data is going to look like just to evaluate the strength of your decoding, is that very solid ground? Are you certain that the algorithm being described here is “easy”, in the sense that is computationally efficient?
“you’re jumping to the conclusion that you can reliably differentiate between...”
I think you absolutely can, and the idea was already described earlier.
You pay attention to regularities in the data. In most non-random images, pixels near to each other are similar. In an MxN image, the pixel below is a[i+M], whereas in an NxM image, it’s a[i+N]. If, across the whole image, the difference between a[i+M] is less than the difference between a[i+N], it’s more likely an MxN image. I expect you could find the resolution by searching all possible resolutions from 1x<length> to <length>x1, and finding which minimizes average distance of “adjacent” pixels.
Similarly (though you’d likely do this first), you can tell the difference between RGB and RGBA. If you have (255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0), this is probably 4 red pixels in RGB, and not a fully opaque red pixel, followed by a fully transparent green pixel, followed by a fully transparent blue pixel in RGBA. It could be 2 pixels that are mostly red and slightly green in 16 bit RGB, though. Not sure how you could piece that out.
Aliens would probably do a different encoding. We don’t know what the rods and cones in their eye-equivalents are, and maybe they respond to different colors. Maybe it’s not Red Green Blue, but instead Purple Chartreuse Infrared. I’m not sure this matters. It just means your eggplants look red.
I think, even if it had 5 (or 6, or 20) channels, this regularity would be born out, between bit i and bit i+5 being less than bit i and i+1, 2, 3, or 4.
Now, there’s still a lot that that doesn’t get you yet. But given that there are ways to figure out those, I kinda think I should have decent expectations there’s ways to figure out other things, too, even if I don’t know them.
I do also think it’s important to zoom out to the original point. Eliezer posed this as an idea about AGI. We currently sometimes feed images to our AIs, and when we do, we feed them as raw RGB data, not encoded, because we know that would make it harder for the AI to figure out. I think it would be very weird, if we were trying to train an AI, to send it compressed video, and much more likely that we do, in fact, send it raw RGB values frame by frame.
I will also say that the original claim (by Eliezer, not the top of this thread), was not physics from one frame, but physics from like, 3 frames, so you get motion, and acceleration. 4 frames gets you to third derivatives, which, in our world, don’t matter that much. Having multiple frames also aids in ideas like the 3d → 2d projection, since motion and occlusion are hints at that.
And I think the whole question is “does this image look reasonable”, which you’re right, is not a rigorous mathematical concept. But “‘looks reasonable’ is not a rigorous concept” doesn’t get followed by “and therefore is impossible” Above are some of the mathematical descriptions of what ‘reasonable’ means in certain contexts. Rendering a 100x75 image as 75x100 will not “look reasonable”. But it’s not beyond thinking and math to determine what you mean by that.
The core problem remains computational complexity.
Statements like “does this image look reasonable” or saying “you pay attention to regularities in the data”, or “find the resolution by searching all possible resolutions” are all hiding high computational costs behind short English descriptions.
Let’s consider the case of a 1280x720 pixel image.
That’s the same as 921600 pixels.
How many bytes is that?
It depends. How many bytes per pixel?[1] In my post, I explained there could be 1-byte-per-pixel grayscale, or perhaps 3-bytes-per-pixel RGB using [0, 255] values for each color channel, or maybe 6-bytes-per-pixel with [0, 65535] values for each color channel, or maybe something like 4-bytes-per-pixel because we have 1-byte RGB channels and a 1-byte alpha channel.
Let’s assume that a reasonable cutoff for how many bytes per pixel an encoding could be using is say 8 bytes per pixel, or a hypothetical 64-bit color depth.
How many ways can we divide this between channels?
If we assume 3 channels, it’s 1953.
If we assume 4 channels, it’s 39711.
Also if it turns out to be 5 channels, it’s 595665.
This is a pretty fast growing function. The following is a plot.
Note that the red line is
O(2^N)
and the black line barely visible at the bottom isO(N^2)
.N^2
is a notorious runtime complexity because it’s right on the threshold of what is generally unacceptable performance.[2]Let’s hope that this file isn’t actually a frame buffer from a graphics card with 32 bits per channel or a 128 bit per pixel / 16 byte per pixel.
Unfortunately, we still need to repeat this calculation for all of the possibilities for how many bits per pixel this image could be. We need to add in the possibility that it is 63 bits per pixel, or 62 bits per pixel, or 61 bits per pixel.
In case anyone wants to claim this is unreasonable, it’s not impossible to have image formats that have RGBA data, but only 1 bit associated with the alpha data for each pixel. [3]
And for each of these scenarios, we need to question how many channels of color data there are.
1? Grayscale.
2? Grayscale, with an alpha channel maybe?
3? RGB, probably, or something like HSV.
4? RGBA, or maybe it’s the RGBG layout I described for a RAW encoding of a Bayer filter, or maybe it’s CMYK for printing.
5? This is getting weird, but it’s not impossible. We could be encoding additional metadata into each pixel, e.g. distance from the camera.
6? Actually, this question how how many channels there are is very important, given the fast growing function above.
7? This one question, if we don’t know the right answer, is sufficient to make this algorithm pretty much impossible to run.
8? When we say we can try all of options, that’s not actually possible.
9? What I think people mean is that we can use heuristics to pick the likely options first and try them, and then fall back to more esoteric options if the initial results don’t make sense.
10? That’s the difference between average run-time and worst case run-time.
11? The point that I am trying to make is that the worst case run-time for decoding an arbitrary binary file is pretty much unbounded, because there’s a ridiculous amount of choice possible.
12? Some examples of “image” formats that have large numbers of channels per “pixel” are things like RADAR / LIDAR sensors, e.g. it’s possible to have 5 channels per pixel for defining 3D coordinates (relative to the sensor), range, and intensity.
You actually ran into this problem yourself.
Summing up all of the possibilities above is left as an exercise for the reader, and we’ll call that sum
K
.Without loss of generality, let’s say our image was encoded as 3 bytes per pixel divided between 3 RGB color channels of 1 byte each.
Our 1280x720 image is actually 2764800 bytes as a binary file.
But since we’re decoding it from the other side, and we don’t know it’s 1280x720, when we’re staring at this pile of 2764800 bytes, we need to first assume how many bytes per pixel it is, so that we can divide the total bytes by the bytes per pixel to calculate the number of pixels.
Then, we need to test each possible resolutions as you’ve suggested.
The number of possible resolutions is the same as the number of divisors of the number of pixels. The equation for providing an upper bound is
exp(log(N)/log(log(N)))
[4], but the average number of divisors is approximatelylog(N)
.Oops, no it isn’t!
Files have headers! How large is the header? For a bitmap, it’s anywhere between 26 and 138 bytes. The JPEG header is at least 2 bytes. PNG uses 8 bytes. GIF uses at least 14 bytes.
Now we need to make the following choices:
Guess at how many bytes per pixel the data is.
Guess at the length of the header. (maybe it’s 0, there is no header!)
Calculate the factorization of the remaining bytes N for the different possible resolutions.
Hope that there isn’t a footer, checksum, or any type of other metadata hanging out in the sea of bytes. This is common too!
Once we’ve made our choices above, then we multiply that by
log(N)
for the number of resolutions to test, and then we’ll apply the suggested metric. Remember that when considering the different pixel formats and ways the color channel data could be represented, the number wasK
, and that’s what we’re multiplying bylog(N)
.What you’re describing here is actually similar to a common metric used in algorithms for automatically focusing cameras by calculating the contrast of an image, except for focusing you want to maximize contrast instead of minimize it.
The interesting problem with this metric is that it’s basically a one-way function. For a given image, you can compute this metric. However, minimizing this metric is not the same as knowing that you’ve decoded the image correctly. It says you’ve found a decoding, which did minimize the metric. It does not mean that is the correct decoding.
A trivial proof:
Consider an image and the reversal of that image along the horizontal axis.
These have the same metric.
So the same metric can yield two different images.
A slightly less trivial proof:
For a given “image” of
N
bytes of image data, there are2^(N*8)
possible bit patterns.Assuming the metric is calculated as an 8-byte IEEE 754 double, there are only
2^(8*8)
possible bit patterns.When
N > 8
, there are more bit patterns than values allowed in a double, so multiple images need to map to the same metric.The difference between our
2^(2764800*8)
image space and the2^64
metric is, uhhh,10^(10^6.8)
. Imagine10^(10^6.8)
pigeons. What a mess.[5]The metric cannot work as described. There will be various arbitrary interpretations of the data possible to minimize this metric, and almost all of those will result in images that are definitely not the image that was actually encoded, but did minimize the metric. There is no reliable way to do this because it isn’t possible. When you have a pile of data, and you want to reverse meaning from it, there is not one “correct” message that you can divine from it.[6] See also: numerology, for an example that doesn’t involve binary file encodings.
Even pretending that this metric did work, what’s the time complexity of it? We have to check each pixel, so it’s
O(N)
. There’s a constant factor for each pixel computation. How large is that constant? Let’s pretend it’s small and ignore it.So now we’ve got
K*O(N*log(N))
which is the time complexity of lots of useful algorithms, but we’ve got that awkward constantK
in the front. Remember that the constantK
reflects the number of choices for different bits per pixel, bits per channel, and the number of channels of data per pixel. Unfortunately, that constant is the one that was growing a rate best described as “absurd”. That constant is the actual definition of what it means to have no priors. When I said “you can generate arbitrarily many hypotheses, but if you don’t control what data you receive, and there’s no interaction possible, then you can’t rule out hypotheses”, what I’m describing is this constant.What I care about is the difference between:
1. Things that are computable.
2. Things that are computable efficiently.
These sets are not the same.
Capabilities of a superintelligent AGI lie only in the second set, not the first.
It is important to understand that a superintelligent AGI is not brute forcing this in the way that has been repeatedly described in this thread. Instead the superintelligent AGI is going to use a bunch of heuristics or knowledge about the provenance of the binary file, combined with access to the internet so that it can just lookup the various headers and features of common image formats, and it’ll go through and check all of those, and then if it isn’t any of the usual suspects, it’ll throw up metaphorical hands, and concede defeat. Or, to quote the title of this thread, intelligence isn’t magic.
This is often phrased as bits per pixel, because a variety of color depth formats use less than 8 bits per channel, or other non-byte divisions.
Refer to https://accidentallyquadratic.tumblr.com/ for examples.
A fun question to consider here becomes: where are the alpha bits stored? E.g. if we assume 3 bytes for RGB data, and then we have the 1 alpha bit, is each pixel taking up 9 bits, or are the pixels stored in runs of 8 pixels followed by a single “alpha” pixel with 8 bits describing the alpha channels of the previous 8 pixels?
https://terrytao.wordpress.com/2008/09/23/the-divisor-bound/
https://en.wikipedia.org/wiki/Pigeonhole_principle
The way this works for real reverse engineering is that we already have expectations of what the data should look like, and we are tweaking inputs and outputs until we get the data we expected. An example would be figuring out a camera’s RAW format by taking pictures of carefully chosen targets like an all white wall, or a checkerboard wall, or an all red wall, and using the knowledge of those targets to find patterns in the data that we can decode.
So, I want to note a few things. The original Eliezer post was intended to argue against this line of reasoning:
He didn’t worry about compute, because that’s not a barrier on the theoretical limit. And in his story, the entire human civilization had decades to work on this problem.
But you’re right, in a practical world, compute is important.
I feel like you’re trying to make this take as much compute as possible.
Since you talked about headers, I feel like I need to reiterate that, when we are talking to a neural network, we do not add the extra data. The goal is to communicate with the neural network, so we intentionally put it in easier to understand formats.
In the practical cases for this to come up (e.g. a nascent superintelligence figuring out physics faster than we expect), we probably will also be inputting data in an easy to understand format.
Similarly, I expect you don’t need to check every possible esoteric format. The likelihood of the image using an encoding like 61 bits per pixel, with 2 for red, 54 for green and 5 for blue is just, very low, a priori. I do admit I’m not sure if only using “reasonable” formats would cut down the possibilities into the computable realm (obviously depends on definitions of reasonable, though part of me feels like you could (with a lot of work) actually have an objective likeliness score of various encodings). But certainly it’s a lot harder to say that it isn’t than just saying “f(x) = (63 pick x), grows very fast.”
Though, since I don’t have a good sense for whether “reasonable” ones would be a more computable number, I should update in your direction. (I tried to look into something sort of analogous, and the most common 200 passwords cover a little over 4% of all used passwords, which, isn’t large enough for me to feel comfortable expecting that the most “likely” 1,000 formats would cover a significant quantity of the probability space, or anything.)
(Also potentially important. Modern neural nets don’t really receive things as a string of bits, but instead as a string of numbers, nicely split up into separate nodes. (yes, those numbers are made of bits, but they’re floating point numbers, and the way neural nets interact with them is through all the floating point operations, so I don’t think the neural net actually touches the bit representation of the number in any meaningful way.)
I don’t see the file, but again if you want to run this the mentioned dates from the other thread work best.
I am not exactly sure what you mean by the phrase “derive structure but not meaning”—a couple possibilities come to mind.
Let’s say I have a file that is composed of 1,209,600 bytes of 0s. That file was generated by taking pressure readings, in kPa, once per second on the Hubble Space Telescope for a two week period. If I said “this file is a sequence of 2^11 x 3^3 x 5^2 x 7 zeros”, would that be a minimal example of “deriving the structure but not the meaning”? If so I expect that situation to be pretty common.
If not, some further questions:
Is it coherent to “understand the meaning but not the structure”?
Would it be possible to go from “understands the structure but not the meaning” to “understands both” purely through receiving more data?
If not , what about through interaction
If not, what differences would you expect to observe between a world where I understood the structure but not the meaning of something and a world where I understood both?
I was describing a file that would fit your criteria but not be useful. I was explaining in bullet points all of the reasons why that file can’t be decoded without external knowledge.
I think that you understood the point though, with your example of data from the Hubble Space Telescope. One caveat: I want to be clear that the file does not have to be all zeroes. All zeroes would violate your criteria that the data cannot be compressed to less than 10% of it’s uncompressed size, since all zeroes can be trivially run-length-encoded.
But let’s look at this anyway.
You said the file is all zeroes, and it’s 1209600 bytes. You also said it’s pressure readings in kPa, taken once per second. You then said it’s 2^11 x 3^3 x 5^2 x 7 zeroes—I’m a little bit confused on where this number came from? That number is 9676800, which is larger than the file size in bytes. If I divide by 8, then I get the stated file size, so maybe you’re referring to the binary sequence of bits being either 0 or 1, and then on this hardware a byte is 8-bits, and that’s how those numbers connect.
In a trivial sense, yes, that is “deriving the structure but not the meaning”.
What I really meant was that we would struggle to distinguish between:
The file is 1209600 separate measurements, each 1-byte, taken by a single pressure sensor.
The file is 604800 measurements of 1 byte each from 2 redundant pressure sensors.
The file is 302400 measurements, each 4-bytes, taken by a single pressure sensor.
The file is 241920 measurements, each a 4-byte timestamp field and a 1-byte pressure sensor value.
Or considering some number of values, with some N-byte width:
The pressure sensor value is in kPa.
The pressure sensor value is in psi.
The pressure sensor value is in atmospheres.
The pressure sensor value is in lbf.
The pressure sensor value is in raw counts because it’s a direct ADC measurement, so it needs to be converted to the actual pressure via a linear transform.
We don’t know the ADC’s reference voltage.
Or the data sheet for this pressure sensor.
Your questions are good.
Probably not? I guess there’s like a weird “gotcha” answer to this question where I could describe what a format tells you, in words, but not show you the format itself, and maybe we could quibble that in such a scenario you’d understand “the meaning but not the structure”.
EDIT: I think I’ve changed my mind on this answer since posting—yes, there are scenarios where you would understand the meaning of something, but not necessarily the structure of it. A trivial example would be something like a video game save file. You know that some file represents your latest game, and that it allows you to continue and resume from where you left off. You know how that file was created (you pressed “Save Game”), and you know how to use it (press “Load Game”, select save game name), but without some amount of reverse-engineering, you don’t know the structure of it (assuming that the saved game is not stored as plain text). For non-video-game examples, something like a calibration file produced by some system where the system can both 1.) produce the calibration via some type of self-test or other procedure, and 2.) receive that calibration file. Or some type of system that can be configured by the user, and then you can save that configuration file to disc, so that you can upload it back to the system. Maybe you understand exactly what the configuration file will do, but you never bothered to learn the format of the file itself.
In general, no. The problem is that you can generate arbitrarily many hypotheses, but if you don’t control what data you receive, and there’s no interaction possible, then you can’t rule out hypotheses. You’d have to just get exceedingly lucky and repeatedly be given more data that is basically designed to be interpreted correctly, i.e. the data, even though it is in a binary format, is self-describing. These formats do exist by the way. It’s common for binary formats to include things like length prefixes telling you how many bytes follow some header. That’s the type of thing you wouldn’t notice with a single piece of data, but you would notice if you had a bunch of examples of data all sharing the same unknown schema.
Yes, this is how we actually reverse-engineer unknown binary formats. Almost always we have some proprietary software that can either produce the format, or read the format, and usually both. We don’t have the schema, and for the sake of argument let’s say we don’t want decompile the software. An example: video game save files that are stored using some ad-hoc binary schema.
What we generally do is start poking known values into the software and seeing what the output file looks like—like a specific date time, a naming something a certain field, or making sure some value or checkbox is entered in a specific way. Then we permute that entry and see how the file changes. The worse thing would be if almost the entire file changes, which tells us the file is either encrypted OR it’s a direct dump of RAM to disk from some data structure with undefined ordering, like a hash map, and the structure is determined by keys which we haven’t figured out yet.
Likewise, we do the reverse. We change the format and we plug it back into the software (or an external API, if we’re trying to understand how some website works). What we’re hoping for is an error message that gives us additional information—like we change some bytes and now it complains a length is invalid, that’s probably related to length. Or if we change some bytes and it says the time cannot be negative, then that might be a time related. Or we change any byte and it rejects the data as being invalid, whoops, probably encrypted again.
The key here is that we have the same problem as question 2 -- we can generate arbitrarily many hypotheses—but we have a solution now. We can design experiments and iteratively rule out hypotheses, over and over, until we figure out the actual meaning of the format—not just the structure of it, but what values actually represent.
Again, there are limits. For instance, there are reverse-engineered binary formats where the best we know is that some byte needs to be some constant value for some reason. Maybe it’s an internal software version? Who knows! We figured out the structure of that value—the byte at location 0x806 shall be 203 -- but we don’t know the meaning of it.
Hopefully the above has answered this.
Replying to your other post here:
My whole point is that “unprocessed sensor data” can be arbitrarily tied to hardware in ways that make it impossible to decode without knowledge of that particular hardware, e.g. ADC reference voltages, datasheets, or calibration tables.
The opposite. Bitmaps are much easier than RAW formats. A random RAW format, assuming no interaction with the camera hardware or software that can read/write that format, might as well be impossible. E.g. consider the description of a RAW format here. Would you have known that the way the camera hardware works involves pixels that are actually 4 separate color sensors arranged as (in this example) an RGBG grid (called a Bayer filter), and that to calculate the pixel colors for a bitmap you need to interpolate between 4 of those raw color sensors for each color channel on each pixel in the bitmap, and the resulting file size is going to be 3x larger? Or that the size of the output image is not the size of the sensor, so there’s dead pixels that need to be ignored during the conversion? Again, this is just describing some specific RAW format—other RAW formats work differently, because cameras don’t all use the same type of color sensor.
It was completely off-base.
BTW I see that someone did a very minimal version (with an array of floating point numbers generated by a pretty simple process) of this test, but I’m still up for doing a fuller test—my schedule now is a lot more clear than it was last month.
Which question are we trying to answer?
Is it possible to decode a file that was deliberately constructed to be decoded, without a priori knowledge? This is vaguely what That Alien Message is about, at least in the first part of the post where aliens are sending a message to humanity.
Is it possible to decode a file that has an arbitrary binary schema, without a priori knowledge? This is the discussion point that I’ve been arguing over with regard to stuff like decoding CAMERA raw formats, or sensor data from a hardware/software system. This is also the area where I disagree with That Alien Message—I don’t think that one-shot examples allow robust generalization.
I don’t think (1) is a particularly interesting question, because last weekend I convinced myself that the answer is yes, you can transfer data in a way that it can be decoded, with very few assumptions on the part of the receiver. I do have a file I created for this purpose. If you want, I’ll send you it.
I started creating a file for (2), but I’m not really sure how to gauge what is “fair” vs “deliberately obfuscated” in terms of encoding. I am conflicted. Even if I stick to encoding techniques I’ve seen in the real world, I feel like I can make choices on this file encoding that make the likelihood of others decoding it very low. That’s exactly what we’re arguing about on (2). However, I don’t think it will be particularly interesting or fun for people trying to decode it. Maybe that’s ok?
What are your thoughts?
I’m not sure either one quite captures exactly what I mean, but I think (1) is probably closer than (2), with the caveat that I don’t think the file necessarily has to be deliberately constructed to be decoded without a-priori knowledge, but it should be constructed to have as close as possible to a 1:1 mapping between the structure of the process used to capture the data and the structure of the underlying data stream.
I notice I am somewhat confused by the inclusion of camera raw formats in (2) rather than in (1) though—I would expect that moving from a file in camera raw format to a jpeg would move you substantially in the direction from (1) to (2).
It sounds like maybe you have something resembling “some sensor data of something unusual in an unconventional but not intentionally obfuscated format”? If so, that sounds pretty much exactly like what I’m looking for.
I think it’s fine if it’s not interesting or fun to decode because nobody can get a handle on the structure—if that’s the case, it will be interesting to see why we are not able to do that, and especially interesting if the file ends up looking like one of the things we would have predicted ahead of time would be decodable.
I’ve posted it here https://www.lesswrong.com/posts/BMDfYGWcsjAKzNXGz/eavesdropping-on-aliens-a-data-decoding-challenge.
Because computing isn’t that cheap, and bandwidth isn’t that expensive, and no one wants to get their GPU cluster running to save 500 bits of bandwidth.