I was describing a file that would fit your criteria but not be useful. I was explaining in bullet points all of the reasons why that file can’t be decoded without external knowledge.
I think that you understood the point though, with your example of data from the Hubble Space Telescope. One caveat: I want to be clear that the file does not have to be all zeroes. All zeroes would violate your criterion that the data cannot be compressed to less than 10% of its uncompressed size, since a run of zeroes can be trivially run-length-encoded.
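To make that concrete, here is a minimal check with an off-the-shelf compressor, using the 1,209,600-byte size from your example:

```python
import zlib

# An all-zero file the size you described (1,209,600 bytes).
data = bytes(1_209_600)

compressed = zlib.compress(data, level=9)
ratio = len(compressed) / len(data)
print(f"{len(data)} -> {len(compressed)} bytes ({ratio:.4%})")
# The run of zeroes collapses to well under 1% of the original size,
# so it would fail the "not compressible below 10%" criterion.
```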
But let’s look at this anyway.
You said the file is all zeroes, and it’s 1,209,600 bytes. You also said it’s pressure readings in kPa, taken once per second. You then said it’s 2^11 x 3^3 x 5^2 x 7 zeroes, and I’m a little confused about where that number came from. That number is 9,676,800, which is larger than the file size in bytes. If I divide by 8, I get the stated file size, so maybe you’re counting the individual bits (each either 0 or 1), and on this hardware a byte is 8 bits; that would be how those numbers connect.
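For what it’s worth, the arithmetic does line up that way. The last line below assumes one 1-byte reading per second, which is itself only one of the possible interpretations:

```python
n = 2**11 * 3**3 * 5**2 * 7
print(n)                   # 9676800
print(n // 8)              # 1209600, the stated file size in bytes
print(1_209_600 / 86_400)  # 14.0: if each byte were one reading per second,
                           # the file would span exactly two weeks
```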
In a trivial sense, yes, that is “deriving the structure but not the meaning”.
What I really meant was that we would struggle to distinguish between possibilities like the following (see the code sketch after these lists):
The file is 1,209,600 separate measurements, each 1 byte, taken by a single pressure sensor.
The file is 604,800 measurements of 1 byte each from 2 redundant pressure sensors.
The file is 302,400 measurements, each 4 bytes, taken by a single pressure sensor.
The file is 241,920 measurements, each a 4-byte timestamp field followed by a 1-byte pressure sensor value.
Or considering some number of values, with some N-byte width:
The pressure sensor value is in kPa.
The pressure sensor value is in psi.
The pressure sensor value is in atmospheres.
The pressure sensor value is in bar.
The pressure sensor value is in raw counts because it’s a direct ADC measurement, so it needs to be converted to the actual pressure via a linear transform.
We don’t know the ADC’s reference voltage.
Or the data sheet for this pressure sensor.
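To make the ambiguity concrete, here is a minimal sketch. The record layouts, the endianness, and the ADC constants below are all made up for illustration; the point is precisely that nothing in the bytes tells you which, if any, is right:

```python
import struct

blob = bytes(1_209_600)  # the all-zero file from the example

# The same buffer parses cleanly under several incompatible layouts.
as_u8 = struct.unpack(f"<{len(blob)}B", blob)            # 1,209,600 1-byte readings
as_u32 = struct.unpack(f"<{len(blob) // 4}I", blob)      # 302,400 4-byte readings
as_ts_plus_val = struct.iter_unpack("<IB", blob)         # 241,920 (timestamp, value) records
print(len(as_u8), len(as_u32), sum(1 for _ in as_ts_plus_val))

# And even with the layout fixed, the value -> pressure mapping is unknown.
# Hypothetical 12-bit ADC with a 3.3 V reference and a linear sensor transfer
# function; none of these constants can be recovered from the bytes alone.
def counts_to_kpa(counts, vref=3.3, bits=12, v_offset=0.5, kpa_per_volt=250.0):
    volts = counts * vref / (2**bits - 1)
    return (volts - v_offset) * kpa_per_volt

print(counts_to_kpa(0))  # with these made-up constants, 0 counts maps to -125 kPa;
                         # it could just as well mean 0 kPa or "sensor unplugged"
```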
Your questions are good.
> 1. Is it coherent to “understand the meaning but not the structure”?
Probably not? I guess there’s a weird “gotcha” answer to this question where I could describe what a format tells you, in words, but not show you the format itself, and maybe we could quibble that in such a scenario you’d understand “the meaning but not the structure”.
EDIT: I think I’ve changed my mind on this answer since posting. Yes, there are scenarios where you would understand the meaning of something but not necessarily its structure. A trivial example is a video game save file. You know that the file represents your latest game and that it allows you to resume from where you left off. You know how the file was created (you pressed “Save Game”) and how to use it (press “Load Game”, select the save by name), but without some amount of reverse engineering you don’t know its structure (assuming the save is not stored as plain text). For a non-video-game example, consider a calibration file produced by a system that can both (1) generate the calibration via some self-test or other procedure and (2) accept that calibration file back. Or some system that can be configured by the user, where you can save the configuration to disk so that you can upload it back later. Maybe you understand exactly what the configuration file will do, but you never bothered to learn the format of the file itself.
> 2. Would it be possible to go from “understands the structure but not the meaning” to “understands both” purely through receiving more data?
In general, no. The problem is that you can generate arbitrarily many hypotheses, but if you don’t control what data you receive, and there’s no interaction possible, then you can’t rule out hypotheses. You’d have to just get exceedingly lucky and repeatedly be given more data that is basically designed to be interpreted correctly, i.e. the data, even though it is in a binary format, is self-describing. These formats do exist by the way. It’s common for binary formats to include things like length prefixes telling you how many bytes follow some header. That’s the type of thing you wouldn’t notice with a single piece of data, but you would notice if you had a bunch of examples of data all sharing the same unknown schema.
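As an illustration of the kind of self-describing structure I mean, here is a made-up length-prefixed record layout (not any real protocol). With many records sharing the schema, “the first two bytes predict where the next record starts” becomes a testable hypothesis; with a single record it wouldn’t stand out:

```python
import struct

def make_record(payload: bytes) -> bytes:
    # Made-up format: 2-byte little-endian length, then that many payload bytes.
    return struct.pack("<H", len(payload)) + payload

records = [make_record(bytes([i]) * n) for i, n in enumerate([3, 10, 7, 42, 5])]
stream = b"".join(records)

# Walk the stream using the hypothesized length prefix.
offset = 0
while offset < len(stream):
    (length,) = struct.unpack_from("<H", stream, offset)
    print(f"record of {length} bytes at offset {offset}")
    offset += 2 + length
```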
> 3. If not, what about through interaction?
Yes, this is how we actually reverse-engineer unknown binary formats. Almost always we have some proprietary software that can either produce the format or read it, and usually both. We don’t have the schema, and for the sake of argument let’s say we don’t want to decompile the software. An example: video game save files stored in some ad-hoc binary schema.
What we generally do is start poking known values into the software and seeing what the output file looks like: enter a specific date and time, give something a particular name, or set some value or checkbox in a specific way. Then we permute that entry and see how the file changes. The worst case is when almost the entire file changes, which tells us the file is either encrypted or is a direct dump of RAM to disk from some data structure with undefined ordering, like a hash map, where the layout is determined by keys we haven’t figured out yet.
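A minimal sketch of that direction. The two byte strings stand in for two dumps saved from the software where the only thing we changed was a known field; real dumps would of course come from disk:

```python
# Two dumps of the same save, where the only thing we changed in the
# software was the character name ("AAAA" vs "BBBB").
a = bytes.fromhex("13 37 00 04 41 41 41 41 00 00 2a")
b = bytes.fromhex("13 37 00 04 42 42 42 42 00 00 2a")

diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print(f"{len(diffs)} differing bytes at offsets {diffs}")
# A small, contiguous cluster of offsets (here 4..7) suggests we found the
# field we changed; differences scattered across the whole file would
# instead suggest encryption or a RAM-dump-style format.
```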
Likewise, we do the reverse. We modify the file and plug it back into the software (or an external API, if we’re trying to understand how some website works). What we’re hoping for is an error message that gives us additional information. If we change some bytes and it now complains that a length is invalid, those bytes are probably a length field. If we change some bytes and it says the time cannot be negative, those bytes might be time-related. If we change any byte at all and it rejects the data as invalid, whoops, probably encrypted again.
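And a sketch of the reverse direction. Here try_load is a placeholder for whatever actually consumes the file (the target program, an external API), and the sample bytes are made up:

```python
def try_load(candidate: bytes) -> str:
    # Placeholder: in practice this hands the mutated bytes to whatever
    # consumes the format (the game, a proprietary tool, a web API) and
    # returns the resulting error message. Here it just pretends to accept.
    return "OK"

original = bytes.fromhex("13 37 00 04 41 41 41 41 00 00 2a")  # stand-in for the real file

# Flip one byte at a time and record how the consumer reacts. Offsets that
# trigger "invalid length" or "time cannot be negative" style errors hint at
# meaning; if every flip produces the same generic rejection, suspect
# encryption or a whole-file checksum.
for offset in range(len(original)):
    mutated = bytearray(original)
    mutated[offset] ^= 0xFF
    print(offset, try_load(bytes(mutated)))
```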
The key here is that we have the same problem as in question 2 (we can generate arbitrarily many hypotheses), but now we have a solution. We can design experiments and iteratively rule out hypotheses, over and over, until we figure out the actual meaning of the format: not just its structure, but what the values actually represent.
Again, there are limits. For instance, there are reverse-engineered binary formats where the best we know is that some byte needs to be some constant value for some reason. Maybe it’s an internal software version? Who knows! We figured out the structure of that value (the byte at location 0x806 shall be 203), but we don’t know the meaning of it.
> 4. If not, what differences would you expect to observe between a world where I understood the structure but not the meaning of something and a world where I understood both?
Hopefully the above has answered this.
Replying to your other post here:
> I don’t think this algorithm could decode arbitrary data in a reasonable amount of time. I think it could decode some particularly structured types of data, and I think “fairly unprocessed sensor data from a very large number of nearly identical sensors” is one of those types of data.
My whole point is that “unprocessed sensor data” can be arbitrarily tied to hardware in ways that make it impossible to decode without knowledge of that particular hardware, e.g. ADC reference voltages, datasheets, or calibration tables.
> RAW I’d expect is better than a bitmap, assuming no encryption step, on the hypothesis that more data is better.
The opposite. Bitmaps are much easier than RAW formats. A random RAW format, assuming no interaction with the camera hardware or with software that can read/write that format, might as well be impossible. E.g. consider the description of a RAW format here. Would you have known that the camera hardware uses pixels that are actually 4 separate color sensors arranged (in this example) as an RGBG grid, called a Bayer filter? Or that to calculate the pixel colors for a bitmap you need to interpolate between 4 of those raw color sensors for each color channel of each pixel, and that the resulting file ends up roughly 3x larger? Or that the size of the output image is not the size of the sensor, so there are dead pixels that need to be ignored during the conversion? Again, this is just describing one specific RAW format. Other RAW formats work differently, because cameras don’t all use the same type of color sensor.
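To give a flavor of what that conversion involves, here is a minimal half-resolution demosaic sketch. It assumes an RGGB arrangement and skips everything a real converter does (full-resolution interpolation, black-level and white-balance correction, masking dead or out-of-frame sensor sites), so treat it as an illustration of the idea rather than any particular camera’s pipeline:

```python
import numpy as np

def demosaic_half_res(raw: np.ndarray) -> np.ndarray:
    """raw: (H, W) mosaic with a 2x2 RGGB pattern; returns (H//2, W//2, 3) RGB."""
    r = raw[0::2, 0::2]                    # red sites
    g1 = raw[0::2, 1::2]                   # green sites (even rows)
    g2 = raw[1::2, 0::2]                   # green sites (odd rows)
    b = raw[1::2, 1::2]                    # blue sites
    g = (g1.astype(np.float32) + g2) / 2   # average the two greens per 2x2 block
    return np.stack([r, g, b], axis=-1)

# Example: a fake 8x8 sensor dump. Note the RGB output carries 3 values per
# pixel, which is where the roughly 3x size increase over the mosaic comes from.
mosaic = np.arange(64, dtype=np.uint16).reshape(8, 8)
print(demosaic_half_res(mosaic).shape)  # (4, 4, 3)
```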
> The point as I see it is more about whether it’s possible with a halfway reasonable amount of compute or whether That Alien Message was completely off-base.
It was completely off-base.
BTW, I see that someone did a very minimal version of this test (with an array of floating-point numbers generated by a pretty simple process), but I’m still up for doing a fuller test. My schedule is a lot clearer now than it was last month.
Which question are we trying to answer?
1. Is it possible to decode a file that was deliberately constructed to be decoded, without a priori knowledge? This is vaguely what That Alien Message is about, at least in the first part of the post, where aliens are sending a message to humanity.
2. Is it possible to decode a file that has an arbitrary binary schema, without a priori knowledge? This is the discussion point that I’ve been arguing over with regard to things like decoding camera RAW formats, or sensor data from a hardware/software system. This is also where I disagree with That Alien Message: I don’t think one-shot examples allow robust generalization.
I don’t think (1) is a particularly interesting question, because last weekend I convinced myself that the answer is yes: you can transfer data in a way that it can be decoded, with very few assumptions on the part of the receiver. I do have a file I created for this purpose. If you want, I’ll send it to you.
I started creating a file for (2), but I’m not really sure how to gauge what is “fair” versus “deliberately obfuscated” in terms of encoding, and I’m conflicted. Even if I stick to encoding techniques I’ve seen in the real world, I feel like I can make choices in this file’s encoding that make the likelihood of others decoding it very low. That’s exactly what we’re arguing about in (2). However, I don’t think it will be particularly interesting or fun for people trying to decode it. Maybe that’s ok?
What are your thoughts?
I’m not sure either one quite captures exactly what I mean, but I think (1) is probably closer than (2), with the caveat that I don’t think the file necessarily has to be deliberately constructed to be decoded without a priori knowledge; rather, it should be constructed to have as close as possible to a 1:1 mapping between the structure of the process used to capture the data and the structure of the underlying data stream.
I notice I am somewhat confused by the inclusion of camera raw formats in (2) rather than in (1) though—I would expect that moving from a file in camera raw format to a jpeg would move you substantially in the direction from (1) to (2).
It sounds like maybe you have something resembling “some sensor data of something unusual in an unconventional but not intentionally obfuscated format”? If so, that sounds pretty much exactly like what I’m looking for.
> However, I don’t think it will be particularly interesting or fun for people trying to decode it. Maybe that’s ok?
I think it’s fine if it’s not interesting or fun to decode because nobody can get a handle on the structure—if that’s the case, it will be interesting to see why we are not able to do that, and especially interesting if the file ends up looking like one of the things we would have predicted ahead of time would be decodable.
I’ve posted it here https://www.lesswrong.com/posts/BMDfYGWcsjAKzNXGz/eavesdropping-on-aliens-a-data-decoding-challenge.