I did not actually read the linked thread until now; I came across this post on the front page and thought it was a potentially interesting challenge.
Regarding “in the concept of the fiction”, I think this piece of data is way too human to be convincing. The noise is effectively a ‘gotcha, sprinkle /dev/random into the data’.
Why sample with 24 bits of precision if the source image only has 8 bits of precision? It shows. Then why add only <11 bits of noise, and uniform noise at that? It could work well if you had a 16-bit lossless source image, or even an approximation of one, but the way this image is constructed is way too artificial. (And why not Gaussian noise, or any other more natural kind of noise? Uniform noise pretty much never happens in practice.) One can also entirely separate the noise from the source data you used, because 8 + 11 < 24.
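To illustrate the separability point, here is a minimal sketch. It assumes, purely for illustration, that each 24-bit sample is the 8-bit source value placed in the top bits plus a uniform noise term below 2^11. The actual construction of the file may differ, and all names here are hypothetical.

```python
import random

SOURCE_BITS = 8    # precision of the source image
NOISE_BITS = 11    # upper bound on the width of the added noise
SAMPLE_BITS = 24   # precision of each stored sample
SHIFT = SAMPLE_BITS - SOURCE_BITS  # 16: assumed position of the source value

def make_sample(source, noise):
    """Combine an 8-bit source value and a small noise term (assumed layout)."""
    return (source << SHIFT) + noise

def split_sample(sample):
    """Recover (source, noise). Exact because noise < 2**NOISE_BITS <= 2**SHIFT."""
    return sample >> SHIFT, sample & ((1 << SHIFT) - 1)

rng = random.Random(0)
pairs = [
    (rng.randrange(1 << SOURCE_BITS), rng.randrange(1 << NOISE_BITS))
    for _ in range(1000)
]
recovered = [split_sample(make_sample(s, n)) for s, n in pairs]
print(recovered == pairs)
```

Under this assumed layout the noise never carries into the source bits, so both parts come back exactly. A real file would first need its scaling and offset worked out before a split like this applies.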
JPEG-caused block artifacts were visible while I was analyzing the planes of the image, which is why I thought the Bayer filter was possibly 4x4 pixels in size. I believe you likely downsampled the image from a JPEG at approximately 2000x1200 resolution, which does affect analysis and breaks the fiction that this is raw sensor data from an alien civilization.
With these kinds of flaws I do believe cracking the PRNG is within limits since the data is already really flawed.
(1) is possibly true. At least it’s true in this case, although in practice understanding the structure of the data doesn’t actually help very much vs some of the best general purpose compressors from the PAQ family.
It doesn’t help that lossless image compression algorithms kinda suck. I can often get better results by using zpaq on a NetPBM file than by using a special-purpose format like PNG or even lossless WebP (although the latter is usually at least somewhat competitive with the zpaq results).
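A minimal sketch of that workflow, for anyone who wants to try it: dump the pixels as a binary NetPBM (P5) file, then hand the bytes to a general-purpose compressor. Python's standard library has no zpaq binding, so `lzma` is used here as a stand-in, and plain `zlib` is only a rough proxy for PNG (which also applies per-row filters before DEFLATE). The synthetic image and the `to_pgm` helper are hypothetical.

```python
import lzma
import zlib

def to_pgm(pixels, width, height, maxval=255):
    """Serialize 8-bit grayscale pixels as a binary NetPBM (P5) file."""
    assert len(pixels) == width * height
    header = f"P5\n{width} {height}\n{maxval}\n".encode("ascii")
    return header + bytes(pixels)

# Synthetic gradient image, only so the sketch runs without an input file.
width, height = 64, 64
pixels = [(x + y) % 256 for y in range(height) for x in range(width)]
pgm = to_pgm(pixels, width, height)

sizes = {
    "raw pgm": len(pgm),
    "zlib (DEFLATE, rough proxy for PNG)": len(zlib.compress(pgm, 9)),
    "lzma (stand-in for zpaq)": len(lzma.compress(pgm, preset=9)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

The numbers from a toy gradient say nothing about real photos. The point is only that the NetPBM route needs no image-specific tooling.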
(2) I’d say my decompressor would contain useful info about the structure of the data, or at least the file format itself, however...
...it would not contain any useful representation of the pictured piece of bismuth. The lossless compression requirement hurts a lot. Reverse rendering techniques for various representations do exist, but they are either lossy, or larger than the source data.
Constructing and raytracing a NeRF / SDF / voxel grid / whatever might possibly be competitive if you had dozens (or maybe hundreds) of shots of the same bismuth piece at different angles, but it really doesn’t pay for a single image, especially at this quality, especially with all the JPEG artifacts that leaked through, and so on.
I feel like this is a bit of a wasted opportunity. You could have chosen a lot of different modalities of data, even something like a stream of data from the IMU sensor in your phone as you walk around the house. You would not need to add any artificial noise; it would already be there in the source data. Modeling that could actually be interesting (if the sample rate on the IMU was high enough for a physics-based model to help).
I also think that viewing the data ‘wrongly’ and figuring out something about it despite that is a feature, not a bug.
Updates on best results so far:
General purpose compression on the original file, using cmix:
\time ./cmix -c /ztmp/mystery_file_difficult.bin /ztmp/mystery_file_difficult.cmix
Detected block types: DEFAULT: 100.0%
2100086 bytes -> 760584 bytes in 5668.77 s.
cross entropy: 2.897
5566.69user 105.07system 1:39:57elapsed 94%CPU (0avgtext+0avgdata 18968788maxresident)k
30749016inputs+5592outputs (3812631major+12008307minor)pagefaults 0swaps
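As a sanity check on the log above, the sizes can be turned into bits per input byte. This assumes the reported cross entropy is measured in bits per byte, which is an interpretation of the log rather than something stated in it.

```python
input_bytes = 2100086   # original file size from the log
output_bytes = 760584   # cmix output size from the log

bits_per_byte = output_bytes * 8 / input_bytes
ratio = output_bytes / input_bytes

print(f"{bits_per_byte:.3f} bits per input byte")
print(f"{ratio:.1%} of original size")
```

The first figure rounds to 2.897, matching the reported cross entropy.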
Results with knowledge about the contents of the file: https://gist.github.com/mateon1/f4e2b8e3fad338405fa793fb155ebf29 (spoilers).
Summary:
The best general-purpose method after massaging the structure of the data manages 713248 bytes.
The best purpose-specific method manages to compress the data, minus headers, to 712439 bytes.
The button shows up for me despite low karma. I have looked through the client-side code, and found this snippet:
This probably means Mawrak is encountering some sort of JavaScript error, since the code indicates the button should only reject the launch attempt after you press it, not before.