Let’s say I make a prediction that Barack Obama is a werewolf. I don’t want anyone to know I’ve made this prediction, because otherwise the Secret Werewolf Police will come and eat me, but I do want to lord it over everyone after the fact.
I take the string “Barack Obama is a werewolf” and I put it through a hashing function, (for purposes of this example, the hashing function MD5). This produces the output 37ecb0a3164e6422bedc0f8db82e45ec. The original string about Barack Obama is not recoverable from this output, because MD5 is a lossy function, but anyone else putting that string through the MD5 hashing function will get the same output. The hash output is like a fingerprint for the original string.
So if I put 37ecb0a3164e6422bedc0f8db82e45ec in a public place a year before Barack Obama transforms into a werewolf on live television, after the fact I can give people the original string and they can verify it for themselves against the hash output.
I use a Chrome plugin for most tasks that involve basic encoding of text values. (If I’m honest, I mostly use it for rot13, and more exotic uses aren’t that common). There are also online hash generators for different hashing functions. Some popular hashing functions for this purpose are MD5 and SHA1.
MD5 is a poor choice, though it’s probably OK for this sort of very short plain-text prediction. SHA-1 is looking fairly sha-ky too these days, but again is probably fine for this purpose. If you’re picking a hash function for actual cryptographic purposes, SHA-256 would be a better choice.
“Barack Obama is a werewolf, and on an unrelated note, here is my favourite block of nonsense text:
bVcDvX2JLDLthTNhOeOvBXiETMysmBR5LuuPrgM6nCOIGx
i8zomRQXH5RMQjqTtSj8vWDCXIyBH7NrQu96bozpaGo5g7k
JeTUeigwK7YBvFJxCIcbmMaAT213sbi8l4TfWmU3BUFNQaM
nAudBqKc0vPEBYRIl1TsQYTCLk0jT2EhP8Qx9wOZcxJ1o30w”
(ETA: this is for entertainment purposes only, and is not intended as any kind of counterpoint to gjm’s entirely sound and valuable comment)
I suggest adding a non-trivial random string to the original text. Otherwise if someone else makes the same prediction, your secret is immediately known to vim, and the information that you two are making the same prediction leaks to everyone.
I encrypt messages for a another, goofier purpose. One of the people I am encrypting from is a compsci professor.
I use a Vigenere cipher, which should beat everything short of the Secret Werewolf Police, and possibly them, too. (It is, however, more crackable than a proper salted, hashed output.)
In a Vigenere, the letters in your input are moved by the numerical equivalent of the key, and the key repeats. Example:
Secret Statement/lie: Cats are nice.
Key: ABC
New, coded statement: dcwt (down 1, 2, 3, 1) cuf ildg. Now, I recommend using long keys and spacing the output in five letter blocks to prevent easier soliving.
Again Vigenere’s are potentially crackable, but they are very hard. It’s easier for the werewolf police to just come and eat anyone who puts up hashed or Vigenere ciphered predictions.
Vigeneres are potentially crackable, but they are very hard.
Unless your text is very short or your key very long, they are much easier than you apparently think. I have written programs that do a pretty good job of it.
(As MrMind says, if your key is as long as your text then this is a one-time pad, which indeed is uncrackable provided you have a reliable channel for sharing the one-time pad and don’t make operational mistakes like reusing the key. But that’s not a very common situation.)
I finally did get round to taking a look at this, and it isn’t yielding to the usual methods. In fact, it’s sufficiently unyielding that I am wondering whether it’s really the result of Vigenere-ciphering something that resembles ordinary English text with a key of reasonable length. I find the following:
The text is 146 characters long. I would expect that, with “ordinary” plaintext, to be enough to identify a key of at most about 7 characters without too much pain. I actually tried lengths up to 10.
For no key length k does the distribution of { (a-b) mod 26 : a,b are characters at the same position mod k } much resemble the distribution of differences mod 26 of randomly chosen letters with typical English letter frequencies.
For no key length k does the result of the following procedure produce something that looks as if most of the shifts are correct:
For each offset, count letter frequencies, and choose a shift that maximizes the sum of log(typical_prob) where typical_prob are those typical English letter frequencies again.
JRMayne, if you happen to be reading this and happen to have either a record or a memory of the message and how you encrypted it: can you confirm that (1) this really is the output of a Vigenere cipher, (2) the plaintext isn’t (statistically) outrageously unlike ordinary English, and (3) the key length is at most, let’s say, 10?
(If the key is much longer than that, there isn’t enough information for these simple methods to work with. Perhaps, e.g., doing a more expensive search and looking at digraph frequencies might be worth considering, but I’m too lazy.)
Gah. I don’t remember the solution or the key. And I just last week had a computer crash (replaced it), so, I’ve got lots of nothing. Sorry.
I am sure of (1) and (2). I don’t remember (3), and it’s very possible it’s longer than 10 (though probably not much longer.) But I don’t remember clearly. That’s the best I can do.
Offsets 10, 8, 25, 6, 6, 14, 21, 11, 12, 13, 11, 12, 15, 21, 1, 7 (that’s a key length of 16) yield the following decrypted plaintext: “failurearrivedquicklyatgjmshandsastheirprogramdefeatedmedepressinglyquicklyimustconcedethepointiwasoverconfidentandunderinformedandgjmfreakingwins”.
The actual process involved manual intervention, though I think a smarter program could have done it automatically. It went like this:
Consider key lengths up to 20.
Instead of picking the single “best” shift at each offset within the key, consider all shifts that come within 1 unit of best. (Meaning a factor of e in estimated probability.)
Display all the decryptions resulting from these options, unless there are more than 300 in which case display 50 random ones.
This produced a big long list of candidates. None of them was right, but some of them looked more credible than others. Keylength 16 seemed to let us do the best, so I focused on that and picked one of the better-looking candidates, which looked like this:
which has lots of plausible bits in it. It looks like it should maybe start with “failure”, a plausible enough word in this context, so I tweaked the fifth offset (yes! lots of other things got better too), and continued tweaking one by one as likely words and word-fragments jumped out at me. After a few tweaks I had the decryption above.
A program clever enough to do this by itself would need to have some inkling about digraphs or trigraphs or something, and then it would need some kind of iterative search procedure rather than just picking best offsets independently for each letter. Not terribly difficult, but it seemed easier to do something simpler-minded.
[Edited to add: I think the sequence of giveaway words for the tweaks was: “failure”, “quickly” #2, “overconfident”, done.]
I won’t actually try to crack this since writing the program would take more time than I’m inclined to spend here, but the principle is simple: Do frequency analysis based on assuming the key is a particular length. If taking every 5th character gives you a letter frequency that resembles English, but taking every 4th or 6th character does not, then the key is 5 letters long. Proceed from there.
Yup. (One handy trick is to take message—shift(message,k) mod 26 for candidate values of k, and see which has a histogram that looks most like that of (English text minus other English text). But it’s probably actually better to find good candidate shifts for each evenly-spaced subset of the text and look for ones that match up.)
JRMayne’s latest sample is still pretty short. It might require human cleverness to solve. If I have time tomorrow (and unless someone else has done it first) I may have a go.
You understood my question better than I did. Thank you.
Is the following paragraph correct?:
If I had unlimited computing power, I could search for all the inputs that return 37ecb0a3164e6422bedc0f8db82e45ec from the MD5 function. Then I could search those inputs and see which ones are meaningful sentences in English, and then make an educated guess at what the message is. But in reality, it would take too much computing power to find even a single string that returns 37ecb0a3164e6422bedc0f8db82e45ec.
It’s not quite correct, but you’ve broadly got the thrust of it.
When two different inputs produce the same output from a hashing function, this is called a hash collision. Finding collisions in the SHA256 hashing function is part of how bitcoin mining works. It is very computationally expensive (which is kind of the point, re: bitcoin mining), but it’s certainly tractable to find some input that generates the same output as another one.
If you were to try and search the space of all possible inputs for MD5, you’d quickly(ish) find an input that collided with the Obama Werewolf input, but it’d be garbage. If you had some system for quickly and reliably generating comprehensible English sentences, and just searched that space, you’d probably find a comprehensible English sentence that collided with it, but it would almost certainly not be the Obama Werewolf sentence. It’s worth noting that MD5 will take an input of any length, and the space of possible comprehensible English sentences of unbounded length is basically infinite.
For short snippets of text, it’s hard to find two comprehensible English sentences that collide under MD5, but in the link gjm provides above, there’s a method for forcing the MD5 hash of a PDF by exploiting how MD5 works. For this reason and others, if you’re doing any cryptographic heavy lifting, you probably want to use something more robust than MD5.
If you were to try and search the space of all possible inputs for MD5, you’d quickly(ish) find an input that collided with the Obama Werewolf input, but it’d be garbage.
Really? Last I checked, the best known preimage attack against MD5 was too slow to be practical. Finding collisions is drastically easier, though I don’t know any method for doing it with arbitrary plain-text English sentences.
You said you use this for ROT13ing things. How does that work? Suppose I wanted to make a Facebook post that gave my friends the option of knowing Obama is a werewolf, but also gave them the option for them not to know this if they’d rather be surprised later.
The Chrome plugin I linked to lets you carry out basic encoding functions in the browser, which includes stuff like rot13, URL-safe strings, a couple of hashing functions, etc.
Let’s make sure we’re talking about cryptographic hashing functions.
Let’s say I make a prediction that Barack Obama is a werewolf. I don’t want anyone to know I’ve made this prediction, because otherwise the Secret Werewolf Police will come and eat me, but I do want to lord it over everyone after the fact.
I take the string “Barack Obama is a werewolf” and I put it through a hashing function, (for purposes of this example, the hashing function MD5). This produces the output 37ecb0a3164e6422bedc0f8db82e45ec. The original string about Barack Obama is not recoverable from this output, because MD5 is a lossy function, but anyone else putting that string through the MD5 hashing function will get the same output. The hash output is like a fingerprint for the original string.
So if I put 37ecb0a3164e6422bedc0f8db82e45ec in a public place a year before Barack Obama transforms into a werewolf on live television, after the fact I can give people the original string and they can verify it for themselves against the hash output.
I use a Chrome plugin for most tasks that involve basic encoding of text values. (If I’m honest, I mostly use it for rot13, and more exotic uses aren’t that common). There are also online hash generators for different hashing functions. Some popular hashing functions for this purpose are MD5 and SHA1.
Does this answer your question?
MD5 is a poor choice, though it’s probably OK for this sort of very short plain-text prediction. SHA-1 is looking fairly sha-ky too these days, but again is probably fine for this purpose. If you’re picking a hash function for actual cryptographic purposes, SHA-256 would be a better choice.
“Barack Obama is a werewolf, and on an unrelated note, here is my favourite block of nonsense text: bVcDvX2JLDLthTNhOeOvBXiETMysmBR5LuuPrgM6nCOIGx i8zomRQXH5RMQjqTtSj8vWDCXIyBH7NrQu96bozpaGo5g7k JeTUeigwK7YBvFJxCIcbmMaAT213sbi8l4TfWmU3BUFNQaM nAudBqKc0vPEBYRIl1TsQYTCLk0jT2EhP8Qx9wOZcxJ1o30w”
(ETA: this is for entertainment purposes only, and is not intended as any kind of counterpoint to gjm’s entirely sound and valuable comment)
I suggest adding a non-trivial random string to the original text. Otherwise if someone else makes the same prediction, your secret is immediately known to vim, and the information that you two are making the same prediction leaks to everyone.
Read on salting for more information.
I encrypt messages for a another, goofier purpose. One of the people I am encrypting from is a compsci professor.
I use a Vigenere cipher, which should beat everything short of the Secret Werewolf Police, and possibly them, too. (It is, however, more crackable than a proper salted, hashed output.)
In a Vigenere, the letters in your input are moved by the numerical equivalent of the key, and the key repeats. Example:
Secret Statement/lie: Cats are nice. Key: ABC
New, coded statement: dcwt (down 1, 2, 3, 1) cuf ildg. Now, I recommend using long keys and spacing the output in five letter blocks to prevent easier soliving.
You can do this online:
http://sharkysoft.com/vigenere/
This will transmute “It seems unlikely the werewolf police will catch you,” with the key “The movie ends the same way for all of us JRM.” to:
Cbxrt ibzsz mdytd minyzw wxqdt cjeph bfqhr leuqh oxvbg tn.
(Letter grouping by me.)
Again Vigenere’s are potentially crackable, but they are very hard. It’s easier for the werewolf police to just come and eat anyone who puts up hashed or Vigenere ciphered predictions.
In the case above, since the key is as long as the plaintext, it’s no more Vigenere, it’s one time pad.
Unless your text is very short or your key very long, they are much easier than you apparently think. I have written programs that do a pretty good job of it.
(As MrMind says, if your key is as long as your text then this is a one-time pad, which indeed is uncrackable provided you have a reliable channel for sharing the one-time pad and don’t make operational mistakes like reusing the key. But that’s not a very common situation.)
I agree that I made my key too long so it’s a one-time pad. You’re right.
“Much easier”? With or without word lengths?OK, no obligation, but I didn’t make this too brutal:
Vsjfodjpfexjpipnyulfsmyvxzhvlsclqkubyuwefbvflrcxvwbnyprtrrefpxrbdymskgnryynwxzrmsgowypjivrectssbmst ipqwrcauwojmmqfeohpjgwauccrdwqfeadykgsnzwylvbdk.
(Again, no obligation to play, and no inference should be taken against gjm’s hypothesis if they decline.)
I finally did get round to taking a look at this, and it isn’t yielding to the usual methods. In fact, it’s sufficiently unyielding that I am wondering whether it’s really the result of Vigenere-ciphering something that resembles ordinary English text with a key of reasonable length. I find the following:
The text is 146 characters long. I would expect that, with “ordinary” plaintext, to be enough to identify a key of at most about 7 characters without too much pain. I actually tried lengths up to 10.
For no key length k does the distribution of { (a-b) mod 26 : a,b are characters at the same position mod k } much resemble the distribution of differences mod 26 of randomly chosen letters with typical English letter frequencies.
For no key length k does the result of the following procedure produce something that looks as if most of the shifts are correct:
For each offset, count letter frequencies, and choose a shift that maximizes the sum of log(typical_prob) where typical_prob are those typical English letter frequencies again.
JRMayne, if you happen to be reading this and happen to have either a record or a memory of the message and how you encrypted it: can you confirm that (1) this really is the output of a Vigenere cipher, (2) the plaintext isn’t (statistically) outrageously unlike ordinary English, and (3) the key length is at most, let’s say, 10?
(If the key is much longer than that, there isn’t enough information for these simple methods to work with. Perhaps, e.g., doing a more expensive search and looking at digraph frequencies might be worth considering, but I’m too lazy.)
Gah. I don’t remember the solution or the key. And I just last week had a computer crash (replaced it), so, I’ve got lots of nothing. Sorry.
I am sure of (1) and (2). I don’t remember (3), and it’s very possible it’s longer than 10 (though probably not much longer.) But I don’t remember clearly. That’s the best I can do.
Drat.
Offsets 10, 8, 25, 6, 6, 14, 21, 11, 12, 13, 11, 12, 15, 21, 1, 7 (that’s a key length of 16) yield the following decrypted plaintext: “failurearrivedquicklyatgjmshandsastheirprogramdefeatedmedepressinglyquicklyimustconcedethepointiwasoverconfidentandunderinformedandgjmfreakingwins”.
The actual process involved manual intervention, though I think a smarter program could have done it automatically. It went like this:
Consider key lengths up to 20.
Instead of picking the single “best” shift at each offset within the key, consider all shifts that come within 1 unit of best. (Meaning a factor of e in estimated probability.)
Display all the decryptions resulting from these options, unless there are more than 300 in which case display 50 random ones.
This produced a big long list of candidates. None of them was right, but some of them looked more credible than others. Keylength 16 seemed to let us do the best, so I focused on that and picked one of the better-looking candidates, which looked like this:
which has lots of plausible bits in it. It looks like it should maybe start with “failure”, a plausible enough word in this context, so I tweaked the fifth offset (yes! lots of other things got better too), and continued tweaking one by one as likely words and word-fragments jumped out at me. After a few tweaks I had the decryption above.
A program clever enough to do this by itself would need to have some inkling about digraphs or trigraphs or something, and then it would need some kind of iterative search procedure rather than just picking best offsets independently for each letter. Not terribly difficult, but it seemed easier to do something simpler-minded.
[Edited to add: I think the sequence of giveaway words for the tweaks was: “failure”, “quickly” #2, “overconfident”, done.]
Fantastic!
Well done, sir.
So I think the conclusion is that solving Vigenere ciphers is easier than you thought it was but harder than you thought I thought it was :-).
I won’t actually try to crack this since writing the program would take more time than I’m inclined to spend here, but the principle is simple: Do frequency analysis based on assuming the key is a particular length. If taking every 5th character gives you a letter frequency that resembles English, but taking every 4th or 6th character does not, then the key is 5 letters long. Proceed from there.
Yup. (One handy trick is to take message—shift(message,k) mod 26 for candidate values of k, and see which has a histogram that looks most like that of (English text minus other English text). But it’s probably actually better to find good candidate shifts for each evenly-spaced subset of the text and look for ones that match up.)
JRMayne’s latest sample is still pretty short. It might require human cleverness to solve. If I have time tomorrow (and unless someone else has done it first) I may have a go.
(No, didn’t have time. Sorry. Probably no time on Monday either...)
You understood my question better than I did. Thank you.
Is the following paragraph correct?:
If I had unlimited computing power, I could search for all the inputs that return 37ecb0a3164e6422bedc0f8db82e45ec from the MD5 function. Then I could search those inputs and see which ones are meaningful sentences in English, and then make an educated guess at what the message is. But in reality, it would take too much computing power to find even a single string that returns 37ecb0a3164e6422bedc0f8db82e45ec.
It’s not quite correct, but you’ve broadly got the thrust of it.
When two different inputs produce the same output from a hashing function, this is called a hash collision. Finding collisions in the SHA256 hashing function is part of how bitcoin mining works. It is very computationally expensive (which is kind of the point, re: bitcoin mining), but it’s certainly tractable to find some input that generates the same output as another one.
If you were to try and search the space of all possible inputs for MD5, you’d quickly(ish) find an input that collided with the Obama Werewolf input, but it’d be garbage. If you had some system for quickly and reliably generating comprehensible English sentences, and just searched that space, you’d probably find a comprehensible English sentence that collided with it, but it would almost certainly not be the Obama Werewolf sentence. It’s worth noting that MD5 will take an input of any length, and the space of possible comprehensible English sentences of unbounded length is basically infinite.
For short snippets of text, it’s hard to find two comprehensible English sentences that collide under MD5, but in the link gjm provides above, there’s a method for forcing the MD5 hash of a PDF by exploiting how MD5 works. For this reason and others, if you’re doing any cryptographic heavy lifting, you probably want to use something more robust than MD5.
Really? Last I checked, the best known preimage attack against MD5 was too slow to be practical. Finding collisions is drastically easier, though I don’t know any method for doing it with arbitrary plain-text English sentences.
You said you use this for ROT13ing things. How does that work? Suppose I wanted to make a Facebook post that gave my friends the option of knowing Obama is a werewolf, but also gave them the option for them not to know this if they’d rather be surprised later.
The Chrome plugin I linked to lets you carry out basic encoding functions in the browser, which includes stuff like rot13, URL-safe strings, a couple of hashing functions, etc.