I was contacted by a visitor of this site asking for the following:
‘I read your nice article on file format reverse engineering and was wondering if you could give me a small tip / hint about compression / encryption. I am trying to understand a constant size file format and need to know if by any chance the file is compressed or encrypted in a simpler way, which leaves hope in cracking it.
In the case you would like to have a look at the files, I generated 2 pairs. The first file pair differs only in that one variable. The second file’s name and caption are set to “;1”;, the file 2b to “;1111111…”; (31 chars)’
As the reader seeked advice on how to proceed further and provided enough information to investigate the problem, I took a look.
The first thing that I noticed is that only the first two files were of the same size, the other two were of variable size, as can be seen below:
laughingman:~# ls -la file_* | grep -v .hex -rwxr--r-- 1 root root 1762 2012-01-20 10:47 file_1 -rwxr--r-- 1 root root 1762 2012-01-20 10:48 file_1b -rwxr--r-- 1 root root 1411 2012-01-20 10:48 file_2 -rwxr--r-- 1 root root 1473 2012-01-20 10:49 file_2b laughingman:~#
By the looks of it more than just the options outlined by the reader have changed in between the files provided, or something else has caused the sizes to change. I have not been provided with the application that generates these files and thus I am flying blind.
Investigating the file_1 pair of files first, which should only differ by 1 bit, I generated a hex dump of both the files and then looked to see what changed:
laughingman:~# xxd file_1 file_1.hex laughingman:~# xxd file_1b file_1b.hex laughingman:~# diff file_1.hex file_1b.hex 109,110c109,110 < 00006c0: 17df 4c9f 4754 4028 9f69 16da 11bb 1c97 ..L.GT@(.i...... < 00006d0: 8d7d 9913 1e05 1e48 f1dd 6286 3a4f 1ca7 .}.....H..b.:O.. --- > 00006c0: 17df 4c9f 4754 4028 9f69 16da 11ba 1c97 ..L.GT@(.i...... > 00006d0: 8d7d 9913 1e05 1e48 f1dc 6286 3a4f 1ca7 .}.....H..b.:O.. laughingman:~#
Sometimes, the unix text utilities can make your life easier… changes bolded.
As only 2 bits have changed, both being the 1st bit of the byte, encryption can be ruled out. If this file was encrypted, I would expect at least the surrounding 8 bytes to change for a 64-bit code book encipherment, if not the rest of the file in a cipher block chaining mode. At this stage, the file format could be storing two copies of the variable in different representations, obscured by a XOR algo or something similar.
When performing any engineering (reverse or otherwise) it is important to make a note of your initial assumptions and be ready to re-assess them once more information has been obtained. In this case we are assuming that the file is compressed with some sort of algorithm, encrytion isn’t being used, obfuctation could be employed and the the file is only storing *ONE* copy of the variable that changed. Otherwise we could be dealing with a stream cipher with two copies of the variable stored.
Why compression? Looking at the file, no text strings appear at any point and plotting the occurrence of each byte in the file doesn’t indicate any significant outliers. Also, from what I understand of how certain compression algorithms work, flipping a bit changes the symbol generated upon compression, so other later random bytes can also be modified.
Performing a similar analysis on the file_2 pair of files yields a whole raft of changes:
laughingman:~# xxd file_2 file_2.hex laughingman:~# xxd file_2b file_2b.hex laughingman:~# diff file_2.hex file_2b.hex 1c1 < 0000000: 0000 2001 5053 5000 0000 0577 a2cb d78c .. .PSP....w.... --- > 0000000: 0000 2001 5053 5000 0000 05b5 a2cb d78c .. .PSP.........
In this case changes happen in the first few bytes of the file and procede throughout.
However that is an interesting header. I’d guess that its file revision 1.32 (big endian and hex -> dec numbers) with a signature of ‘PSP\x00’. PlayStation Portable? Paint Shop Pro? Who knows…
Looking at the bytes from 0x08 – 0x0C, which are the only bytes that change in this section (as a 32-bit number), they seem small in big endian format and rather similar. Didn’t the files differ in size by a small amount? Turns out that this unsigned int is the amount of data in the file, after the header of 12 bytes.
Taking a pot shot
So considering the assumption that this file is compressed and there is a header of 12 bytes in the file, what can be done to uncover the format? One of the most common compression algorithms is the deflate algo as specified in rfc1951. Can we find out if this file has been deflated? A bit of python provides the answer:
laughingman:~# python Python 2.5.2 (r252:60911, Jan 4 2009, 17:40:26) [GCC 4.3.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> f = open("file_1", "rb") >>> data = f.read() >>> f.close() >>> import zlib >>> zlib.decompress(data[12:], -15) Traceback (most recent call last): File "", line 1, in zlib.error: Error -3 while decompressing data: invalid distance code >>>
Nope. Not deflated at this point in the file. Various other available compression algorithms and offsets from the start of the file yeilded the same result, no decompression.
The file is compressed with a odd algorithm, or compressed then obfuscated/encrypted before writing to the file. Either way this is as far as this reversing journey is going at this point. More information is the key to unlocking the files secrets and I just don’t have it. No idea what application made the file, no file extension, no idea of the content save for 1 bit or 32 bytes. This is the end my friend.
My first step would be to throw the whole application through ‘strings’ and see if anything interesting appeared. Most GPL libraries have embedded signatures that make it easy to identify them, which could lead to successfully decodingthe file. Otherwise I’d fire up a decompiler and debugger and trace execution until something interesting appeared. While this is slow and boring and painful, it does yield results. I’ve also written a few tools to help in that regard. Saving that, I’d grab file_1 and start playing with the two bytes that changed, a simple script could generate all 256 combinations of the first changed byte quickly and I’d look for error messages when opening the file, which provide more information on the format. This is fuzzing at it’s most basic.
A final thought
I was recently looking at a bunch of files, all with the same header and what looked like random data after them. These files all had different extensions but the same header internally. After spending a while looking at the raw data I picked a file with an extension that I knew the format of and investigated that. Skipping past the common header to the ‘random’ data I checked what this file was meant to look like normally. Turns out that the actual file was ‘encrypted’ by XORing each byte with a know value and the appending the header. As a XOR b = c, c XOR b = a and c XOR a = b, the known value was easy enough to recover and the unknown files decoded successfully. Sometimes knowing what the file is meant to be, by it’s extension, can significantly help in a reversing journey.