Corrupted Images with a twist

The problem

I grabbed my old personal backup hard disk and checked what was actually on it, since I couldn’t remember (note to self: start labeling!).

While browsing the files, I noticed that none of the images worked anymore. No application recognized them as valid JPEG/GIF/whatever. Other files such as text and music worked fine though. The standard image viewer of Cinnamon gave me an additional hint; it said: “Not a JPEG file: starts with 0x03 0x00”

(Image: corrupted image)

So I took a look at the binary form of the file with hexdump -C 1cup.jpg and saw this:

00000000  03 00 00 00 02 00 00 00  ac 00 00 00 00 00 00 00  |................|
00000010  00 00 00 00 01 00 04 80  14 00 00 00 30 00 00 00  |............0...|
00000020  00 00 00 00 4c 00 00 00  01 05 00 00 00 00 00 05  |....L...........|
00000030  15 00 00 00 f6 3d 3e a7  87 82 14 d6 fa d4 89 69  |.....=>........i|
00000040  e9 03 00 00 01 05 00 00  00 00 00 05 15 00 00 00  |................|
00000050  f6 3d 3e a7 87 82 14 d6  fa d4 89 69 e9 03 00 00  |.=>........i....|
00000060  02 00 60 00 04 00 00 00  00 00 18 00 ff 01 1f 00  |..`.............|
00000070  01 02 00 00 00 00 00 05  20 00 00 00 20 02 00 00  |........ ... ...|
00000080  00 00 14 00 ff 01 1f 00  01 01 00 00 00 00 00 05  |................|
00000090  12 00 00 00 00 00 14 00  bf 01 13 00 01 01 00 00  |................|
000000a0  00 00 00 05 0b 00 00 00  00 00 18 00 a9 00 12 00  |................|
000000b0  01 02 00 00 00 00 00 05  20 00 00 00 21 02 00 00  |........ ...!...|
000000c0  01 00 00 00 00 00 00 00  50 b6 00 00 00 00 00 00  |........P.......|
000000d0  00 00 00 00 ff d8 ff e0  00 10 4a 46 49 46 00 01  |..........JFIF..|
000000e0  01 01 00 60 00 60 00 00  ff db 00 43 00 06 04 04  |...`.`.....C....|
…

Wow, so the file has some weird header, but from offset 0x000000d4 onward it looks fine: the bytes ff d8 ff e0 followed by JFIF match the JPEG entry in Wikipedia’s list of file signatures.
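The same check can be scripted instead of read off the hexdump by eye. This is only a minimal sketch: the three-entry signature table and the function name `find_signature` are illustrative assumptions, not something I used at the time.

```python
# A deliberately tiny table of well-known magic numbers (illustrative only).
SIGNATURES = {
    'jpeg': b'\xff\xd8\xff',
    'png': b'\x89PNG\r\n\x1a\n',
    'gif': b'GIF8',
}

def find_signature(data: bytes, limit: int = 4096):
    """Return (format, offset) of the earliest known signature in the
    first `limit` bytes of `data`, or None if none is found."""
    window = data[:limit]
    hits = []
    for name, magic in SIGNATURES.items():
        pos = window.find(magic)
        if pos != -1:
            hits.append((pos, name))
    if not hits:
        return None
    pos, name = min(hits)
    return name, pos

# Synthetic stand-in for the broken file: 0xd4 filler bytes, then a JPEG start.
sample = bytes(0xd4) + b'\xff\xd8\xff\xe0'
print(find_signature(sample))  # ('jpeg', 212)
```

Short signatures can also show up by chance inside the prefix, so a hit is a hint rather than proof.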

I am not entirely sure where that header came from. Either some export/backup went wrong or there was some virus on my Windows machine back in the day which added a small something to each file. If you know what’s going on, drop me a message via one of the channels to the left (preferably Twitter).
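One candidate explanation, offered as an unverified guess: the prefix resembles the stream headers that the Windows backup API writes (`WIN32_STREAM_ID`, as produced by `BackupRead`). Under that assumption, each stream starts with a 20-byte little-endian header (stream id, attributes, 64-bit size, name size), and the hexdump would contain a security stream (id 3, 0xac bytes) followed by the data stream (id 1). The sketch below walks such headers; the function name `find_data_offset` is hypothetical and the whole interpretation is an assumption.

```python
import struct

# Assumed layout of WIN32_STREAM_ID without the name field:
# stream id, attributes, 64-bit size, name size (little-endian, 20 bytes).
STREAM_HEADER = struct.Struct('<IIQI')
BACKUP_DATA = 1  # stream id that would mark the actual file content

def find_data_offset(blob: bytes):
    """Walk the assumed stream headers and return the offset where the
    file data would start, or None if no data stream is found."""
    pos = 0
    while pos + STREAM_HEADER.size <= len(blob):
        stream_id, _attrs, size, name_size = STREAM_HEADER.unpack_from(blob, pos)
        payload = pos + STREAM_HEADER.size + name_size
        if stream_id == BACKUP_DATA:
            return payload
        pos = payload + size  # skip this stream's payload
    return None

# The two headers as they appear in the hexdump above (payload bytes zeroed).
prefix = (STREAM_HEADER.pack(3, 2, 0xac, 0) + bytes(0xac)
          + STREAM_HEADER.pack(1, 0, 0xb650, 0))
print(hex(find_data_offset(prefix)))  # 0xd4
```

If this reading is right, the prefix length depends on the size of the first stream, so a fixed offset is only safe as long as every file carries the same amount of data in front.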

The solution

To see the actual image, I simply cut off the prefix. On Linux this is easy with the command tail -c +213 1cup.jpg > 1cup_recovered.jpg. The 213 is the hexadecimal offset d4 converted to decimal (212) plus 1, because tail -c +N starts its output at the N-th byte, counting from 1 (a classic off-by-one or fencepost issue).
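The offset arithmetic is easy to get wrong, so here it is spelled out in Python. A small sketch; the helper name `strip_prefix` and the synthetic sample bytes are made up for illustration.

```python
HEADER_OFFSET = 0xd4  # 212: zero-based offset of the first real byte

# tail -c +N counts bytes starting at 1, so the zero-based offset needs +1.
tail_start = HEADER_OFFSET + 1
print(tail_start)  # 213

def strip_prefix(data: bytes, offset: int = HEADER_OFFSET) -> bytes:
    """Drop the first `offset` bytes. Slices are zero-based, so no +1 here."""
    return data[offset:]

# Synthetic stand-in: 212 filler bytes followed by the start of a JPEG.
sample = bytes(HEADER_OFFSET) + b'\xff\xd8\xff\xe0'
print(strip_prefix(sample))  # b'\xff\xd8\xff\xe0'
```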

Voilà! The image could now be opened and I could finally look at the medieval joke about that weird video we all know (but would rather not).

(Image: recovered image)

Now that I knew the files were still intact but prefixed with some weirdness, I checked several other files to see whether the real content always started at the same offset.

And that was exactly the case! Not only that, but all files were equally broken, MP3s included. The music players just didn’t care about the extra bytes.
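Checking files one by one gets tedious, so a read-only survey is a useful step before touching anything. The following is a sketch of how that could look, not what I actually ran; the function name `survey` is hypothetical, and it only tests the two leading bytes, not the full prefix.

```python
import os
from collections import Counter

CORRUPT_HEADER = b'\x03\x00'  # first two bytes seen in the hexdump

def survey(root: str) -> Counter:
    """Count files under `root` by state and extension. Nothing is modified."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    affected = f.read(len(CORRUPT_HEADER)) == CORRUPT_HEADER
            except OSError:
                counts[('unreadable', '')] += 1
                continue
            ext = os.path.splitext(name)[1].lower()
            counts[('affected' if affected else 'clean', ext)] += 1
    return counts
```

Printing `sorted(survey(some_directory).items())` gives a quick overview of which file types carry the prefix, where `some_directory` stands for wherever the backup is mounted.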

Since all files had the same header, all I needed was a simple Python script that strips this prefix from all files recursively.

#!/usr/bin/env python3

import logging
import argparse
import os

CORRUPT_HEADER = b'\x03\x00'   # first bytes of the unwanted prefix
ORIGINAL_HEADER_OFFSET = 0xd4  # the real file content starts here

def fix_files(path: str, recursive: bool):
    for node in os.listdir(path):
        rel_path = os.path.join(path, node)
        if recursive and os.path.isdir(rel_path):
            fix_files(rel_path, True)

        if os.path.isfile(rel_path):
            logging.debug('Checking: {}'.format(rel_path))
            # Check if file is affected (only the first two bytes are compared)
            with open(rel_path, 'rb') as f:
                file_header = f.read(len(CORRUPT_HEADER))
                if file_header != CORRUPT_HEADER:
                    # File is not affected or already fixed
                    continue

                # Read file w/o weird header
                logging.debug('Fixing: {}'.format(rel_path))
                f.seek(ORIGINAL_HEADER_OFFSET)
                original_content = f.read()
            # Overwrite the file in place with the fixed content
            with open(rel_path, 'wb') as f:
                f.write(original_content)
            logging.info('Corrected file {}'.format(rel_path))

if __name__ == '__main__':
    argparser = argparse.ArgumentParser()
    argparser.add_argument('path', help='specify the path with files inside to work on')
    argparser.add_argument('-r', '--recursive', action='store_true', help='not only work in `path` but subdirectories as well')
    argparser.add_argument('-v', '--verbose', action='store_true', help='enable debug output')

    args = argparser.parse_args()

    if args.verbose:
        logging.getLogger().setLevel(logging.DEBUG)
    else:
        logging.getLogger().setLevel(logging.INFO)

    fix_files(args.path.rstrip('/'), args.recursive)

The script is very basic, without any error handling or parallelism, but it got the job done (after quite some time).
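For anyone who wants a bit more safety, here is a sketch of a more careful per-file variant. It is not what I ran: it assumes the same two-byte check and fixed offset, and the function name `strip_header_safely` is made up. It streams the content into a temporary file next to the original and only swaps it in once the copy has succeeded, so an interruption does not leave a half-written file behind.

```python
import os
import shutil
import tempfile

CORRUPT_HEADER = b'\x03\x00'
HEADER_OFFSET = 0xd4

def strip_header_safely(path: str) -> bool:
    """Strip the prefix from `path` via a temp file.
    Returns True if the file was changed, False if it was left alone."""
    if os.path.getsize(path) < HEADER_OFFSET:
        return False  # too short to contain the whole prefix
    with open(path, 'rb') as src:
        if src.read(len(CORRUPT_HEADER)) != CORRUPT_HEADER:
            return False  # not affected or already fixed
        src.seek(HEADER_OFFSET)
        # Create the temp file in the same directory so the final
        # os.replace stays on one filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
        try:
            with os.fdopen(fd, 'wb') as dst:
                shutil.copyfileobj(src, dst)  # copies in chunks
        except BaseException:
            os.unlink(tmp_path)
            raise
    shutil.copystat(path, tmp_path)  # keep mode and timestamps
    os.replace(tmp_path, path)       # swap in the fixed file
    return True
```

Calling it a second time on the same file returns False, unless the recovered content itself happens to start with 03 00.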