What is metadata?

Metadata is data about data. Every single digital artifact has it. It describes the who, what, when, where, how, and sometimes even why of any document, video, photo, or sound clip. This information comes in handy sometimes, like when you’re flipping through old pictures by date or by location. But in the wrong hands, this same information could be damaging.

So, what does it look like?

Metadata exists in the parts of images, videos, or music that we can’t experience as humans. But if you pry into any digital artifact, you can see metadata as a list of keys (or tags) and their corresponding values. One of the simplest tags is “Creation Date,” which naturally points to the time when its creator pushed the shutter button or pressed record. Other interesting tags include the “Make” and “Model” tags, which can tell you what type of camera or computer was used to create the media. There are dozens of such tags, and each one can help tell a very distinct story. Understanding how rich metadata is can help you better protect the identities of sources who have shared their digital media with you.

Analysis with Exiftool

Exiftool is an open source software program that allows you to analyze, edit, and clear metadata. While it can read many file types (images, videos, audio, text, etc.), it isn’t especially good at removing or overwriting metadata in anything other than simple image formats. There are better tools and workflows to fully remove metadata, but we’ll get to this in another section.

In this section, let’s use Exiftool to explore metadata in more depth.

Example: a picture from Flickr (.jpg)

In this example, I was able to read the entire history of an image I posted to my Flickr account.

$ exiftool idied.jpg
ExifTool Version Number         : 10.71
File Name                       : idied.jpg
Directory                       : .
File Size                       : 170 kB
File Modification Date/Time     : 2018:01:04 01:06:30-05:00
File Access Date/Time           : 2018:01:04 01:06:31-05:00
File Inode Change Date/Time     : 2018:01:04 01:06:31-05:00
File Permissions                : rw-r--r--
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
JFIF Version                    : 1.01
Exif Byte Order                 : Little-endian (Intel, II)
Make                            : EASTMAN KODAK COMPANY
Camera Model Name               : KODAK EASYSHARE C653 ZOOM DIGITAL CAMERA
Orientation                     : Rotate 270 CW
X Resolution                    : 480
Y Resolution                    : 480
Resolution Unit                 : inches
Y Cb Cr Positioning             : Co-sited
Exposure Time                   : 1/13
F Number                        : 4.6
Exposure Program                : Program AE
ISO                             : 160
Exif Version                    : 0221
Date/Time Original              : 2006:01:09 07:25:05
Create Date                     : 2006:01:09 07:25:05
Components Configuration        : Y, Cb, Cr, -
Shutter Speed Value             : 1/13
Aperture Value                  : 4.8
Exposure Compensation           : 0
Max Aperture Value              : 4.8
Metering Mode                   : Multi-segment
Light Source                    : Unknown
Flash                           : Off, Did not fire
Focal Length                    : 18.0 mm
Serial Number                   : KCFGP71706722
Flashpix Version                : 0100
Color Space                     : sRGB
Exif Image Width                : 2848
Exif Image Height               : 2144
Interoperability Index          : R98 - DCF basic file (sRGB)
Interoperability Version        : 0100
Exposure Index                  : 160
Sensing Method                  : One-chip color area
File Source                     : Digital Camera
Scene Type                      : Directly photographed
Custom Rendered                 : Normal
Exposure Mode                   : Auto
White Balance                   : Auto
Digital Zoom Ratio              : 0
Focal Length In 35mm Format     : 108 mm
Scene Capture Type              : Standard
Gain Control                    : Low gain up
Contrast                        : Normal
Saturation                      : Normal
Sharpness                       : Normal
Subject Distance Range          : Unknown
Compression                     : JPEG (old-style)
Thumbnail Offset                : 12214
Thumbnail Length                : 5778
Image Width                     : 1280
Image Height                    : 963
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:0 (2 2)
Aperture                        : 4.6
Image Size                      : 1280x963
Megapixels                      : 1.2
Scale Factor To 35 mm Equivalent: 6.0
Shutter Speed                   : 1/13
Thumbnail Image                 : (Binary data 5778 bytes, use -b option to extract)
Circle Of Confusion             : 0.005 mm
Field Of View                   : 18.9 deg
Focal Length                    : 18.0 mm (35 mm equivalent: 108.0 mm)
Hyperfocal Distance             : 14.07 m
Light Value                     : 7.4

That’s a lot of information in just one photo! Among other things, we know that sometime in 2006 (imperfect timestamps notwithstanding), someone took a photo of me with my Kodak EasyShare camera. The lighting, lack of flash, and aperture are decisions the photographer made. Most importantly, you might notice the “Serial Number” tag: it’s now a very public fact that I owned this camera in the early 2000s.
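
To pick the identifying tags out of a long listing like the one above, you can filter Exiftool’s output. The following is a minimal sketch, not a complete audit: identifying_tags is a hypothetical helper name, and the pattern only covers the tags discussed here plus GPS tags, which this photo doesn’t have but many phone photos do.

```shell
# identifying_tags: read an Exiftool listing on stdin and print only the
# lines whose tag name tends to point at a person, device, or moment.
# The helper name and the tag list are illustrative; extend as needed.
identifying_tags() {
    grep -E '^(Make|Camera Model Name|Serial Number|Date/Time Original|Create Date|GPS[^:]*)[[:space:]]*:'
}

# Usage (assumes exiftool is installed and idied.jpg exists):
#   exiftool idied.jpg | identifying_tags
```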

Luckily, we can use this same tool to scrub all the personalizing metadata from the image.

$ exiftool -all= idied.jpg

This command works well with .jpg images, but is not guaranteed to work for a lot of other file types. (So, keep reading!)
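
One way to combine the scrub with a check is sketched below. scrub_and_check is a hypothetical wrapper name. By default, Exiftool keeps the untouched file next to the cleaned one with _original appended to its name; that backup still carries all of the metadata, so keep it out of anything you publish.

```shell
# scrub_and_check: clear all writable tags from one file, then list
# what Exiftool can still read from it. The function name is made up
# for this sketch.
scrub_and_check() {
    exiftool -all= "$1" || return 1
    # Only file-system details (name, size, dates) and basic image
    # structure should be left; anything else deserves a closer look.
    exiftool "$1"
}

# Usage: scrub_and_check idied.jpg
# Note: idied.jpg_original is the pre-scrub backup and is NOT clean.
```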

Example: a cool podcast from The Internet Archive (.mp3)

$ exiftool RubenerdShow363.mp3
ExifTool Version Number         : 10.71
File Name                       : RubenerdShow363.mp3
Directory                       : .
File Size                       : 23 MB
File Modification Date/Time     : 2018:01:03 14:11:11-05:00
File Access Date/Time           : 2018:01:03 21:29:45-05:00
File Inode Change Date/Time     : 2018:01:03 14:11:14-05:00
File Permissions                : rw-r--r--
File Type                       : MP3
File Type Extension             : mp3
MIME Type                       : audio/mpeg
MPEG Audio Version              : 1
Audio Layer                     : 3
Audio Bitrate                   : 128 kbps
Sample Rate                     : 44100
Channel Mode                    : Joint Stereo
MS Stereo                       : On
Intensity Stereo                : Off
Copyright Flag                  : False
Original Media                  : True
Emphasis                        : None
Encoder                         : LAME3.99r
Lame VBR Quality                : 4
Lame Quality                    : 0
Lame Method                     : CBR
Lame Low Pass Filter            : 17 kHz
Lame Bitrate                    : 128 kbps
Lame Stereo Mode                : Joint Stereo
ID3 Size                        : 57034
Release Time                    : 2017
Original Release Time           : 2017:07:14
Recording Time                  : 2017:07:14
Encoding Time                   : 2017:07:14
Tagging Time                    : 2017:07:14
Picture MIME Type               : image/png
Picture Type                    : Front Cover
Picture Description             : 
Picture                         : (Binary data 54706 bytes, use -b option to extract)
Lyrics                          : (SHOWNOTES) 25:22 Join Ruben as he harkens back to one of the first reboot episodes in 2015, when he was also wandering around an empty house that was once his home. Two years later, and he's moving out of the place he moved away from that earlier place to. This show description had several variants of the word move in it. Recorded 3rd of July 2017...Recorded in Sydney, Australia. Licence for this track: Creative Commons Attribution 3.0. Attribution: Ruben Schade...Released July 2017 on Rubnerd and The Overnightscape Underground, an Internet talk radio channel focusing on a freeform monologue style, with diverse and fascinating hosts...
Track                           : 363
Artist                          : Ruben Schade
Album                           : Rubnerd Show
Band                            : Ruben Schade
Title                           : 363: The everything except episode
Genre                           : New Time Radio
Publisher                       : Ruben Schade
Internet Radio Station Name     : Overnightscape Underground
Internet Radio Station Owner    : Frank Edward Nora
File URL                        : https://archive[.]org/download/RubenerdShow363/RubenerdShow363.mp3
Artist URL                      : https://rubenerd[.]com/
Source URL                      : https://rubenerd[.]com/show363/
Internet Radio Station URL      : https://onsug[.]com/
Copyright URL                   : http://creativecommons[.]org/licenses/by/3.0/
Publisher URL                   : https://rubenerd[.]com/show/
Date/Time Original              : 2017:07:14
Duration                        : 0:25:18 (approx)

Example: a PDF from an office scanner (.pdf)

$ exiftool Anonymous\ Witness\ 1\,\ Union\ Laborer_3.13.90.pdf
ExifTool Version Number         : 10.71
File Name                       : Anonymous Witness 1, Union Laborer_3.13.90.pdf
Directory                       : .
File Size                       : 1849 kB
File Modification Date/Time     : 2017:12:15 04:53:38-05:00
File Access Date/Time           : 2017:12:15 04:53:38-05:00
File Inode Change Date/Time     : 2018:01:04 01:22:47-05:00
File Permissions                : rw-r--r--
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.4
Linearized                      : No
Creator                         : KMBT_283
Producer                        : KONICA MINOLTA bizhub 283
Create Date                     : 2017:02:14 17:58:02-05:00
Page Count                      : 8

Do you notice the “Creator” and “Producer” tags? It might be possible to pinpoint exactly where in a certain office building a document was created by investigating the right information.
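
If you receive a batch of scanned documents, it can help to look at only these tags across all of them. A rough sketch, assuming Exiftool is installed; the function name scanner_tags and the directory ./submissions are placeholders.

```shell
# scanner_tags: print the Creator, Producer, and Create Date tags for
# every PDF under a directory. -r recurses into subdirectories and
# -ext pdf limits the scan to .pdf files.
scanner_tags() {
    exiftool -r -ext pdf -Creator -Producer -CreateDate "$1"
}

# Usage: scanner_tags ./submissions
```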

So now you know.

Once again, Exiftool is best as a sanity check. It’s always great to return to this tool to verify that you’ve scrubbed all the possible metadata via other methods. So, now that we understand what metadata looks like, how do we safely remove metadata?

Managing media metadata

Using the MAT

If you’re a Linux user, the Metadata Anonymisation Toolkit, or MAT, is a great tool to help you scrub metadata. This tool works really well for a number of file types, like .jpg, .mp3, .flac, and other common media types. If your desired file type isn’t supported, the program will notify you.

While MAT has a command-line interface, some users might find it easier to use the simple drag-and-drop interface. Either way works.

Metadata removal is a destructive edit! If you need to preserve the original metadata, be sure to make derivative copies of your media before proceeding.
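
A minimal sketch of that precaution: copy the file into a separate working directory and clean only the copy. The function name make_working_copy and the default directory ./working-copies are placeholders, not anything MAT requires.

```shell
# make_working_copy SRC [DIR]: copy SRC into a working directory
# (created if needed) and print the path of the copy. The body runs in
# a subshell so its variables don't leak into your session.
make_working_copy() (
    src=$1
    workdir=${2:-./working-copies}
    mkdir -p "$workdir" || exit 1
    cp -p -- "$src" "$workdir/" || exit 1
    printf '%s\n' "$workdir/$(basename -- "$src")"
)

# Usage: clean the copy, keep the original untouched.
#   copy=$(make_working_copy /path/to/dirty/image.jpg)
#   mat "$copy"
```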

Drag-and-drop

Navigate to MAT from Applications->System Tools. Drag and drop your files into the window (or click “Add” to select them from a file browser window).

Select all your files, and click “Clean”. After a while, each file’s state will change from either “Dirty” or “Unknown” to “Clean”.

Command-line ballers

You can do the same via the command line. Open a terminal and input:

$ mat /path/to/dirty/image.jpg

Using FFmpeg

FFmpeg is a much-loved audio-visual Swiss Army knife that helps users manipulate rich media file types, like .mp4, .mov, .mkv, and .wmv. With FFmpeg, making a metadata-free copy of your original file is as simple as running:

ffmpeg -i /path/to/original/file.mp4 -map_metadata -1 -c:v copy -c:a copy /path/to/clean/clone.mp4
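
To confirm the result, you can ask ffprobe (which ships with FFmpeg) which tags the copy still carries. This is a sketch under assumptions: clean_video is a hypothetical wrapper name, and depending on the container, FFmpeg may write a few tags of its own (such as an encoder string) into the new file.

```shell
# clean_video IN OUT: make a metadata-free copy, then show leftover tags.
clean_video() {
    # Copy the audio and video streams without re-encoding, dropping metadata.
    ffmpeg -i "$1" -map_metadata -1 -c:v copy -c:a copy "$2" || return 1
    # Print any container-level and stream-level tags left in the copy.
    ffprobe -v error -show_entries format_tags:stream_tags \
        -of default=noprint_wrappers=1 "$2"
}

# Usage: clean_video /path/to/original/file.mp4 /path/to/clean/clone.mp4
```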

Bad news about Word docs, PDFs, etc.

The aforementioned tools work really well with visual and audio media, but text documents are unfortunately much more complex. Documents like .docx, .xlsx, .pdf, .ppt, and others usually contain multiple embedded images, videos, and other media files. They’re kind of like nesting dolls. So, while it’s possible to scrub basic metadata tags from any of these documents, the objects embedded within them have so much metadata of their own that can be individually scrutinized. This makes the idea of software-based redaction somewhat foolish.

Here’s an example: using another open source tool called Peepdf, we’re able to see all the different objects (like images) embedded in any .pdf file. So, even if we were to strip the metadata from the document itself, anyone could extract any of its individual embedded images and parse their metadata for more identifying context using any of the aforementioned methods. (Also, did I forget to mention that embedded images could be extremely tiny, and invisible to the naked eye?)

Instead, it’s best to recreate the document by flattening all the embedded objects before exporting and sharing it. For these types of documents, that means either printing them out, then rescanning; or exporting them to a different format altogether.

First Look Media’s PDF Redact Tools is a great PDF flattening tool. It automates metadata removal by creating an image of each page within a document, and gluing them back together into a brand new PDF. While this is a fabulous tool, there are two downsides: the resulting PDF is usually a lot larger than its original, which might make export and sharing more cumbersome; and it relies upon a library, ImageMagick, with a somewhat buggy history. That said, PDF Redact Tools is incredibly easy to work with, and does an excellent job at metadata removal. If you can install it on a dedicated, sandboxed machine, it makes a great tool to have in your toolkit.
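
Going by the project’s README, the flattening step is a single command. Treat the --sanitize flag and the -final.pdf output name below as assumptions and confirm them with pdf-redact-tools --help on your own machine; flatten_pdf is a hypothetical wrapper name.

```shell
# flatten_pdf FILE.pdf: flatten a PDF, then read back what metadata the
# new file carries. ASSUMPTIONS: the --sanitize flag, and that the
# result is written next to the input as FILE-final.pdf.
flatten_pdf() {
    pdf-redact-tools --sanitize "$1" || return 1
    exiftool "${1%.pdf}-final.pdf"
}

# Usage: flatten_pdf untrusted.pdf
```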

If you’re interested in doing named entity recognition (NER), word frequencies, or just better searching within text, a flattened PDF file will be hard to work with because all the text will now be image-based. Thankfully, tools exist to “read” images into workable text, like Tesseract. You can explode a flattened PDF into individual images of the pages using PDF Redact Tools, then feed the pages into Tesseract to create a text document that can be worked on with any language processing tool. Beware, however, that optical character recognition is imperfect, and you might have to comb through the resulting text to fix typos. The English dictionary data is installed by default, but other language packs are available. Visit the Tesseract wiki for more information.
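
The OCR step might look like the sketch below, which runs Tesseract over every PNG in a directory of page images. The function name ocr_pages and the directory ./pages are placeholders; eng is Tesseract’s code for the English language data.

```shell
# ocr_pages DIR [LANG]: run Tesseract on each PNG in DIR. For an input
# named page.png, Tesseract writes its text to page.txt.
ocr_pages() {
    for page in "$1"/*.png; do
        [ -e "$page" ] || continue      # no PNGs: the glob stays literal
        tesseract "$page" "${page%.png}" -l "${2:-eng}" || return 1
    done
}

# Usage: ocr_pages ./pages
# Then read or concatenate the resulting ./pages/*.txt files.
```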

Other redaction tools

If you don’t have Photoshop, you might enjoy GIMP, its open source alternative, which can be used for performing visual redactions to PDFs and other documents.

Audacity is an audio toolkit that allows you to splice audio to your liking. I find it’s the perfect tool for editing interviews that may contain off-the-record statements.

Be aware that these types of edits are non-destructive, meaning that metadata, project history, and original artifacts can be uncovered by forensic analysis. Using GIMP and Audacity is a great way to perform audio and visual redactions, but you should still take care to flatten your media before publishing by “jumping the analog hole” and using Exiftool to verify you’ve done it correctly.

The Analog Hole

Although there are a number of really great software tools to help understand, manipulate, and scrub metadata, nothing is perfect. As we explored in the previous section, digital forensic specialists might still be able to uncover bits of history from the bytes in any digital artifact. One creative way to be sure that original metadata is inaccessible is to recreate the original through “the analog hole”.

Have you ever bought a bootleg movie? (It’s ok, no judgements!) If you have, you might remember that those movies were created by someone sneaking into the theater with their own camcorder, and simply taping the entire movie from their seat. That’s an example of the analog hole; and you can use similar tactics to create unattributable copies of your original media.

Some ideas for jumping through the analog hole

Images: Take a screenshot from your computer and publish that instead.
Video: Use the QuickTime tool to capture a movie from your screen as it plays.
Audio: Purchase an audio loopback cable, and play an MP3 directly into a digital recorder. Or, purchase a USB adapter to record audio input directly into your computer.
Office Documents/PDFs: Copy the text into a new document. Print the replicated document, and re-scan it into your computer.

Caveats Galore

Once again, nothing is ever perfect. Even the analog hole might lead to some trouble. For example, a well-known tactic in the intelligence community is to create several, almost identical copies of the same document, each one containing minute typos. That way, if a sensitive document finds itself published in the press, the whistleblower would be identified because the printed document would contain the tell-tale typo. This is a clear example of the myriad ways a source may still be compromised despite the great consideration and care you have taken to protect their digital assets. Please be mindful of this when working with submissions.

Setting up a media workstation

With the exception of MAT, which was written for Debian Linux, all the tools mentioned here can be used on most operating systems. While you could easily turn your computer into a powerful, metadata-crunching workhorse, take your own privacy, security, and confidentiality into consideration. You may be working with extra-sensitive material, so it might not be wise to handle it on your day-to-day machine.

I find it easiest to juggle these considerations by compartmentalizing my workspace; having a dedicated space to prod, cut, copy, and paste gives me more confidence in my ability to handle sensitive media safely and sanely. I build myself a sandbox: a somewhat safe place to do somewhat dangerous things.

Tails

Tails (The Amnesic Incognito Live System) is a fully self-contained operating system that lives on a USB stick. To use it, simply plug the specially formatted USB stick into any computer, and instruct it to boot from the stick instead of your normal operating system (e.g. Mac or Windows). When you boot into it, you enable a session to do whatever you want to do, in relative safety. Once you shut down, all traces of your session are erased. This makes it an ideal sandbox.

Tails is an almost perfect choice for a media workstation, as it comes with MAT, Exiftool, PDF Redact Tools, GIMP, and Audacity right out of the box. For packages that aren’t installed on Tails by default, you will have to start a Tails session with admin privileges and download and install the appropriate software.

For example, if you want to install FFmpeg, open a terminal window and input:

sudo apt-get update
sudo apt-get install ffmpeg

Installing persistent software in Tails

Remember, Tails is amnesic; once you end a session, all files and programs that weren’t originally included in Tails will be lost. However, there is a way to enable persistence on your Tails USB stick so you can install extra software, manage projects, etc. across reboots. Follow the instructions from the Tails website for enabling persistence before installing things like FFmpeg, and starting more advanced projects.

Once you have persistence enabled, reboot Tails and start a session with admin privileges. If your desired software is already packaged for Debian, it should be simple to install within Tails using apt. Once installed, simply add the package name to your additional software manifest, located at:
/live/persistence/TailsData_unlocked/live-additional-software.conf

For instance, installing Tesseract is pretty simple. Since it has a repository in Debian, it’s as easy as running:

$ sudo apt update
$ sudo apt install tesseract-ocr
$ echo "tesseract-ocr" | sudo tee -a /live/persistence/TailsData_unlocked/live-additional-software.conf > /dev/null

Note: additional software takes some time to become available across reboots. This is because Tails must re-install each program listed in the additional software manifest described above. Please be patient, and wait for the notification reading “Your additional software are installed” before attempting to use any special programs.
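
Before rebooting, it can be worth checking that the package really made it into the manifest, and adding it only if it is missing so the file doesn’t collect duplicates. A small sketch; is_listed is a hypothetical helper, and if the manifest isn’t readable by your user you would need to run the check with sudo as well.

```shell
# is_listed PKG FILE: succeed if FILE contains a line that is exactly PKG.
is_listed() {
    grep -qxF -- "$1" "$2" 2>/dev/null
}

conf=/live/persistence/TailsData_unlocked/live-additional-software.conf
# Usage (needs a Tails session with admin privileges):
#   is_listed tesseract-ocr "$conf" ||
#       echo "tesseract-ocr" | sudo tee -a "$conf" > /dev/null
```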

Photo by Sebastiaan ter Burg. CC-BY-2.0