What is metadata?

Metadata is data about data. Every single digital artifact has it. It describes the who, what, when, where, how, and sometimes even why, for any document, video, photo, or sound clip. This information comes in handy sometimes, like when you’re flipping through old pictures by date or location. But in the wrong hands, this same information could be damaging. We’ll take a crash course in some of the tools you can use to analyze, manipulate, and scrub media metadata.

So, what does it look like?

Metadata exists in the parts of images, videos, or music that we can’t experience as humans. But if you pry into any digital artifact, you can see metadata as a list of keys (or tags) and their corresponding values. One of the simplest tags is “creation date,” which points to the time when its creator pushed the shutter button or pressed record. Other interesting tags include the “make” and “model” tags, which can tell you what type of camera or computer was used to create the media. There are dozens of such tags, and each one can help tell a very distinct story. This is why understanding how rich metadata is can better help protect the identities of sources who have shared their digital media with you.

Setting up a media workstation sandbox

Most of the tools mentioned in this guide can be used on most computer operating systems. While you could easily turn your computer into a powerful, metadata-crunching workhorse, take your own privacy and security into consideration. You may be working with extra-sensitive material, so it might not be wise to handle it on your day-to-day machine.

I find it easiest to juggle these considerations by compartmentalizing my workspace; having a dedicated space to prod, cut, copy, and paste gives me more confidence in my ability to handle sensitive media safely and sanely. I build myself a “sandbox”: a somewhat safe place to do somewhat dangerous things.

Tails

Tails, or The Amnesic Incognito Live System, is a fully self-contained computer that lives on a USB drive. To use it, install Tails on a blank USB drive and plug it into any PC or Mac. You’ll need to instruct your computer to boot up from USB, instead of your normal operating system (i.e., macOS or Windows). When you boot into it, you enable a session to do whatever you want to do, in relative safety. When you connect to the internet in Tails, your network traffic is anonymized using Tor, although you can also choose not to connect to the internet at all. Once you shut down, all traces of your session are erased. This makes it an ideal sandbox.

Tails is an almost perfect choice for a media workstation, as it comes with useful tools for manipulating metadata that we will explain in this article, like Metadata Cleaner, ExifTool, GIMP, and Audacity, all right out of the box. For software packages that aren’t installed on Tails by default, you will have to start a Tails session with an administration password set, and then download and install the appropriate software.

For example, if you want to install VLC, an application that not only can play video and audio but also convert them to different formats, first connect to the internet and wait for Tor to get ready. Then, navigate to: Apps > System Tools > Synaptic Package Manager and use the search feature to look for “vlc.” Click on the checkbox next to the search result and choose “Mark for installation.” Confirm the installation for any prerequisite packages by clicking “Mark,” then clicking the “Apply” icon, and then clicking the “Apply” button when prompted to confirm.

Once the installation is complete, Tails will ask if you want to install the selected application for just this session or all sessions going forward (with persistence enabled). Let’s talk about the latter option.

Keeping installed software in Tails after rebooting

Remember, Tails is amnesic; once you end a session, all files and software that weren’t originally included in Tails will be lost. However, there is a way to enable persistence on your Tails USB drive so you can install extra software, manage projects, etc., between reboots. Follow the instructions from the Tails website for enabling persistence before installing new programs or starting more advanced projects.

Additional software takes some time to be available across reboots. This is because Tails must reinstall each program at the beginning of a new session. Please be patient and wait for the notification that reads, “Additional software installed successfully” before attempting to use any additional programs.

Exploring metadata with metadata tools

Analysis with ExifTool

ExifTool is an open source software program that allows you to analyze, edit, and clear metadata. While it’s capable of handling multiple file types (images, videos, audio, text, etc.), it isn’t as good at removing or overwriting metadata from files other than simple image formats. There are better tools and workflows to fully remove metadata, but we’ll get to that in another section.

In this section, let’s use ExifTool to explore metadata in more depth.

Example: a picture from Flickr (.jpg)

In this example, I was able to read the entire history of an image I posted to my Flickr account.

me@computer:~$ exiftool idied.jpg 
ExifTool Version Number         : 10.71
File Name                       : idied.jpg
Directory                       : .
File Size                       : 170 kB
File Modification Date/Time     : 2018:01:04 01:06:30-05:00
File Access Date/Time           : 2018:01:04 01:06:31-05:00
File Inode Change Date/Time     : 2018:01:04 01:06:31-05:00
File Permissions                : rw-r--r--
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
JFIF Version                    : 1.01
Exif Byte Order                 : Little-endian (Intel, II)
Make                            : EASTMAN KODAK COMPANY
Camera Model Name               : KODAK EASYSHARE C653 ZOOM DIGITAL CAMERA
Orientation                     : Rotate 270 CW
X Resolution                    : 480
Y Resolution                    : 480
Resolution Unit                 : inches
Y Cb Cr Positioning             : Co-sited
Exposure Time                   : 1/13
F Number                        : 4.6
Exposure Program                : Program AE
ISO                             : 160
Exif Version                    : 0221
Date/Time Original              : 2006:01:09 07:25:05
Create Date                     : 2006:01:09 07:25:05
Components Configuration        : Y, Cb, Cr, -
Shutter Speed Value             : 1/13
Aperture Value                  : 4.8
Exposure Compensation           : 0
Max Aperture Value              : 4.8
Metering Mode                   : Multi-segment
Light Source                    : Unknown
Flash                           : Off, Did not fire
Focal Length                    : 18.0 mm
Serial Number                   : KCFGP71706722
Flashpix Version                : 0100
Color Space                     : sRGB
Exif Image Width                : 2848
Exif Image Height               : 2144
Interoperability Index          : R98 - DCF basic file (sRGB)
Interoperability Version        : 0100
Exposure Index                  : 160
Sensing Method                  : One-chip color area
File Source                     : Digital Camera
Scene Type                      : Directly photographed
Custom Rendered                 : Normal
Exposure Mode                   : Auto
White Balance                   : Auto
Digital Zoom Ratio              : 0
Focal Length In 35mm Format     : 108 mm
Scene Capture Type              : Standard
Gain Control                    : Low gain up
Contrast                        : Normal
Saturation                      : Normal
Sharpness                       : Normal
Subject Distance Range          : Unknown
Compression                     : JPEG (old-style)
Thumbnail Offset                : 12214
Thumbnail Length                : 5778
Image Width                     : 1280
Image Height                    : 963
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:0 (2 2)
Aperture                        : 4.6
Image Size                      : 1280x963
Megapixels                      : 1.2
Scale Factor To 35 mm Equivalent: 6.0
Shutter Speed                   : 1/13
Thumbnail Image                 : (Binary data 5778 bytes, use -b option to extract)
Circle Of Confusion             : 0.005 mm
Field Of View                   : 18.9 deg
Focal Length                    : 18.0 mm (35 mm equivalent: 108.0 mm)
Hyperfocal Distance             : 14.07 m
Light Value                     : 7.4

That’s a lot of information in just one photo! Among other things, we know that sometime in 2006 (imperfect time stamps notwithstanding), someone took a photo of me with my Kodak EasyShare camera. The lighting, lack of flash, and aperture are decisions the photographer made. Most importantly, you might notice the “serial number” tag: it’s now a very public fact that I owned this camera in the mid-2000s.
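To make the “keys and values” idea concrete, here is a minimal Python sketch that reads ExifTool’s default text output into a dictionary and picks out a few tags that could point back to a person or device. The function names and the short list of “identifying” tags are my own assumptions for illustration; a real review should look at every tag, not just these.

```python
# Illustrative assumption: a small, non-exhaustive set of tags worth a second look.
IDENTIFYING_TAGS = {
    "Make",
    "Camera Model Name",
    "Serial Number",
    "Date/Time Original",
    "Create Date",
}


def parse_exiftool_text(output: str) -> dict[str, str]:
    """Turn ExifTool's default "Tag Name : value" lines into a dictionary."""
    tags = {}
    for line in output.splitlines():
        # Split on the first colon only, because values such as
        # timestamps contain colons of their own.
        name, separator, value = line.partition(":")
        if separator:
            # If a tag name repeats, the later line wins.
            tags[name.strip()] = value.strip()
    return tags


def identifying_tags(tags: dict[str, str]) -> dict[str, str]:
    """Keep only the tags named in IDENTIFYING_TAGS."""
    return {name: value for name, value in tags.items() if name in IDENTIFYING_TAGS}


# A few lines copied from the output above.
SAMPLE = """\
Make                            : EASTMAN KODAK COMPANY
Camera Model Name               : KODAK EASYSHARE C653 ZOOM DIGITAL CAMERA
ISO                             : 160
Date/Time Original              : 2006:01:09 07:25:05
Serial Number                   : KCFGP71706722
"""

for name, value in identifying_tags(parse_exiftool_text(SAMPLE)).items():
    print(f"{name}: {value}")
```

Keeping the tag list in one place makes it easy to grow as you learn which tags matter for the kind of media you handle.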

Luckily, we can use this same tool to scrub all the personalizing metadata from the image.

me@computer:~$ exiftool -all= idied.jpg

This command works well with .jpg images, but is not guaranteed to work for a lot of other file types. (So keep reading!)
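If you would rather script this step, here is a minimal Python sketch that wraps the same command. The function names are hypothetical, chosen just for this example. One detail worth knowing: by default ExifTool keeps the untouched file next to the cleaned one with “_original” appended to its name, so that backup still carries every tag until you delete it. The -overwrite_original option skips the backup.

```python
import subprocess


def build_scrub_command(path: str, keep_backup: bool = True) -> list[str]:
    """Build the argument list for `exiftool -all= <path>`."""
    command = ["exiftool", "-all="]
    if not keep_backup:
        # Without this option ExifTool leaves "<path>_original" behind,
        # and that backup still contains the original metadata.
        command.append("-overwrite_original")
    command.append(path)
    return command


def scrub(path: str, keep_backup: bool = True) -> None:
    """Run ExifTool; raises CalledProcessError if it reports a failure."""
    subprocess.run(build_scrub_command(path, keep_backup), check=True)
```

Calling scrub("idied.jpg", keep_backup=False) would run the command shown above without leaving a backup, assuming ExifTool is installed and on your PATH.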

Example: A cool podcast from the Internet Archive (.mp3)

me@computer:~$ exiftool RubenerdShow363.mp3 
ExifTool Version Number         : 10.71
File Name                       : RubenerdShow363.mp3
Directory                       : .
File Size                       : 23 MB
File Modification Date/Time     : 2018:01:03 14:11:11-05:00
File Access Date/Time           : 2018:01:03 21:29:45-05:00
File Inode Change Date/Time     : 2018:01:03 14:11:14-05:00
File Permissions                : rw-r--r--
File Type                       : MP3
File Type Extension             : mp3
MIME Type                       : audio/mpeg
MPEG Audio Version              : 1
Audio Layer                     : 3
Audio Bitrate                   : 128 kbps
Sample Rate                     : 44100
Channel Mode                    : Joint Stereo
MS Stereo                       : On
Intensity Stereo                : Off
Copyright Flag                  : False
Original Media                  : True
Emphasis                        : None
Encoder                         : LAME3.99r
Lame VBR Quality                : 4
Lame Quality                    : 0
Lame Method                     : CBR
Lame Low Pass Filter            : 17 kHz
Lame Bitrate                    : 128 kbps
Lame Stereo Mode                : Joint Stereo
ID3 Size                        : 57034
Release Time                    : 2017
Original Release Time           : 2017:07:14
Recording Time                  : 2017:07:14
Encoding Time                   : 2017:07:14
Tagging Time                    : 2017:07:14
Picture MIME Type               : image/png
Picture Type                    : Front Cover
Picture Description             : 
Picture                         : (Binary data 54706 bytes, use -b option to extract)
Lyrics                          : (SHOWNOTES) 25:22 Join Ruben as he harkens back to one of the first reboot episodes in 2015, when he was also wandering around an empty house that was once his home. Two years later, and he's moving out of the place he moved away from that earlier place to. This show description had several variants of the word move in it. Recorded 3rd of July 2017...Recorded in Sydney, Australia. Licence for this track: Creative Commons Attribution 3.0. Attribution: Ruben Schade...Released July 2017 on Rubnerd and The Overnightscape Underground, an Internet talk radio channel focusing on a freeform monologue style, with diverse and fascinating hosts...
Track                           : 363
Artist                          : Ruben Schade
Album                           : Rubnerd Show
Band                            : Ruben Schade
Title                           : 363: The everything except episode
Genre                           : New Time Radio
Publisher                       : Ruben Schade
Internet Radio Station Name     : Overnightscape Underground
Internet Radio Station Owner    : Frank Edward Nora
File URL                        : https://archive[.]org/download/RubenerdShow363/RubenerdShow363.mp3
Artist URL                      : https://rubenerd[.]com/
Source URL                      : https://rubenerd[.]com/show363/
Internet Radio Station URL      : https://onsug[.]com/
Copyright URL                   : http://creativecommons[.]org/licenses/by/3.0/
Publisher URL                   : https://rubenerd[.]com/show/
Date/Time Original              : 2017:07:14
Duration                        : 0:25:18 (approx)

Example: A PDF from an office scanner (.pdf)

me@computer:~$ exiftool Anonymous\ Witness\ 1\,\ Union\ Laborer_3.13.90.pdf 
ExifTool Version Number         : 10.71
File Name                       : Anonymous Witness 1, Union Laborer_3.13.90.pdf
Directory                       : .
File Size                       : 1849 kB
File Modification Date/Time     : 2017:12:15 04:53:38-05:00
File Access Date/Time           : 2017:12:15 04:53:38-05:00
File Inode Change Date/Time     : 2018:01:04 01:22:47-05:00
File Permissions                : rw-r--r--
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.4
Linearized                      : No
Creator                         : KMBT_283
Producer                        : KONICA MINOLTA bizhub 283
Create Date                     : 2017:02:14 17:58:02-05:00
Page Count                      : 8

Do you notice the “creator” and “producer” tags? They name the make and model of the scanner that produced the file, here a Konica Minolta bizhub 283. Combined with the creation date, that might be enough to pinpoint exactly where in a certain office building a document was created.

So now you know.

Once again, ExifTool is best as a sanity check. It’s always great to return to this tool to verify that you’ve scrubbed all the possible metadata via other methods. So now that we understand what metadata looks like, how do we safely remove metadata?
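One way to turn that sanity check into a repeatable step is to ask ExifTool for JSON with group names (exiftool -j -G yourfile) and look for anything that is not simply a fact about the file on disk. The sketch below is a rough take on that idea. The function name, the choice of which groups count as “benign,” and both sample reports are my own assumptions; the samples are hand-written for illustration, not real ExifTool output.

```python
import json

# Assumption: tags in these groups describe the file on disk, the tool itself,
# or values ExifTool derives, rather than embedded metadata blocks.
BENIGN_GROUPS = {"File", "ExifTool", "Composite"}


def residual_tags(exiftool_json: str) -> dict[str, object]:
    """Return tags from an `exiftool -j -G` report that fall outside BENIGN_GROUPS."""
    report = json.loads(exiftool_json)[0]  # -j emits a list, one object per file
    leftovers = {}
    for key, value in report.items():
        group, separator, _name = key.partition(":")
        if not separator:
            continue  # "SourceFile" carries no group prefix
        if group not in BENIGN_GROUPS:
            leftovers[key] = value
    return leftovers


# Hand-written sample reports (illustrative only).
CLEAN = (
    '[{"SourceFile": "idied.jpg", "ExifTool:ExifToolVersion": 10.71, '
    '"File:FileName": "idied.jpg", "File:MIMEType": "image/jpeg", '
    '"Composite:ImageSize": "1280x963"}]'
)
DIRTY = (
    '[{"SourceFile": "idied.jpg", "File:FileName": "idied.jpg", '
    '"EXIF:Make": "EASTMAN KODAK COMPANY", "EXIF:SerialNumber": "KCFGP71706722"}]'
)

print(residual_tags(CLEAN))
print(residual_tags(DIRTY))
```

An empty result is a good sign, not a guarantee. Remember that the file name itself sits in the “File” group and can be just as revealing as any embedded tag.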

Managing media metadata

Using the Metadata Cleaner

If you’re a Linux user, the Metadata Anonymisation Toolkit (MAT) is a great way to scrub metadata. MAT works really well for a number of file types, like .jpg, .mp3, .flac, and other common media types. In Tails, the Metadata Cleaner application provides a user-friendly interface for MAT.

To use Metadata Cleaner, navigate to Apps > Accessories in Tails (or other flavors of Linux that use Gnome). Select the “Add Files” button on the top left. Find and select the file you want to clean, and click “Clean.” This will replace the existing file with a new, cleaned-up copy of the file, deleting the original.

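You can do much the same via the command line with mat2, the toolkit’s command-line program. Unlike the graphical app, mat2 leaves the original file untouched by default and writes the cleaned copy next to it with “.cleaned” added to the name (for example, image.cleaned.jpg). Navigate to: Apps > System Tools > Console and input: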

amnesia@amnesia:~$ mat2 /path/to/dirty/image.jpg

Using FFmpeg

FFmpeg is a much-loved audiovisual Swiss Army knife that helps users manipulate rich media file types, like .mp4, .mov, .mkv, and .wmv. With FFmpeg, making a metadata-free copy of your original file is as simple as running:

ffmpeg -i /path/to/original/file.mp4 -map_metadata -1 -c:v copy -c:a copy /path/to/clean/clone.mp4
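For anyone scripting this, here is a minimal Python sketch that builds the same command piece by piece, with a comment on what each option does. The function names are hypothetical, and the sketch assumes FFmpeg is installed and on your PATH.

```python
import subprocess


def build_strip_command(source: str, destination: str) -> list[str]:
    """Build the FFmpeg argument list for a metadata-free copy."""
    return [
        "ffmpeg",
        "-i", source,            # the original file to read
        "-map_metadata", "-1",   # do not copy the input's global metadata
        "-c:v", "copy",          # copy the video stream without re-encoding
        "-c:a", "copy",          # copy the audio stream without re-encoding
        destination,             # the clean clone to write
    ]


def strip_metadata(source: str, destination: str) -> None:
    """Run FFmpeg; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_strip_command(source, destination), check=True)
```

Because the streams are copied rather than re-encoded, this is fast and lossless, but tags attached to individual streams may survive. It is worth running ExifTool on the clone afterward to see what is left.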

Bad news about Word docs, PDFs, etc.

The aforementioned tools work really well with visual and audio media, but text documents are, unfortunately, much more complex. Documents like .docx, .xlsx, .pdf, .ppt, and others usually contain multiple embedded images, videos, and other media files. They’re kind of like nesting dolls. So, while it’s possible to scrub basic metadata tags from any of these documents, each object embedded within them carries metadata of its own that can be individually scrutinized. This makes the idea of redaction with most software somewhat foolish.

Here’s an example: Using another open source tool called Peepdf, we’re able to see all the different objects (like images) embedded into any .pdf file. So, even if we were to strip the metadata from the document itself, anyone can extract any of its individual embedded images and parse their metadata for more identifying context using any of the methods we’ve discussed. (Also, did I forget to mention that embedded images could be extremely tiny and not visible to the naked eye?)
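To get a feel for the nesting-doll problem without installing anything, here is a deliberately crude Python sketch that counts image markers in a PDF’s raw bytes. This is not how Peepdf works and it is no substitute for it: it will miss images stored inside compressed object streams, so treat a low count as inconclusive. The function name is hypothetical, and the sample bytes are hand-written fragments for illustration, not a valid PDF.

```python
import re

# Image objects in a PDF declare themselves with "/Subtype /Image".
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def count_image_markers(pdf_bytes: bytes) -> int:
    """Count uncompressed image-object markers in raw PDF bytes."""
    return len(IMAGE_MARKER.findall(pdf_bytes))


# Hand-written fragments: two image objects (one of them 1x1) and one form.
SAMPLE = (
    b"1 0 obj << /Type /XObject /Subtype /Image /Width 1 /Height 1 >> endobj\n"
    b"2 0 obj << /Type /XObject /Subtype/Image /Width 640 /Height 480 >> endobj\n"
    b"3 0 obj << /Type /XObject /Subtype /Form >> endobj\n"
)

print(count_image_markers(SAMPLE))
```

To try it on a real file, you could read the bytes with open(path, "rb").read() and pass them in.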

Instead, it’s best to recreate the document by flattening all the embedded objects before exporting and sharing it. For these types of documents, that means either printing them out and rescanning, or exporting them to a different format altogether.

If you’re interested in named entity recognition, word frequencies, or just better searching within text, a flattened PDF file will be hard to work with, because all the text will now be image-based. Thankfully, tools like Tesseract exist to “read” images into workable text. You can explode a flattened PDF into individual images of the pages using a scanner, then feed the pages into Tesseract to create a text document that can be worked on with any language processing tool. Beware, however, that optical character recognition is imperfect, and you might have to comb through the resulting text to fix typos. Tesseract’s English language data is installed by default, and data files for other languages are available.
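As a rough illustration of that workflow, the Python sketch below runs the tesseract command on one page image and then counts word frequencies in whatever text comes back. The function names are hypothetical, the word-splitting rule is a simple assumption that suits English, and the OCR step assumes Tesseract is installed and on your PATH.

```python
import re
import subprocess
from collections import Counter


def ocr_page(image_path: str, output_base: str, language: str = "eng") -> str:
    """Run the tesseract CLI on one page image and return the recognized text."""
    # tesseract writes its result to "<output_base>.txt".
    subprocess.run(["tesseract", image_path, output_base, "-l", language], check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


def word_frequencies(text: str) -> Counter:
    """Count lowercase words; anything that is not a letter or apostrophe splits words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# The counting step works on any text, OCR or not.
counts = word_frequencies("The union met. The Union voted.")
print(counts.most_common(2))
```

OCR typos will show up as rare one-off words in the counts, which makes a frequency list a handy place to start proofreading.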

Other tools

If you don’t have Photoshop, you may find its open source alternative, the GNU Image Manipulation Program (GIMP), useful for performing visual redactions on PDFs and other documents.

Audacity is an audio tool kit that allows you to splice audio to your liking. I find it’s the perfect tool for editing interviews that may contain off-the-record statements.

Dangerzone, a free and open source project maintained by Freedom of the Press Foundation (FPF), allows you to create safe PDFs from potentially unsafe PDF, image, or document files. It’s easy to use and runs on macOS, Windows, and Linux.

On the go? Signal allows you to send a scrubbed copy of an image to yourself with its “Note to Self” feature.

Be aware that these types of edits are nondestructive, meaning that metadata, project history, and artifacts in the original files can be uncovered by forensic analysis. GIMP and Audacity are a great way to perform visual and audio redactions, but you should still take care to flatten your media before publishing, use ExifTool to verify you’ve done it correctly, and consider “jumping the analog hole.”

The analog hole

Although there are a number of really great software tools to help understand, manipulate, and scrub metadata, nothing is perfect. As we explored in the previous section, digital forensic specialists might still be able to uncover bits of history from the bytes in any digital artifact.

One creative way to be sure that original metadata is inaccessible is to recreate the original through “the analog hole.”

Have you ever bought a bootleg movie? (It’s OK, no judgments!) If you have, you might remember that those movies were created by someone sneaking into the theater with their own camcorder and simply taping the entire movie from their seat. That’s an example of the “analog hole,” and you can use similar tactics to create totally separate copies of your original media without having to touch the original file or its metadata at all.

Some ideas for jumping through the analog hole

Caveats galore

Once again, nothing is ever perfect. Even the analog hole might lead to some trouble. For example, a well-known tactic in the intelligence community is to create several almost identical copies of the same document, each one containing its own minute typos. That way, if a sensitive document finds itself published in the press, the whistleblower could be identified because the published document would contain the telltale typo. This is a clear example of the myriad ways a source may still be compromised despite the great consideration and care you have taken to protect their digital assets. Please be mindful of this when working with submissions.
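To see how little it takes for such a mark to single out one copy, here is a minimal Python sketch that compares two nearly identical passages word by word. The function name and both sample sentences are invented for illustration.

```python
import difflib


def differing_words(copy_a: str, copy_b: str) -> list[tuple[str, str]]:
    """Return (words in copy_a, words in copy_b) for every stretch that differs."""
    words_a, words_b = copy_a.split(), copy_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return differences


# Invented sample text: the second copy carries a single planted typo.
COPY_A = "The committee will meet on Tuesday to review the budget."
COPY_B = "The comittee will meet on Tuesday to review the budget."

print(differing_words(COPY_A, COPY_B))
```

A single misspelled word is enough to tell the two copies apart, and it survives printing, scanning, and retyping. That is exactly why the analog hole does not help here.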