Harlo Holmes

Chief Information Security Officer and Director of Digital Security

What is metadata?

Metadata is data about data. Every single digital artifact has it. It describes the who, what, when, where, how, and sometimes even the why of any document, video, photo, or sound clip. This information comes in handy sometimes, like when you’re flipping through old pictures by date or by location. But in the wrong hands, this same information could be damaging.

So, what does it look like?

Metadata exists in the parts of images, videos, or music that we can’t experience as humans. But if you pry into any digital artifact, you can see metadata as a list of keys (or tags) and their corresponding values. One of the simplest tags is “Creation Date,” which naturally points to the time when its creator pushed the shutter button, or pressed record. Other interesting tags include the “Make” and “Model” tags, which can tell you what type of camera or computer was used to create the media. There are dozens of such tags, and each one can help tell a very distinct story; this is why understanding how rich metadata is can better help protect the identities of sources who have shared their digital media with you.
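To make "prying into a digital artifact" a little more concrete, here is a minimal Python sketch that walks the header of a JPEG file and lists the segments that sit in front of the actual picture data. Those segments are where Exif and similar tags usually live. The function name, the lookup table, and the tiny synthetic sample are my own illustrations, and the sketch assumes a simple, well-formed file; it is not a full JPEG parser.

```python
import struct

# A few marker bytes for segments that commonly carry metadata.
# This table is illustrative, not exhaustive.
SEGMENT_NAMES = {
    0xE0: "APP0 (JFIF header)",
    0xE1: "APP1 (Exif or XMP)",
    0xE2: "APP2 (ICC profile)",
    0xED: "APP13 (IPTC / Photoshop)",
    0xFE: "COM (comment)",
}


def list_jpeg_segments(data: bytes):
    """Return (name, payload_length) for each segment before the image data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:  # SOS: the compressed pixel data starts here
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        segments.append((SEGMENT_NAMES.get(marker, f"0x{marker:02X}"), length - 2))
        pos += 2 + length
    return segments


def make_segment(marker: int, payload: bytes) -> bytes:
    """Build one synthetic segment, only for the demonstration below."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


# A fake, hand-built header: not a viewable image, just enough structure to walk.
sample = (
    b"\xff\xd8"
    + make_segment(0xE0, b"JFIF\x00")
    + make_segment(0xE1, b"Exif\x00\x00fake")
    + b"\xff\xda\x00\x02"
)

for name, size in list_jpeg_segments(sample):
    print(f"{name}: {size} bytes")
```

On a real photo you would read the bytes from disk (for example with `open(path, "rb").read()`) and would typically see a much larger APP1 segment, since that is where the camera writes its tags.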

Setting up a media workstation sandbox

Most of the tools mentioned in this guide can be used on most computer operating systems. While you could easily turn your computer into a powerful, metadata-crunching workhorse, take your own privacy and security into consideration. You may be working with extra-sensitive material, so it might not be wise to handle it on your day-to-day machine.

I find it easiest to juggle these considerations by compartmentalizing my workspace; having a dedicated space to prod, cut, copy, and paste gives me more confidence in my ability to handle sensitive media safely and sanely. I build myself a “sandbox”: a somewhat safe place to do somewhat dangerous things.

Tails

Tails (The Amnesic Incognito Live System) is a fully self-contained computer that lives on a USB drive. To use it, install Tails on a blank USB drive and plug it into any PC or Mac. You’ll need to instruct your computer to boot from USB instead of your normal operating system (e.g. macOS or Windows). When you boot into it, you start a session in which you can do whatever you want to do, in relative safety. Once you shut down, all traces of your session are erased. This makes it an ideal sandbox.

Tails is an almost perfect choice for a media workstation, as it comes with tools like MAT, Exiftool, GIMP, and Audacity right out of the box. For software packages that aren’t installed on Tails by default, you will have to start a Tails session with an administration password set, and then download and install the appropriate software.

For example, if you want to install a PDF cleanup tool like First Look Media’s PDF Redact Tools in Tails, first connect to the internet and wait for Tor to be ready. Then, navigate to: Applications > System Tools > Synaptic Package Manager and use the search feature to look for “pdf redact tools.”

Once the installation is complete, Tails will ask if you want to install the selected application for just this session, or all sessions (with persistence enabled) going forward. Let’s talk about the latter option...

Keeping installed software in Tails after rebooting

Remember, Tails is amnesic; once you end a session, all files and software that weren’t originally included in Tails will be lost. However, there is a way to enable persistence on your Tails USB drive so you can install extra software, manage projects, etc. between reboots. Follow the instructions from the Tails website for enabling persistence before installing new programs or starting more advanced projects.

Additional software takes some time to be available across reboots. This is because Tails must re-install each program at the beginning of a new session. Please be patient, and wait for the notification reading “Your additional software are installed” before attempting to use any additional programs.

Exploring metadata with metadata tools

Analysis with Exiftool

Note: As of March 1st, 2022, the version of Exiftool available in Tails 4.27, Exiftool 11.16, has not yet been updated to address a recent security vulnerability discovered in its codebase. If you intend to use Exiftool with untrusted documents, we recommend using Exiftool 12.24 or above.

Exiftool is an open source software program that allows you to analyze, edit, and clear metadata. While it’s capable of handling multiple file types (images, videos, audio, text, etc.), it isn’t exceptionally capable of removing or overwriting metadata from files other than simple image formats. There are better tools and workflows to fully remove metadata, but we’ll get to this in another section.

In this section, let’s use Exiftool to explore metadata in more depth.

Example: a picture from Flickr (.jpg)

In this example, I was able to read the entire history of an image I posted to my Flickr account.

me@computer:~$ exiftool idied.jpg 
ExifTool Version Number         : 10.71
File Name                       : idied.jpg
Directory                       : .
File Size                       : 170 kB
File Modification Date/Time     : 2018:01:04 01:06:30-05:00
File Access Date/Time           : 2018:01:04 01:06:31-05:00
File Inode Change Date/Time     : 2018:01:04 01:06:31-05:00
File Permissions                : rw-r--r--
File Type                       : JPEG
File Type Extension             : jpg
MIME Type                       : image/jpeg
JFIF Version                    : 1.01
Exif Byte Order                 : Little-endian (Intel, II)
Make                            : EASTMAN KODAK COMPANY
Camera Model Name               : KODAK EASYSHARE C653 ZOOM DIGITAL CAMERA
Orientation                     : Rotate 270 CW
X Resolution                    : 480
Y Resolution                    : 480
Resolution Unit                 : inches
Y Cb Cr Positioning             : Co-sited
Exposure Time                   : 1/13
F Number                        : 4.6
Exposure Program                : Program AE
ISO                             : 160
Exif Version                    : 0221
Date/Time Original              : 2006:01:09 07:25:05
Create Date                     : 2006:01:09 07:25:05
Components Configuration        : Y, Cb, Cr, -
Shutter Speed Value             : 1/13
Aperture Value                  : 4.8
Exposure Compensation           : 0
Max Aperture Value              : 4.8
Metering Mode                   : Multi-segment
Light Source                    : Unknown
Flash                           : Off, Did not fire
Focal Length                    : 18.0 mm
Serial Number                   : KCFGP71706722
Flashpix Version                : 0100
Color Space                     : sRGB
Exif Image Width                : 2848
Exif Image Height               : 2144
Interoperability Index          : R98 - DCF basic file (sRGB)
Interoperability Version        : 0100
Exposure Index                  : 160
Sensing Method                  : One-chip color area
File Source                     : Digital Camera
Scene Type                      : Directly photographed
Custom Rendered                 : Normal
Exposure Mode                   : Auto
White Balance                   : Auto
Digital Zoom Ratio              : 0
Focal Length In 35mm Format     : 108 mm
Scene Capture Type              : Standard
Gain Control                    : Low gain up
Contrast                        : Normal
Saturation                      : Normal
Sharpness                       : Normal
Subject Distance Range          : Unknown
Compression                     : JPEG (old-style)
Thumbnail Offset                : 12214
Thumbnail Length                : 5778
Image Width                     : 1280
Image Height                    : 963
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:0 (2 2)
Aperture                        : 4.6
Image Size                      : 1280x963
Megapixels                      : 1.2
Scale Factor To 35 mm Equivalent: 6.0
Shutter Speed                   : 1/13
Thumbnail Image                 : (Binary data 5778 bytes, use -b option to extract)
Circle Of Confusion             : 0.005 mm
Field Of View                   : 18.9 deg
Focal Length                    : 18.0 mm (35 mm equivalent: 108.0 mm)
Hyperfocal Distance             : 14.07 m
Light Value                     : 7.4

That’s a lot of information in just one photo! Among other things, we know that sometime in 2006 (imperfect timestamps notwithstanding), someone took a photo of me with my Kodak EasyShare camera. The lighting, lack of flash, and aperture are decisions the photographer made. Most importantly, you might notice the “Serial Number” tag: it’s now a very public fact that I owned this camera in the early 2000s.
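When a listing runs to dozens of lines, it helps to pull out the handful of tags that could point back to a person or a device. The sketch below parses Exiftool's plain-text "Tag : Value" output and keeps only the tags on a watchlist. The function names and the watchlist itself are my own assumptions, not part of Exiftool; adjust the list to your own threat model.

```python
# Hypothetical watchlist: tag names as they appear in Exiftool's text output.
WATCHLIST = {
    "Make",
    "Camera Model Name",
    "Serial Number",
    "Date/Time Original",
    "Create Date",
    "Creator",
    "Producer",
}


def parse_exiftool_output(text: str) -> dict:
    """Turn 'Tag Name   : value' lines into a dict of tag -> value."""
    tags = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as dates contain colons too.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip prompts and blank lines
        # If a tag is printed twice, the later line wins.
        tags[name.strip()] = value.strip()
    return tags


def flag_identifying(tags: dict, watchlist=WATCHLIST) -> dict:
    """Keep only the tags that appear on the watchlist."""
    return {name: value for name, value in tags.items() if name in watchlist}


# A few lines copied from the listing above.
listing = """\
Make                            : EASTMAN KODAK COMPANY
Camera Model Name               : KODAK EASYSHARE C653 ZOOM DIGITAL CAMERA
ISO                             : 160
Date/Time Original              : 2006:01:09 07:25:05
Serial Number                   : KCFGP71706722
"""

for name, value in flag_identifying(parse_exiftool_output(listing)).items():
    print(f"{name}: {value}")
```

Everything on the watchlist except the ISO line is reported, with the date left intact even though it contains colons of its own.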

Luckily, we can use this same tool to scrub all the personalizing metadata from the image.

me@computer:~$ exiftool -all= idied.jpg

This command works well with .jpg images, but is not guaranteed to work for a lot of other file types. (So, keep reading!)

Example: a cool podcast from The Internet Archive (.mp3)

me@computer:~$ exiftool RubenerdShow363.mp3 
ExifTool Version Number         : 10.71
File Name                       : RubenerdShow363.mp3
Directory                       : .
File Size                       : 23 MB
File Modification Date/Time     : 2018:01:03 14:11:11-05:00
File Access Date/Time           : 2018:01:03 21:29:45-05:00
File Inode Change Date/Time     : 2018:01:03 14:11:14-05:00
File Permissions                : rw-r--r--
File Type                       : MP3
File Type Extension             : mp3
MIME Type                       : audio/mpeg
MPEG Audio Version              : 1
Audio Layer                     : 3
Audio Bitrate                   : 128 kbps
Sample Rate                     : 44100
Channel Mode                    : Joint Stereo
MS Stereo                       : On
Intensity Stereo                : Off
Copyright Flag                  : False
Original Media                  : True
Emphasis                        : None
Encoder                         : LAME3.99r
Lame VBR Quality                : 4
Lame Quality                    : 0
Lame Method                     : CBR
Lame Low Pass Filter            : 17 kHz
Lame Bitrate                    : 128 kbps
Lame Stereo Mode                : Joint Stereo
ID3 Size                        : 57034
Release Time                    : 2017
Original Release Time           : 2017:07:14
Recording Time                  : 2017:07:14
Encoding Time                   : 2017:07:14
Tagging Time                    : 2017:07:14
Picture MIME Type               : image/png
Picture Type                    : Front Cover
Picture Description             : 
Picture                         : (Binary data 54706 bytes, use -b option to extract)
Lyrics                          : (SHOWNOTES) 25:22 Join Ruben as he harkens back to one of the first reboot episodes in 2015, when he was also wandering around an empty house that was once his home. Two years later, and he's moving out of the place he moved away from that earlier place to. This show description had several variants of the word move in it. Recorded 3rd of July 2017...Recorded in Sydney, Australia. Licence for this track: Creative Commons Attribution 3.0. Attribution: Ruben Schade...Released July 2017 on Rubnerd and The Overnightscape Underground, an Internet talk radio channel focusing on a freeform monologue style, with diverse and fascinating hosts...
Track                           : 363
Artist                          : Ruben Schade
Album                           : Rubnerd Show
Band                            : Ruben Schade
Title                           : 363: The everything except episode
Genre                           : New Time Radio
Publisher                       : Ruben Schade
Internet Radio Station Name     : Overnightscape Underground
Internet Radio Station Owner    : Frank Edward Nora
File URL                        : https://archive[.]org/download/RubenerdShow363/RubenerdShow363.mp3
Artist URL                      : https://rubenerd[.]com/
Source URL                      : https://rubenerd[.]com/show363/
Internet Radio Station URL      : https://onsug[.]com/
Copyright URL                   : http://creativecommons[.]org/licenses/by/3.0/
Publisher URL                   : https://rubenerd[.]com/show/
Date/Time Original              : 2017:07:14
Duration                        : 0:25:18 (approx)

Example: a PDF from an office scanner (.pdf)

me@computer:~$ exiftool Anonymous\ Witness\ 1\,\ Union\ Laborer_3.13.90.pdf 
ExifTool Version Number         : 10.71
File Name                       : Anonymous Witness 1, Union Laborer_3.13.90.pdf
Directory                       : .
File Size                       : 1849 kB
File Modification Date/Time     : 2017:12:15 04:53:38-05:00
File Access Date/Time           : 2017:12:15 04:53:38-05:00
File Inode Change Date/Time     : 2018:01:04 01:22:47-05:00
File Permissions                : rw-r--r--
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.4
Linearized                      : No
Creator                         : KMBT_283
Producer                        : KONICA MINOLTA bizhub 283
Create Date                     : 2017:02:14 17:58:02-05:00
Page Count                      : 8

Do you notice the “Creator” and “Producer” tags? They name the scanner that made the file (a Konica Minolta bizhub 283). With the right additional information, it might be possible to pinpoint exactly where in a certain office building a document was created.

So now you know.

Once again, Exiftool is best as a sanity check. It’s always great to return to this tool to verify that you’ve scrubbed all the possible metadata via other methods. So, now that we understand what metadata looks like, how do we safely remove metadata?
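One way to make that sanity check repeatable is to ask Exiftool for JSON (its `-j` option) and compare the tags it reports against a list of tags you expect to survive, such as file system details and basic image structure. The sketch below only does the comparison; the allowlist and function name are my own assumptions, and you should expect to tune the list for each file type.

```python
import json

# Assumed allowlist: tags that describe the file on disk or the bare image
# structure rather than its history. Names follow Exiftool's JSON output.
EXPECTED_TAGS = {
    "SourceFile", "ExifToolVersion", "FileName", "Directory", "FileSize",
    "FileModifyDate", "FileAccessDate", "FileInodeChangeDate",
    "FilePermissions", "FileType", "FileTypeExtension", "MIMEType",
    "ImageWidth", "ImageHeight", "EncodingProcess", "BitsPerSample",
    "ColorComponents", "YCbCrSubSampling", "ImageSize", "Megapixels",
}


def leftover_tags(exiftool_json: str, expected=EXPECTED_TAGS) -> dict:
    """Map each file to the tags that are not on the allowlist."""
    records = json.loads(exiftool_json)  # exiftool -j prints a list of objects
    return {
        record.get("SourceFile", "?"): sorted(set(record) - expected)
        for record in records
    }


# Hand-written stand-in for the output of: exiftool -j idied.jpg
report = """[{
  "SourceFile": "idied.jpg",
  "FileName": "idied.jpg",
  "ImageWidth": 1280,
  "ImageHeight": 963,
  "SerialNumber": "KCFGP71706722"
}]"""

print(leftover_tags(report))
```

An empty list for a file means nothing unexpected was reported. It does not prove the file is clean, only that Exiftool found nothing outside the allowlist.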

Managing media metadata

Using the MAT

If you’re a Linux user, the Metadata Anonymisation Toolkit, or MAT, is a great tool to help you scrub metadata. This tool works really well for a number of file types, like .jpg, .mp3, .flac, and other common media types.

Figure: MAT2 context menu in Tails OS

To use MAT, navigate to Places in Tails (or other flavors of Linux that use Gnome) and find the location of the file you want to clean. After you find it, right click on the file and click on “Remove metadata.” This will create a new, cleaned up copy of the file, leaving the original intact.

If you run into a “Failed to clean some items” error, click the “Show” button to see whether your file type is unsupported or something else went wrong.

You can do the same via the command line. Navigate to: Applications > Accessories > Terminal and input:

amnesia@amnesia:~$ mat2 /path/to/dirty/image.jpg

Using FFmpeg

FFmpeg is a much-loved audio-visual Swiss Army knife that helps users manipulate rich media file types, like .mp4, .mov, .mkv, and .wmv. With FFmpeg, making a metadata-free copy of your original file is as simple as running:

ffmpeg -i /path/to/original/file.mp4 -map_metadata -1 -c:v copy -c:a copy /path/to/clean/clone.mp4
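If you have a batch of files to clean, you may want to wrap that command in a small script. This is a minimal Python sketch that builds exactly the same arguments as the command above and refuses to overwrite the original. The function names are mine, and the script assumes the `ffmpeg` program is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def build_strip_command(src, dst) -> list:
    """Build the same ffmpeg invocation shown above as an argument list."""
    return [
        "ffmpeg",
        "-i", str(src),
        "-map_metadata", "-1",  # do not carry metadata over from the input
        "-c:v", "copy",         # copy the video stream without re-encoding
        "-c:a", "copy",         # copy the audio stream without re-encoding
        str(dst),
    ]


def strip_metadata(src, dst) -> None:
    """Write a metadata-free clone of src to dst, leaving src untouched."""
    if Path(src).resolve() == Path(dst).resolve():
        raise ValueError("refusing to overwrite the original file")
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_strip_command(src, dst), check=True)


print(" ".join(build_strip_command("original.mp4", "clone.mp4")))
```

Passing the arguments as a list, rather than one long string, avoids trouble with spaces and quotes in file names. It is still worth running Exiftool on the clone afterwards to see what remains.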

Bad news about Word docs, PDFs, etc

The aforementioned tools work really well with visual and audio media, but text documents are unfortunately much more complex. Documents like .docx, .xlsx, .pdf, .ppt, and others usually contain multiple embedded images, videos, and other media files. They’re kind of like nesting dolls. So, while it’s possible to scrub basic metadata tags from any of these documents, the objects embedded within them carry metadata of their own, and each one can be individually scrutinized. This makes the idea of software-based redaction somewhat foolish.

Here’s an example: Using another open source tool called Peepdf, we’re able to see all the different objects (like images) embedded in any .pdf file. So, even if we were to strip the metadata from the document itself, anyone could extract any of its individual embedded images and parse their metadata for more identifying context using any of the aforementioned methods. (Also, did I forget to mention that embedded images could be extremely tiny, and invisible to the naked eye?)
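For a rough first impression without any special tooling, you can simply count how many image objects a PDF declares. The Python sketch below searches the raw bytes for the `/Subtype /Image` entry that marks an embedded image. This is a deliberately crude stand-in for what Peepdf does, not how Peepdf works: it will miss images whose definitions are stored inside compressed object streams, so treat a low count as inconclusive. The function name and sample bytes are made up for illustration.

```python
import re

# Matches the dictionary entry that marks an image XObject. The \b keeps
# it from matching longer names such as /ImageMask.
IMAGE_OBJECT = re.compile(rb"/Subtype\s*/Image\b")


def count_image_objects(pdf_bytes: bytes) -> int:
    """Count image declarations visible in the uncompressed part of a PDF."""
    return len(IMAGE_OBJECT.findall(pdf_bytes))


# Hand-written fragment, not a complete PDF: two images and one form object.
fragment = (
    b"1 0 obj << /Type /XObject /Subtype /Image /Width 1 /Height 1 >> endobj\n"
    b"2 0 obj << /Type /XObject /Subtype/Image/Width 2 /Height 2 >> endobj\n"
    b"3 0 obj << /Type /XObject /Subtype /Form >> endobj\n"
)

print(count_image_objects(fragment))
```

For a real file you would pass in `open(path, "rb").read()`. Each image found this way is a candidate for the same Exiftool scrutiny as a standalone photo.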

Instead, it’s best to recreate the document by flattening all the embedded objects before exporting and sharing it. For these types of documents, that means either printing them out, then rescanning; or exporting them to a different format altogether.

First Look Media’s PDF Redact Tools is a great PDF flattening tool. It automates metadata removal by creating an image of each page within a document, and gluing them back together into a brand new PDF. While this is a fabulous tool, there are two downsides: the resulting PDF is usually a lot larger than the original, which might make export and sharing more cumbersome; and it relies upon a library, ImageMagick, with a somewhat buggy history. That said, PDF Redact Tools is incredibly easy to work with, and does an excellent job at metadata removal. If you can install it on a dedicated, sandboxed machine, it makes a great tool to have in your toolkit. Note: PDF Redact Tools is no longer an actively maintained software project, and future security vulnerabilities found in it may not be fixed. It can, however, still be used relatively safely in an isolated environment, ideally an air-gapped Tails drive.

If you’re interested in doing named entity recognition (NER), word frequencies, or just better searching within text, a flattened PDF file will be hard to work with because all the text will now be image-based. Thankfully, tools exist to “read” images into workable text, like Tesseract. You can explode a flattened PDF into individual images of the pages using PDF Redact Tools, then feed the pages into Tesseract to create a text document that can be worked on with any language processing tool. Beware, however, that optical character recognition is imperfect, and you might have to comb through the resulting text to fix typos. The English dictionary data is installed by default, but other language data files are available.
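As a rough sketch of the second half of that workflow, the Python below builds one Tesseract command per page image. It relies on Tesseract's basic command-line form, `tesseract IMAGE OUTPUTBASE -l LANG`, which writes the recognized text to OUTPUTBASE.txt. The page file names are hypothetical (use whatever names your exploded pages actually have), and the function name is my own.

```python
from pathlib import Path


def build_ocr_commands(pages, lang="eng") -> list:
    """Return one tesseract argument list per page image, in the order given."""
    commands = []
    for page in pages:
        page = Path(page)
        # Output base is the image path without its extension;
        # tesseract appends ".txt" to it on its own.
        commands.append(
            ["tesseract", str(page), str(page.with_suffix("")), "-l", lang]
        )
    return commands


# Hypothetical page images from an exploded PDF.
for command in build_ocr_commands(["page-0.png", "page-1.png"]):
    print(" ".join(command))
```

Each argument list can be handed to `subprocess.run(command, check=True)` on a machine with Tesseract installed. Keep the pages in reading order so the resulting text files can be joined back together sensibly.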

Other redaction tools

If you don’t have Photoshop, the GNU Image Manipulation Program (GIMP) is an open source alternative that can be used to perform visual redactions on PDFs and other documents.

Audacity is an audio toolkit that allows you to splice audio to your liking. I find it’s the perfect tool for editing interviews that may contain off-the-record statements.

Be aware that these types of edits are non-destructive, meaning that metadata, project history, and artifacts from the original files can be uncovered by forensic analysis. GIMP and Audacity are great for performing audio and visual redactions, but you should still flatten your media before publishing, either with the tools above or by “jumping the analog hole,” and then use Exiftool to verify that the metadata is gone.

The Analog Hole

Although there are a number of really great software tools to help understand, manipulate, and scrub metadata, nothing is perfect. As we explored in the previous section, digital forensic specialists might still be able to uncover bits of history from the bytes in any digital artifact. One creative way to be sure that original metadata is inaccessible is to recreate the original through “the analog hole.”

Have you ever bought a bootleg movie? (It’s ok, no judgements!) If you have, you might remember that those movies were created by someone sneaking into the theater with their own camcorder, and simply taping the entire movie from their seat. That’s an example of the analog hole, and you can use similar tactics to create unattributable copies of your original media.

Some ideas for jumping through the analog hole

Images: Take a screenshot from your computer and publish that instead.
Video: On macOS? Use a screen recording app, like QuickTime, to capture a movie from your screen as it plays.
Audio: Purchase an audio loopback cable, and play an audio file directly into a digital recorder. Or, purchase a USB adapter to record audio input directly into your computer.
Office Documents/PDFs: Copy the text into a new document. Print the replicated document, and re-scan it into your computer.

Caveats Galore

Once again, nothing is ever perfect. Even the analog hole might lead to some trouble. For example, a well-known tactic in the intelligence community is to create several almost identical copies of the same document, each containing its own minute typos. That way, if a sensitive document is published in the press, the whistleblower can be identified because the published document contains the telltale typo. This is a clear example of the myriad ways a source may still be compromised despite the great consideration and care you have taken to protect their digital assets. Please be mindful of this when working with submissions.
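You will rarely hold more than one copy of such a document, but if you ever do, comparing them word by word is a quick way to spot this kind of fingerprint. Here is a minimal Python sketch using the standard library's difflib; the function name and the two sample sentences are invented for illustration.

```python
import difflib


def distinguishing_words(copy_a: str, copy_b: str) -> list:
    """Return (text_in_a, text_in_b) pairs wherever two copies disagree."""
    words_a, words_b = copy_a.split(), copy_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            differences.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return differences


# Two invented copies that differ by a single misspelling.
copy_one = "The committee will recieve the report on Monday"
copy_two = "The committee will receive the report on Monday"

print(distinguishing_words(copy_one, copy_two))
```

Because the comparison works on whitespace-separated words, it ignores differences in spacing and line breaks, which can also be used to mark copies. An empty result is therefore not proof that two copies are identical.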
