Metadata is data about data. Every single digital artifact has it. It describes the who, what, when, where, how, and sometimes even why of any document, video, photo, or sound clip. This information comes in handy sometimes, like when you’re flipping through old pictures by date or by location. But in the wrong hands, this same information could be damaging.
Metadata exists in the parts of images, videos, or music that we can’t experience as humans. But if you pry into any digital artifact, you can see metadata as a list of keys (or tags) and their corresponding values. One of the simplest tags is “Creation Date,” which naturally points to the time when its creator pushed the shutter button, or pressed record. Other interesting tags include the “Make” and “Model” tags, which can tell you what type of camera or computer was used to create the media. There are dozens of such tags, and each one can help tell a very distinct story. Understanding how rich metadata is will help you protect the identities of sources who have shared their digital media with you.
Most of the tools mentioned in this guide can be used on most computer operating systems. While you could easily turn your computer into a powerful, metadata-crunching workhorse, take your own privacy and security into consideration. You may be working with extra-sensitive material, so it might not be wise to handle it on your day-to-day machine.
I find it easiest to juggle these considerations by compartmentalizing my workspace; having a dedicated space to prod, cut, copy, and paste gives me more confidence in my ability to handle sensitive media safely and sanely. I build myself a “sandbox”: a somewhat safe place to do somewhat dangerous things.
Tails (The Amnesic Incognito Live System) is a fully self-contained computer that lives on a USB drive. To use it, install Tails on a blank USB drive and plug it into any PC or Mac. You’ll need to instruct your computer to boot from USB instead of your normal operating system (such as macOS or Windows). When you boot into it, you start a session in which you can do whatever you want to do, in relative safety. Once you shut down, all traces of your session are erased. This makes it an ideal sandbox.
Tails is an almost perfect choice for a media workstation, as it comes with tools like MAT, Exiftool, GIMP, and Audacity right out of the box. For software packages that aren’t installed on Tails by default, you will have to start a Tails session with an administration password set, and then download and install the appropriate software.
For example, if you want to install a PDF cleanup tool like First Look Media’s PDF Redact Tools in Tails, first connect to the internet and wait for Tor to get ready. Then, navigate to: Applications > System Tools > Synaptic Package Manager and use the search feature to look for “pdf redact tools.”
Once the installation is complete, Tails will ask if you want to install the selected application for just this session, or all sessions (with persistence enabled) going forward. Let’s talk about the latter option...
Remember, Tails is amnesic; once you end a session, all files and software that weren’t originally included in Tails will be lost. However, there is a way to enable persistence on your Tails USB drive so you can install extra software, manage projects, etc. between reboots. Follow the instructions from the Tails website for enabling persistence before installing things like FFmpeg or starting more advanced projects.
Additional software takes some time to be available across reboots. This is because Tails must re-install each program at the beginning of a new session. Please be patient, and wait for the notification reading “Your additional software are installed” before attempting to use any additional programs.
Exiftool is an open source software program that allows you to analyze, edit, and clear metadata. While it can read multiple file types (images, videos, audio, text, etc.), it isn’t especially good at removing or overwriting metadata in anything other than simple image formats. There are better tools and workflows to fully remove metadata, but we’ll get to this in another section.
In this section, let’s use Exiftool to explore metadata in more depth.
In this example, I was able to read the entire history of an image I posted to my Flickr account.
me@computer:~$ exiftool idied.jpg
ExifTool Version Number : 10.71
File Name : idied.jpg
Directory : .
File Size : 170 kB
File Modification Date/Time : 2018:01:04 01:06:30-05:00
File Access Date/Time : 2018:01:04 01:06:31-05:00
File Inode Change Date/Time : 2018:01:04 01:06:31-05:00
File Permissions : rw-r--r--
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
JFIF Version : 1.01
Exif Byte Order : Little-endian (Intel, II)
Make : EASTMAN KODAK COMPANY
Camera Model Name : KODAK EASYSHARE C653 ZOOM DIGITAL CAMERA
Orientation : Rotate 270 CW
X Resolution : 480
Y Resolution : 480
Resolution Unit : inches
Y Cb Cr Positioning : Co-sited
Exposure Time : 1/13
F Number : 4.6
Exposure Program : Program AE
ISO : 160
Exif Version : 0221
Date/Time Original : 2006:01:09 07:25:05
Create Date : 2006:01:09 07:25:05
Components Configuration : Y, Cb, Cr, -
Shutter Speed Value : 1/13
Aperture Value : 4.8
Exposure Compensation : 0
Max Aperture Value : 4.8
Metering Mode : Multi-segment
Light Source : Unknown
Flash : Off, Did not fire
Focal Length : 18.0 mm
Serial Number : KCFGP71706722
Flashpix Version : 0100
Color Space : sRGB
Exif Image Width : 2848
Exif Image Height : 2144
Interoperability Index : R98 - DCF basic file (sRGB)
Interoperability Version : 0100
Exposure Index : 160
Sensing Method : One-chip color area
File Source : Digital Camera
Scene Type : Directly photographed
Custom Rendered : Normal
Exposure Mode : Auto
White Balance : Auto
Digital Zoom Ratio : 0
Focal Length In 35mm Format : 108 mm
Scene Capture Type : Standard
Gain Control : Low gain up
Contrast : Normal
Saturation : Normal
Sharpness : Normal
Subject Distance Range : Unknown
Compression : JPEG (old-style)
Thumbnail Offset : 12214
Thumbnail Length : 5778
Image Width : 1280
Image Height : 963
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Aperture : 4.6
Image Size : 1280x963
Megapixels : 1.2
Scale Factor To 35 mm Equivalent: 6.0
Shutter Speed : 1/13
Thumbnail Image : (Binary data 5778 bytes, use -b option to extract)
Circle Of Confusion : 0.005 mm
Field Of View : 18.9 deg
Focal Length : 18.0 mm (35 mm equivalent: 108.0 mm)
Hyperfocal Distance : 14.07 m
Light Value : 7.4
That’s a lot of information in just one photo! Among other things, we know that sometime in 2006 (imperfect timestamps notwithstanding), someone took a photo of me with my Kodak EasyShare camera. The lighting, lack of flash, and aperture are decisions the photographer made. Most importantly, you might notice the “Serial Number” tag: it’s now a very public fact that I owned this camera in the 2000s.
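When a dump is as long as the one above, it can help to pull out only the tags that point back at a person or a device. Here is a minimal Python sketch of that idea. It is not part of Exiftool: it assumes you have captured Exiftool’s plain-text output as a string, and the `IDENTIFYING_TAGS` shortlist is my own illustrative selection, not an exhaustive list.

```python
# Hypothetical shortlist of tag names that tend to identify a person or device.
IDENTIFYING_TAGS = {
    "Make", "Camera Model Name", "Serial Number",
    "Artist", "Creator", "Producer",
}


def parse_exiftool_output(text):
    """Turn Exiftool's "Tag Name : value" lines into a dict."""
    tags = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as timestamps contain more.
        name, separator, value = line.partition(":")
        if separator:
            tags[name.strip()] = value.strip()
    return tags


def identifying_tags(tags):
    """Return only the tags on the shortlist above, sorted by name."""
    return {name: tags[name] for name in sorted(tags) if name in IDENTIFYING_TAGS}


sample = """\
Make : EASTMAN KODAK COMPANY
Camera Model Name : KODAK EASYSHARE C653 ZOOM DIGITAL CAMERA
Create Date : 2006:01:09 07:25:05
Serial Number : KCFGP71706722
"""

for name, value in identifying_tags(parse_exiftool_output(sample)).items():
    print(f"{name}: {value}")
```

One caveat with this simple approach: when a tag name appears twice in the output (as “Focal Length” does above), the dict keeps only the last value.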
Luckily, we can use this same tool to scrub all the personalizing metadata from the image.
me@computer:~$ exiftool "-all=" idied.jpg
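For the curious, here is roughly what scrubbing a JPEG involves under the hood. This is a simplified Python sketch of the general idea, not Exiftool’s implementation. A JPEG is a sequence of marked segments; EXIF data (including the embedded thumbnail) lives in an APP1 segment, and other APPn and comment segments can carry further metadata. The sketch copies the file while skipping those segments. Which segments to keep is my assumption: I keep APP0 (JFIF) and APP14 (Adobe) because decoders may rely on them, and I drop everything else in the APPn range, which also discards any embedded color profile.

```python
import struct

SOI = b"\xff\xd8"        # start-of-image marker
SOS = 0xDA               # start-of-scan: compressed pixel data follows
COM = 0xFE               # free-text comment segment
KEEP_APP = {0xE0, 0xEE}  # APP0 (JFIF) and APP14 (Adobe): assumed safe to keep


def strip_jpeg_metadata(data):
    """Return a copy of the JPEG bytes without APPn metadata or comments."""
    if data[:2] != SOI:
        raise ValueError("not a JPEG (missing start-of-image marker)")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a segment marker at byte {pos}")
        marker = data[pos + 1]
        if marker == SOS:
            # Everything from here on is image data; copy it verbatim.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            raise ValueError(f"corrupt segment length at byte {pos}")
        is_app = 0xE0 <= marker <= 0xEF
        drop = (is_app and marker not in KEEP_APP) or marker == COM
        if not drop:
            out += data[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("no start-of-scan marker found; file may be truncated")
```

Typical use would be `clean = strip_jpeg_metadata(open("idied.jpg", "rb").read())`, followed by writing `clean` to a new file and re-running Exiftool on the result to confirm what is left.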
me@computer:~$ exiftool RubenerdShow363.mp3
ExifTool Version Number : 10.71
File Name : RubenerdShow363.mp3
Directory : .
File Size : 23 MB
File Modification Date/Time : 2018:01:03 14:11:11-05:00
File Access Date/Time : 2018:01:03 21:29:45-05:00
File Inode Change Date/Time : 2018:01:03 14:11:14-05:00
File Permissions : rw-r--r--
File Type : MP3
File Type Extension : mp3
MIME Type : audio/mpeg
MPEG Audio Version : 1
Audio Layer : 3
Audio Bitrate : 128 kbps
Sample Rate : 44100
Channel Mode : Joint Stereo
MS Stereo : On
Intensity Stereo : Off
Copyright Flag : False
Original Media : True
Emphasis : None
Encoder : LAME3.99r
Lame VBR Quality : 4
Lame Quality : 0
Lame Method : CBR
Lame Low Pass Filter : 17 kHz
Lame Bitrate : 128 kbps
Lame Stereo Mode : Joint Stereo
ID3 Size : 57034
Release Time : 2017
Original Release Time : 2017:07:14
Recording Time : 2017:07:14
Encoding Time : 2017:07:14
Tagging Time : 2017:07:14
Picture MIME Type : image/png
Picture Type : Front Cover
Picture Description :
Picture : (Binary data 54706 bytes, use -b option to extract)
Lyrics : (SHOWNOTES) 25:22 Join Ruben as he harkens back to one of the first reboot episodes in 2015, when he was also wandering around an empty house that was once his home. Two years later, and he's moving out of the place he moved away from that earlier place to. This show description had several variants of the word move in it. Recorded 3rd of July 2017...Recorded in Sydney, Australia. Licence for this track: Creative Commons Attribution 3.0. Attribution: Ruben Schade...Released July 2017 on Rubnerd and The Overnightscape Underground, an Internet talk radio channel focusing on a freeform monologue style, with diverse and fascinating hosts...
Track : 363
Artist : Ruben Schade
Album : Rubnerd Show
Band : Ruben Schade
Title : 363: The everything except episode
Genre : New Time Radio
Publisher : Ruben Schade
Internet Radio Station Name : Overnightscape Underground
Internet Radio Station Owner : Frank Edward Nora
File URL : https://archive[.]org/download/RubenerdShow363/RubenerdShow363.mp3
Artist URL : https://rubenerd[.]com/
Source URL : https://rubenerd[.]com/show363/
Internet Radio Station URL : https://onsug[.]com/
Copyright URL : http://creativecommons[.]org/licenses/by/3.0/
Publisher URL : https://rubenerd[.]com/show/
Date/Time Original : 2017:07:14
Duration : 0:25:18 (approx)
me@computer:~$ exiftool Anonymous\ Witness\ 1\,\ Union\ Laborer_3.13.90.pdf
ExifTool Version Number : 10.71
File Name : Anonymous Witness 1, Union Laborer_3.13.90.pdf
Directory : .
File Size : 1849 kB
File Modification Date/Time : 2017:12:15 04:53:38-05:00
File Access Date/Time : 2017:12:15 04:53:38-05:00
File Inode Change Date/Time : 2018:01:04 01:22:47-05:00
File Permissions : rw-r--r--
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
PDF Version : 1.4
Linearized : No
Creator : KMBT_283
Producer : KONICA MINOLTA bizhub 283
Create Date : 2017:02:14 17:58:02-05:00
Page Count : 8
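Do you notice the “Creator” and “Producer” tags? They name the device that produced this PDF, a Konica Minolta bizhub 283. Combined with the right additional information, tags like these might make it possible to pinpoint exactly where in a certain office building a document was created.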
Once again, Exiftool is best as a sanity check. It’s always great to return to this tool to verify that you’ve scrubbed all the possible metadata via other methods. So, now that we understand what metadata looks like, how do we safely remove metadata?
If you’re a Linux user, the Metadata Anonymisation Toolkit, or MAT, is a great tool to help you scrub metadata. This tool works really well for a number of file types, like .flac, and other common media types.
To use MAT, navigate to Places in Tails (or other flavors of Linux that use GNOME) and find the location of the file you want to clean. After you find it, right click on the file and click on “Remove metadata.” This will create a new, cleaned up copy of the file, leaving the original intact.
If you run into a “Failed to clean some items” error, click the “Show” button to see whether your file type is unsupported, or whether something else went wrong.
You can do the same via the command line. Navigate to: Applications > Accessories > Terminal and input:
me@computer:~$ mat2 /path/to/dirty/image.jpg
FFmpeg is a much-loved audio-visual Swiss Army knife that helps users manipulate rich media file types, like .wmv. With FFmpeg, making a metadata-free copy of your original file is as simple as running:
ffmpeg -i /path/to/original/file.mp4 -map_metadata -1 -c:v copy -c:a copy /path/to/clean/clone.mp4
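If you have a whole folder of clips to clean, the same command can be generated per file. The following is a rough Python sketch under a few assumptions: FFmpeg is installed and on your PATH, the helper names `clean_copy_command` and `clean_folder` are my own, and the `-n` flag (which makes FFmpeg refuse to overwrite an existing output file) is my addition to the command shown above.

```python
import subprocess
from pathlib import Path


def clean_copy_command(source, destination):
    """Build the FFmpeg argument list for one metadata-free copy."""
    return [
        "ffmpeg",
        "-n",                   # my addition: never overwrite an existing file
        "-i", str(source),
        "-map_metadata", "-1",  # disable automatic copying of metadata
        "-c:v", "copy",         # copy the video stream without re-encoding
        "-c:a", "copy",         # copy the audio stream without re-encoding
        str(destination),
    ]


def clean_folder(folder, output_folder, pattern="*.mp4", dry_run=True):
    """Build (and optionally run) one command per matching file."""
    commands = []
    for source in sorted(Path(folder).glob(pattern)):
        command = clean_copy_command(source, Path(output_folder) / source.name)
        commands.append(command)
        if not dry_run:
            # check=True raises an error if FFmpeg exits with a failure code.
            subprocess.run(command, check=True)
    return commands
```

With `dry_run=True` (the default here) nothing is executed, so you can print the returned commands and review them before committing to a run. Afterwards, inspect the copies with Exiftool to confirm the result.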
The aforementioned tools work really well with visual and audio media, but text documents are unfortunately much more complex. Documents like .ppt, and others usually contain multiple embedded images, videos, and other media files. They’re kind of like nesting dolls. So, while it’s possible to scrub basic metadata tags from any of these documents, the objects embedded within them carry metadata of their own, and each one can be individually scrutinized. This makes relying on software-based redaction alone somewhat foolish.
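To make the nesting-doll idea concrete, here is a small Python sketch for one family of formats. It relies on the fact that modern Office files (.docx, .pptx, .xlsx) are ZIP archives, with embedded media stored under folders such as `word/media/` and author details under `docProps/`. It does not apply to the older binary formats like .ppt or .doc, which are not ZIP files. The function names and the demo archive are hypothetical; the demo is a stand-in built in memory, not a valid Word document.

```python
import io
import zipfile


def list_nested_parts(source):
    """List embedded media and property parts inside a ZIP-based Office file."""
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    return {
        # Pictures, audio, and video, plus embedded objects such as spreadsheets.
        "media": [n for n in names if "/media/" in n or "/embeddings/" in n],
        # Document-level properties such as author and last-modified-by.
        "properties": [n for n in names if n.startswith("docProps/")],
    }


def make_demo_docx():
    """Build a tiny stand-in archive in memory for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/media/image1.jpeg", b"\xff\xd8 photo bytes")
        archive.writestr("docProps/core.xml", "<dc:creator>me</dc:creator>")
    buffer.seek(0)
    return buffer


print(list_nested_parts(make_demo_docx()))
```

Every file listed under `media` is a complete image or clip in its own right, so it can carry its own tags (camera serial numbers included) no matter what the outer document’s properties say. Passing a real path, for example `list_nested_parts("report.docx")`, works the same way.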
Here’s an example: Using another open source tool called Peepdf, we’re able to see all the different objects (like images) embedded into any
Instead, it’s best to recreate the document by flattening all the embedded objects before exporting and sharing it. For these types of documents, that means either printing them out, then rescanning; or exporting them to a different format altogether.
First Look Media’s PDF Redact Tools is a great PDF flattening tool. It automates metadata removal by creating an image of each page within a document, and gluing them back together into a brand new PDF. While this is a fabulous tool, there are two downsides: the resulting PDF is usually a lot larger than its original, which might make export and sharing more cumbersome; and it relies upon a library, ImageMagick, with a somewhat buggy history. That said, PDF Redact Tools is incredibly easy to work with, and does an excellent job at metadata removal. If you can install it on a dedicated, sandboxed machine, it makes a great tool to have in your toolkit.
If you’re interested in doing named entity recognition (NER), word frequencies, or just better searching within text, a flattened PDF file will be hard to work with because all the text will now be image-based. Thankfully, tools exist to “read” images into workable text, like Tesseract. You can explode a flattened PDF into individual images of the pages using PDF Redact Tools, then feed the pages into Tesseract to create a text document that can be worked on with any language processing tool. Beware, however, that the optical character recognition is imperfect, and you might have to comb through the resulting text to fix typos. The English dictionary data is installed by default, but other language packs are available. Visit the Tesseract wiki for more information.
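Once OCR has given you plain text, even a few lines of Python can get you started on the word-frequency analysis mentioned above. This is a minimal sketch using only the standard library. It assumes English text already loaded into a string, and its tokenizer (lowercase letters with an optional apostrophe suffix) is a deliberately crude choice of mine, so OCR typos will show up as separate “words.”

```python
import re
from collections import Counter

# A word is a run of letters, optionally followed by an apostrophe suffix.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def word_frequencies(text, top=10):
    """Return the `top` most common words as (word, count) pairs."""
    words = WORD_PATTERN.findall(text.lower())
    return Counter(words).most_common(top)


ocr_text = "The union met. The UNION voted; the union's vote passed."
print(word_frequencies(ocr_text, top=2))
```

Rare words that appear only once are often a quick way to spot OCR mistakes worth fixing by hand.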
If you don’t have Photoshop, you may find GIMP, its open source alternative, useful for performing visual redactions on PDFs and other documents.
Audacity is an audio toolkit that allows you to splice audio to your liking. I find it’s the perfect tool for editing interviews that may contain off-the-record statements.
Be aware that these types of edits are non-destructive, meaning that metadata, project history, and original artifacts can be uncovered by forensic analysis. Using GIMP and Audacity is a great way to perform audio and visual redactions, but you should still take care to flatten your media before publishing by “jumping the analog hole” and using Exiftool to verify you’ve done it correctly.
Although there are a number of really great software tools to help understand, manipulate, and scrub metadata, nothing is perfect. As we explored in the previous section, digital forensic specialists might still be able to uncover bits of history from the bytes in any digital artifact. One creative way to be sure that original metadata is inaccessible is to recreate the original through “the analog hole”.
Have you ever bought a bootleg movie? (It’s ok, no judgements!) If you have, you might remember that those movies were created by someone sneaking into the theater with their own camcorder, and simply taping the entire movie from their seat. That’s an example of the analog hole; and you can use similar tactics to create unattributable copies of your original media.
|Images||Take a screenshot from your computer and publish that instead.|
|Video||Use the QuickTime tool to capture a movie from your screen as it plays.|
|Audio||Purchase an audio loopback cable, and play an MP3 directly into a digital recorder. Or, purchase a USB adapter to record audio input directly into your computer.|
|Office Documents/PDFs||Copy the text into a new document. Print the replicated document, and re-scan it into your computer.|
Once again, nothing is ever perfect. Even the analog hole might lead to some trouble. For example, a well-known tactic in the intelligence community is to create several almost-identical copies of the same document, each one containing its own minute typos. That way, if a sensitive document finds itself published in the press, the whistleblower can be identified because the published document contains the tell-tale typo. This is a clear example of the myriad ways a source may still be compromised despite the great consideration and care you have taken to protect their digital assets. Please be mindful of this when working with submissions.
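To see why this tactic survives printing and re-scanning, consider how little effort it takes to tell two such copies apart once their text is known. The Python sketch below is purely illustrative: the sample sentences are invented, and the function name is my own. It compares two versions word by word using the standard library’s `difflib`.

```python
import difflib


def differing_words(copy_a, copy_b):
    """Return (words_in_a, words_in_b) pairs wherever two copies disagree."""
    words_a, words_b = copy_a.split(), copy_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append((
                " ".join(words_a[a_start:a_end]),
                " ".join(words_b[b_start:b_end]),
            ))
    return differences


copy_for_alice = "The committee will recieve the report on Friday"
copy_for_bob = "The committee will receive the report on Friday"
print(differing_words(copy_for_alice, copy_for_bob))
```

The marker lives in the words themselves, not in the file’s metadata, which is why scrubbing tags or jumping the analog hole does nothing to remove it.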