US Press Freedom Tracker Data now available on the decentralized web via IPFS

miguel

Miguel Jacq (he/him), a.k.a. mig5, is a systems administrator based in Melbourne, Australia.

The U.S. Press Freedom Tracker, where we attempt to document virtually every press freedom violation in the country, has for some time made its database of thousands of incidents available for export via the API. We want all the data we’ve collected over the past five years to be available to journalists, advocates, and policy makers for their own analysis.

As an organization committed to helping journalists resist censorship and to ensuring information remains free, we’ve recently been exploring how we can use the decentralized web, and in particular IPFS, to store the vast wealth of information now on the Tracker more permanently. IPFS, for the uninitiated, is an innovative means of distributing data in a way that doesn’t depend on centralized infrastructure (such as a single website).

In some ways similar to torrents, files shared on the IPFS network are mirrored among many nodes. This makes it a protocol particularly resistant to censorship or deletion, and may have other qualities significant to journalists as the internet evolves over time.

To this end, as a proof of concept, we’ve now published the Press Freedom Tracker’s incident database on IPFS. (You can, of course, still use the website API as well.) You can view the database at this IPNS ID:

ipns://k51qzi5uqu5dlnwjrnyyd6sl2i729d8qjv1bchfqpmgfeu8jn1w1p4q9x9uqit

You can view the ID via an IPFS web gateway, such as the one provided by Cloudflare, via a browser extension like IPFS Companion, or via another IPFS client. The file is updated about every hour (more on that below), so you can ensure that the dataset you are downloading is the most current.
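As a rough sketch of the gateway option: path-style gateways generally expose content under /ipfs/<CID>/... and /ipns/<ID>/..., so an ipns:// address can be rewritten into an ordinary HTTPS URL. The helper name to_gateway_url and the choice of ipfs.io as the default gateway are assumptions made for illustration; substitute whichever gateway you prefer.

```python
from urllib.parse import urlparse

# Assumption: ipfs.io is used only as an example of a public path-style gateway.
DEFAULT_GATEWAY = "https://ipfs.io"


def to_gateway_url(uri: str, gateway: str = DEFAULT_GATEWAY) -> str:
    """Rewrite ipfs://<cid>/path or ipns://<id>/path as <gateway>/<scheme>/<id>/path."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("ipfs", "ipns"):
        raise ValueError(f"not an ipfs:// or ipns:// URI: {uri!r}")
    return f"{gateway.rstrip('/')}/{parsed.scheme}/{parsed.netloc}{parsed.path}"


TRACKER_ID = "k51qzi5uqu5dlnwjrnyyd6sl2i729d8qjv1bchfqpmgfeu8jn1w1p4q9x9uqit"
print(to_gateway_url(f"ipns://{TRACKER_ID}"))
```

The resulting URL can be opened in any browser, with no IPFS client installed.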

US Press Freedom Tracker data via IPFS, as viewed in Brave Browser.

A technical deep-dive into IPFS, IPNS, and keeping track of changes to the database

IPFS is an interesting protocol because its content identifiers (CIDs) or ‘hashes’ are cryptographically computed from the content of the file, not its name or other metadata.

This means that every time the file’s content changes, publishing it to IPFS produces a new CID.
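A toy illustration of that idea, using a plain SHA-256 digest as a stand-in for a CID. This is a simplification: real CIDs wrap the hash in extra encoding, and IPFS may split a file into chunks before hashing, so the values printed here are not real CIDs. The function name toy_content_id and the sample rows are made up for the example.

```python
import hashlib


def toy_content_id(data: bytes) -> str:
    """Stand-in for a CID: an identifier derived only from the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


# Made-up sample content, not real Tracker data.
version_1 = b"date,title\n2021-01-01,Example incident\n"
version_2 = version_1 + b"2021-01-02,Another example incident\n"
renamed_copy = bytes(version_1)  # same bytes, imagined under a different file name

# Identical content gives the identical identifier, whatever the file is called.
print(toy_content_id(version_1) == toy_content_id(renamed_copy))
# Any change to the content gives a different identifier.
print(toy_content_id(version_1) == toy_content_id(version_2))
```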

There is nothing in the protocol that maintains any sort of ‘revision’ relationship between the old CID and the new one. It is up to the publisher to keep track of old versions of the file (if that’s important to them). Equally, it’s up to the publisher to tell people which CID is the new one, but it would be annoying to have to keep announcing new CIDs every time the file changes.

For this reason, the ID above is an ‘IPNS’ ID, which always points to the latest version of the folder and its contents, without itself ever having to change. IPNS is a little bit like DNS, in that it’s a sort of static ‘alias’ or pointer to another destination - in this case, the latest IPFS CID of the directory.

To maintain a sort of ‘revision’ log of changes to the incidents.csv database, we also publish a changelog file (incidents-log.csv), which lists the previous CIDs along with a timestamp of when each was published. The last line in the file always refers to the latest version of incidents.csv. You can also fetch the latest file directly (rather than view the directory) by appending the file name to the IPNS ID, for example:

ipns://k51qzi5uqu5dlnwjrnyyd6sl2i729d8qjv1bchfqpmgfeu8jn1w1p4q9x9uqit/incidents.csv

Feel free to look at older CIDs to see what changed, or to consult the changelog file to find out when the latest version was published.
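Reading the changelog could look something like the sketch below. The exact column layout of incidents-log.csv is not described here, so the code assumes one row per publication with a timestamp in the first column and a CID in the second, and no header row; check the real file and adjust. The sample rows use placeholder values, not real CIDs.

```python
import csv
import io
from typing import Tuple


def latest_entry(log_text: str) -> Tuple[str, str]:
    """Return (timestamp, cid) from the last non-empty row of the changelog.

    Assumption: each row is `timestamp,cid`. The real file may differ.
    """
    rows = [row for row in csv.reader(io.StringIO(log_text)) if row]
    if not rows:
        raise ValueError("changelog is empty")
    timestamp, cid = rows[-1][0], rows[-1][1]
    return timestamp, cid


# Placeholder rows for illustration only.
sample_log = (
    "2021-06-01T10:00:00Z,EXAMPLE_CID_OLD\n"
    "2021-06-02T14:00:00Z,EXAMPLE_CID_NEW\n"
)
print(latest_entry(sample_log))
```

An older CID taken from the log can then be fetched on its own (for example as ipfs://<CID>) and compared against the current file.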

How often is the data published to IPFS?

We attempt to publish the latest copy of the database to IPFS every hour, but realistically the database itself changes far less frequently. The database is only published (and the changelog updated) if its content changes.
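The "only publish on change" logic can be sketched as a comparison of content digests between runs. This is an illustration of the idea rather than the Tracker's actual job: the function name publish_if_changed and the publish callback are hypothetical, and how the previous digest is stored between hourly runs is left out.

```python
import hashlib
from typing import Callable, Optional


def publish_if_changed(
    data: bytes,
    last_digest: Optional[str],
    publish: Callable[[bytes], None],
) -> str:
    """Call publish(data) only when the content differs from the last run.

    Returns the digest of `data`, to be remembered for the next run.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest != last_digest:
        publish(data)
    return digest


# Simulate three hourly runs; the list stands in for real publishing.
published = []
digest = publish_if_changed(b"rows v1", None, published.append)
digest = publish_if_changed(b"rows v1", digest, published.append)  # unchanged: skipped
digest = publish_if_changed(b"rows v2", digest, published.append)
print(len(published))
```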

Care to share some code?

We initially tried to use what seems to be the official Python library for working with the IPFS API, but found that it doesn’t seem to support the most recent releases of go-ipfs, and is possibly semi-abandoned.

Fortunately, the go-ipfs service provides its own HTTP RPC API, so we could use Python’s requests module to talk to it.

Publishing a single file to the IPFS API is quite easy, and there are simple examples of how to do it. However, publishing a directory containing files turned out to be a little trickier.

It took a bit of trial and error to work out how to send multiple files in a multipart request with the right tuple values per file, in a way that matched the IPFS API’s documentation, but we got there.
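To give a feel for the shape of such a request, here is a rough sketch; it is not the Tracker's actual code. It assumes a local go-ipfs daemon with its RPC API on the default 127.0.0.1:5001, follows the API documentation's convention of sending the directory as a part with Content-Type application/x-directory and URL-encoding each file's path in the filename, and assumes the add response names the directory entry after the directory itself. The function names and the "tracker" directory name are hypothetical.

```python
import json
from urllib.parse import quote

# Assumption: default address of a local go-ipfs RPC API.
API = "http://127.0.0.1:5001/api/v0"


def build_parts(directory: str, files: dict) -> list:
    """Build the `files=` list for requests: one directory part, then one part per file."""
    parts = [("file", (quote(directory, safe=""), b"", "application/x-directory"))]
    for name, content in files.items():
        # The path separator is URL-encoded, e.g. tracker%2Fincidents.csv
        path = quote(f"{directory}/{name}", safe="")
        parts.append(("file", (path, content, "application/octet-stream")))
    return parts


def directory_cid(response_text: str, directory: str) -> str:
    """Pick the directory's CID out of the newline-delimited JSON returned by /add."""
    for line in response_text.splitlines():
        if line.strip():
            entry = json.loads(line)
            if entry.get("Name") == directory:
                return entry["Hash"]
    raise ValueError(f"no entry for directory {directory!r} in response")


def publish_directory(directory: str, files: dict, key: str = "self") -> str:
    """Add the directory to IPFS, then point the IPNS key at its new CID."""
    import requests  # third-party; imported here so the helpers above need only the stdlib

    added = requests.post(f"{API}/add", files=build_parts(directory, files))
    added.raise_for_status()
    cid = directory_cid(added.text, directory)
    named = requests.post(f"{API}/name/publish", params={"arg": f"/ipfs/{cid}", "key": key})
    named.raise_for_status()
    return cid
```

With a daemon running, a call such as publish_directory("tracker", {"incidents.csv": csv_bytes}) would add both entries in one request and then update the IPNS record.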

For those curious, here’s a sample of what worked for us. Happy hacking!

If you’re looking to install IPFS on a Linux server, we used an Ansible role for that, which worked great.
