A recent federal appeals court ruling may be a big win for data journalists and researchers who depend on scraping — the automated collection of data from websites — to gather the information they report on.
The case, involving the professional social networking giant LinkedIn and the data startup hiQ Labs, has been widely discussed in legal circles and the security community, but may be just as significant for journalists.
The legal controversy centers on a database of information that hiQ had scraped from the portions of LinkedIn profiles that users had set to be publicly accessible on the web. LinkedIn sought to stop hiQ from accessing that data, sending a letter threatening to sue under the Computer Fraud and Abuse Act and on a handful of other grounds. hiQ then challenged that letter in court, and last week the Ninth Circuit in San Francisco ruled that the CFAA likely does not prohibit the scraping of public web pages.
Strictly speaking, the court has not yet definitively ruled on the issues at the heart of hiQ Labs v. LinkedIn. Rather, it said hiQ is likely enough to win on the merits of its argument — including the CFAA question — that LinkedIn must allow the startup to continue scraping while the case continues. It's a bit of a confusing procedural posture, but some takeaways are clear and important.
While CFAA rulings — in this circuit and elsewhere — have become a bit of a contradictory thicket, here the Ninth Circuit digs specifically into the statutory language on authorization, or who can access what. Namely, the CFAA prohibits "exceed[ing] authorized access" to a computer.
The ruling lays out a neat taxonomy of computer information, dividing it into three parts: information for which access is open to the general public and permission is not required; information for which authorization is required and has been given; and information for which authorization is required but has not been given.
Last week’s ruling specifically covers LinkedIn’s publicly available data, which it correctly describes as falling into the first category — one that also includes the vast swath of information available on the public web. LinkedIn had argued that, by sending a cease-and-desist letter, it revoked hiQ’s “authorization” to use the site. The court dispensed with that idea: information that is presumptively available to all requires no special authorization to access, and so there’s no authorization to revoke.
Many journalistic endeavors that involve scraping fall precisely into that first category, which is why the ruling is significant.
Journalists may automate visits to an Inspector General’s web page, to be alerted when there are newly published reports. They may write a script to download all the previous meeting agendas of a community board committee at once, to analyze how often a topic has been discussed. They may back up the online marketing materials for a business they’re reporting on, to monitor whether it quietly makes changes after an expose is published.
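Tasks like the first one can be as simple as a short script: fetch the page, pull out its links, and compare them against the links seen on the last visit. Here is a minimal sketch using only Python's standard library; the page snapshot and report paths are hypothetical, and a real script would fetch the live page (e.g., with `urllib.request`) and adapt the parsing to that site's markup:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def find_new_reports(page_html, seen):
    """Return links on the page that were not present on the last visit."""
    parser = LinkCollector()
    parser.feed(page_html)
    return [link for link in parser.links if link not in seen]

# Example run against a made-up snapshot of an agency's reports page.
sample = ('<a href="/reports/2019-07.pdf">July report</a>'
          '<a href="/reports/2019-08.pdf">August report</a>')
previously_seen = {"/reports/2019-07.pdf"}
print(find_new_reports(sample, previously_seen))  # -> ['/reports/2019-08.pdf']
```

A script like this, run on a schedule, would alert a reporter only when something new appears — the same pattern extends to the agenda-download and change-monitoring examples above.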
For journalists and researchers, web scraping — and other mechanisms of automating computer usage — can be an invaluable source of raw data, but has occasionally hit legal friction, especially around the CFAA.
- First Look Media and a collection of academics described bodies of research they were creating, using techniques that include scraping, in a constitutional challenge to the CFAA. First Look’s portion of the complaint was dismissed in 2018, but the ACLU continues to represent plaintiffs who investigate whether sites are systematically discriminating against certain classes of users.
- A broad coalition of journalists and researchers, represented by the Knight First Amendment Institute, challenged Facebook to establish a "safe harbor" from CFAA prosecution for activities that include scraping the site or creating temporary accounts. Some courts have allowed plaintiffs to argue that terms-of-service violations also constitute CFAA violations, so the group asked specifically for Facebook to clarify its terms to explicitly allow that behavior.
- A team of ProPublica reporters scraped some 80,000 criminal records to compare with data collected through public records requests in order to analyze patterns of discrimination in one component of the prison parole process. Julia Angwin, a Pulitzer Prize-winning member of that reporting team, later said, “The CFAA criminalizes practically everything I do in my reporting.”
While these examples don’t deal exclusively with the sort of entirely public web content addressed in the hiQ ruling, they demonstrate both the power of scraping as a tool, and the peril of the CFAA as a threat.
By taking the common-sense position that these activities are “not analogous to ‘breaking-and-entering,’” last week’s ruling provides legal cover for the myriad journalistic uses of public web scraping against the dark cloud of the CFAA.