The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, metadata extracts, and text extracts.

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program.
You can download the files free of charge over HTTP(S) or directly from S3.

As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.

  • [ARC] s3://commoncrawl/crawl-001/ – Crawl #1 (2008/2009)
  • [ARC] s3://commoncrawl/crawl-002/ – Crawl #2 (2009/2010)
  • [ARC] s3://commoncrawl/parse-output/ – Crawl #3 (2012)
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ – Summer 2013
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2013-48/ – Winter 2013
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-10/ – March 2014
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-15/ – April 2014
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-23/ – July 2014
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-35/ – August 2014
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-41/ – September 2014
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-42/ – October 2014
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-49/ – November 2014
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2014-52/ – December 2014
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-06/ – January 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-11/ – February 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-14/ – March 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-18/ – April 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-22/ – May 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-27/ – June 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-32/ – July 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-35/ – August 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-40/ – September 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2015-48/ – November 2015
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-07/ – February 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-18/ – April 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-22/ – May 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-26/ – June 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-30/ – July 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-36/ – August 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-40/ – September 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-44/ – October 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2016-50/ – December 2016
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-04/ – January 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-09/ – February 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-13/ – March 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-17/ – April 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-22/ – May 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-26/ – June 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-30/ – July 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-34/ – August 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-39/ – September 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-43/ – October 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-47/ – November 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2017-51/ – December 2017
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-05/ – January 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-09/ – February 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-13/ – March 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-17/ – April 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-22/ – May 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-26/ – June 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-30/ – July 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-34/ – August 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-39/ – September 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-43/ – October 2018
  • [WARC] s3://commoncrawl/crawl-data/CC-MAIN-2018-47/ – November 2018

For all crawls since 2013, the data has been stored in the WARC file format and also includes metadata (WAT) and text (WET) extracts. We also provide file path lists for the segments and for the WARC, WAT, and WET files; these can be found under each crawl prefix at CC-MAIN-YYYY-WW/[segment|warc|wat|wet].paths.gz, where the trailing two digits in each crawl name are the ISO week of the year.

By replacing s3://commoncrawl/ with https://commoncrawl.s3.amazonaws.com/ in each path, you can obtain the HTTPS URL for any of the files stored on S3.
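
As a concrete illustration, the short Python sketch below fetches the WET path list for one of the crawls above and prints the HTTPS URL of the first few files. It is a minimal sketch using only the standard library and assumes network access; the crawl name CC-MAIN-2018-47 is simply one entry from the list above.

import gzip
import urllib.request

CRAWL = "CC-MAIN-2018-47"  # any crawl listed above will work
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def to_https(s3_path):
    # Swap the S3 scheme and bucket for the public HTTPS endpoint.
    return s3_path.replace("s3://commoncrawl/", HTTP_PREFIX, 1)

# The path lists are small gzip-compressed text files, one path per line.
paths_url = HTTP_PREFIX + "crawl-data/" + CRAWL + "/wet.paths.gz"
with urllib.request.urlopen(paths_url) as response:
    paths = gzip.decompress(response.read()).decode("utf-8").splitlines()

for relative_path in paths[:3]:
    print(to_https("s3://commoncrawl/" + relative_path))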

Data Format

Common Crawl currently stores the crawl data using the Web ARChive (WARC) format.
Before the summer 2013 crawl, the data was stored in the ARC file format.
The WARC format allows for more efficient storage and processing of Common Crawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.
This document aims to give you an introduction to working with the WARC format, specifically the differences between:

  • WARC files which store the raw crawl data
  • WAT files which store computed metadata for the data stored in the WARC
  • WET files which store extracted plaintext from the data stored in the WARC

If you want all the nitty-gritty details, the best source is the ISO standard, for which the final draft is available.
If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET, and WARC files.

WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This includes not only the response body (what you would get if you downloaded the file directly) but also the HTTP header information, which can be used to glean a number of interesting insights.
In the example below, we can see the crawler contacted http://news.bbc.co.uk/2/hi/africa/3414345.stm and received an HTML page in response. We can also see that the page was served by the Apache web server, that caching details were set, and that the server attempted to set a cookie (shortened for display here).

Full WARC extract

WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:...>
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:...>
WARC-Concurrent-To: <urn:uuid:...>
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length

HTTP/1.1 200 OK
Server: Apache
Vary: X-CDN
Cache-Control: max-age=0
Content-Type: text/html
Date: Sat, 02 Aug 2014 09:52:13 GMT
Expires: Sat, 02 Aug 2014 09:52:13 GMT
Connection: close
Set-Cookie: BBC-UID=...; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk;

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>
	BBC NEWS | Africa | Namibia braces for Nujoma exit
</title>
...
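
If you would rather explore WARC records from Python than through the Java/Hadoop examples, the sketch below uses the third-party warcio library (installable with pip install warcio), which is not part of the examples described in this document; the filename is a placeholder for any WARC file you have downloaded.

from warcio.archiveiterator import ArchiveIterator

# 'example.warc.gz' is a placeholder for a locally downloaded WARC file.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Skip the request, metadata and warcinfo records; keep HTTP responses.
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        status = record.http_headers.get_statuscode()
        content_type = record.http_headers.get_header("Content-Type")
        print(status, content_type, uri)
        # The raw payload (e.g. the HTML shown above) is available as bytes.
        body = record.content_stream().read()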

WAT Response Format

WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, you can use one of the many JSON pretty print tools available.

The HTTP response metadata is most likely to be of interest to Common Crawl users. The skeleton of the JSON format is outlined below.

Envelope
  WARC-Header-Metadata
    WARC-Target-URI [string]
    WARC-Type [string]
    WARC-Date [datetime string]
    ...
  Payload-Metadata
    HTTP-Response-Metadata
      Headers
        Content-Language
        Content-Encoding
        ...
      HTML-Metadata
        Head
          Title [string]
          Link [list]
          Metas [list]
        Links [list]
      Headers-Length [int]
      Entity-Length [int]
      ...
    ...
  ...
Container
  Gzip-Metadata [object]
  Compressed [boolean]
  Offset [int]

As an example in Python, if we parse the JSON into a data object, we can easily pull out interesting information from the BBC article:

Full WAT extract

>>> data['Envelope']['WARC-Header-Metadata']['WARC-Type']
"response"
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['Headers']['Server']
"Apache"
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Head']['Title']
" BBC NEWS | Africa | Namibia braces for Nujoma exit "
>>> len(data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'])
42
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'][28]
{"path": "A@/href", "title": "Home of BBC Sport on the internet", "url": "http://news.bbc.co.uk/sport1/hi/default.stm"}

WET Response Format

As many tasks only require textual information, the Common Crawl dataset provides WET files that contain only extracted plaintext. The way this textual data is stored is quite simple: each record consists of a short WARC-style header, containing details such as the target URL and the length of the plaintext, followed immediately by the plaintext itself.

Full WET extract

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:...>
WARC-Refers-To: <urn:uuid:...>
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724

BBC NEWS | Africa | Namibia braces for Nujoma exit
...
President Sam Nujoma works in very pleasant surroundings in the small but beautiful old State House...
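
Reading WET files follows the same pattern. The sketch below, which again assumes the third-party warcio library and a placeholder filename, prints the target URL and the first line of the extracted plaintext (the page title in the example above) for each page.

from warcio.archiveiterator import ArchiveIterator

# 'example.warc.wet.gz' is a placeholder for a locally downloaded WET file.
with open("example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # The extracted plaintext lives in records of type 'conversion'.
        if record.rec_type != "conversion":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        first_line = text.splitlines()[0] if text else ""
        print(uri, "-", first_line)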

Processing the file format

We maintain introductory examples on GitHub for several programming languages and big data processing frameworks.

For each of these platforms, the examples describe how to:

  • Count the number of times various tags are used across HTML on the internet using the WARC files
  • Count the number of different server types found in the HTTP headers using the WAT files
  • Execute a word count over the extracted plaintext found in the WET files (a single-machine sketch of this task follows the list)
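
As a point of reference for the third task, here is a minimal single-machine word count over a WET file in Python, reusing the warcio loop from the WET sketch above; the GitHub examples perform the equivalent computation inside their respective frameworks. The filename is again a placeholder.

import re
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

counts = Counter()
# 'example.warc.wet.gz' is a placeholder for a locally downloaded WET file.
with open("example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":
            continue
        text = record.content_stream().read().decode("utf-8", errors="replace")
        counts.update(re.findall(r"\w+", text.lower()))

print(counts.most_common(10))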

If you’re using a different programming language or prefer to work with another processing framework, there are a number of open source libraries that handle processing WARC files and the content therein.

More tools and libraries can be found on the list of Awesome Web Archiving utilities maintained by the IIPC.