Common Crawl

Amazing project, that basically makes a more searchable Wayback Machine.

A bit hard to use their data though, partly due to size, but also lack of free to use querrying mechanisms, and how obtuse Amazon S3 is to use.

Notably, aws-cli with an account is the only reliable way, everything else is way too broken, e.g. trying the to check the an index index.commoncrawl.org/CC-MAIN-2023-06/ very often 500s.

But still, their projct is amazing.

The only out-of-the-box search they seem to have is: urlsearch.commoncrawl.org/ for domains/URLs. It is good, but there could be so much more... notably IPs.

Also could should document the data shape a bit better.

Sample sizes can be found at: commoncrawl.org/2023/04/mar-apr-2023-crawl-archive-now-available/

To explore the data, after login:

aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2013-20/

Copy the toplevel directory only:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/ . --recursive --exclude "*/*"

Copy some wet/wat files:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz .
aws s3 sync s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz .

Directory structrure:

cc-index.paths.gz (1K)
cc-index-table.paths.gz (1K)

segment.paths.gz (1.7K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/
crawl-data/CC-MAIN-2013-20/segments/1368696381630/

index.html (2.3K)

wat.paths.gz (98K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wat/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wat.gz

wet.paths.gz (98K) Sample lines:

crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/wet/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.wet.gz

warc.paths.gz (99K)

crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00001-ip-10-60-113-184.ec2.internal.warc.gz

segments: directgory with actual data

1368696381249: one of many segments, any meaning of name?

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wet.gz (142M, 334M unzipped)

A tiny bit of metadata, and then plaintext content from the website, e.g. the second one:

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://004eeb5.netsolhost.com/stephensilver.htm
WARC-Date: 2013-05-18T08:11:02Z
WARC-Record-ID: <urn:uuid:773b31ba-ddc6-47a5-ae24-d08141b9944d>
WARC-Refers-To: <urn:uuid:4b1bdbff-4926-4ced-86f6-072f5bb3837a>
WARC-Block-Digest: sha1:LQFSCR2LIJQYMPTXRHWU7HAPQTVSYS3A
Content-Type: text/plain
Content-Length: 12046

Stephen Silver is a journalist and editor who specializes in the areas of politics, pop culture, film and sports. He works as an editor with the North American Publishing Co. and as a film critic with The Trend, a local newspaper in the Philadelphia area.

No IP unfortunately.

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.wat.gz (329M, 1.4G unzipped)

A lot of JSON metadata and no contents as desired. Contains IP! Some entries however are humongous with a ton of useless data, that's what bloats these so much:

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
WARC-Date: 2013-11-22T14:51:12Z
WARC-Record-ID: <urn:uuid:ec54e493-8965-41be-b344-07596cc30b3a>
WARC-Refers-To: <urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>
Content-Type: application/json
Content-Length: 1180

{"Envelope":{"Format":"WARC","WARC-Header-Length":"274","Block-Digest":"sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR","Actual-Content-Length":"372","WARC-Header-Metadata":{"WARC-Type":"warcinfo","WARC-Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz","WARC-Date":"2013-11-22T14:51:12Z","Content-Length":"372","WARC-Record-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Type":"application/warc-fields"},"Payload-Metadata":{"Trailing-Slop-Length":"0","Actual-Content-Type":"application/warc-fields","Actual-Content-Length":"372","Headers-Corrupt":true,"WARC-Info-Metadata":{"robots":"classic","software":"Nutch 1.6 (CC)/CC WarcExport 1.0","description":"Wide crawl of the web with URLs provided by Blekko for Spring 2013","hostname":"ip-10-60-113-184.ec2.internal","format":"WARC File Format 1.0","isPartOf":"CC-MAIN-2013-20","operator":"CommonCrawl Admin","publisher":"CommonCrawl"}}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"453","Header-Length":"10","Inflated-CRC":"866052549","Inflated-Length":"650"},"Offset":"0","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions
WARC-Date: 2013-05-18T05:48:54Z
WARC-Record-ID: <urn:uuid:d519658f-7a63-46c1-849b-4cd92332ddb8>
WARC-Refers-To: <urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>
Content-Type: application/json
Content-Length: 1501

{"Envelope":{"Format":"WARC","WARC-Header-Length":"433","Block-Digest":"sha1:B2B6JDSGWCUQIIUGV54SXEE25RX4SANS","Actual-Content-Length":"302","WARC-Header-Metadata":{"WARC-Type":"request","WARC-Date":"2013-05-18T05:48:54Z","WARC-Warcinfo-ID":"<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>","Content-Length":"302","WARC-Record-ID":"<urn:uuid:cefd363b-1fec-4590-8305-4c6fab2e095f>","WARC-Target-URI":"http://%20jwashington@ap.org/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions","WARC-IP-Address":"165.1.125.44","Content-Type":"application/http; msgtype=request"},"Payload-Metadata":{"Trailing-Slop-Length":"4","HTTP-Request-Metadata":{"Headers":{"Accept-Language":"en-us,en-gb,en;q=0.7,*;q=0.3","Host":"ap.org","Accept-Encoding":"x-gzip, gzip, deflate","User-Agent":"CCBot/2.0","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},"Headers-Length":"300","Entity-Length":"0","Entity-Trailing-Slop-Bytes":"0","Request-Message":{"Method":"GET","Version":"HTTP/1.0","Path":"/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions"},"Entity-Digest":"sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"},"Actual-Content-Type":"application/http; msgtype=request"}},"Container":{"Compressed":true,"Gzip-Metadata":{"Footer-Length":"8","Deflate-Length":"455","Header-Length":"10","Inflated-CRC":"453539965","Inflated-Length":"739"},"Offset":"453","Filename":"CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"}}

Let's beautify one of them to see it better:


{
  "Envelope": {
    "Format": "WARC",
    "WARC-Header-Length": "274",
    "Block-Digest": "sha1:JCZOI4V3UOTXGIRLFMPLW4J2WPLAKGVR",
    "Actual-Content-Length": "372",
    "WARC-Header-Metadata": {
      "WARC-Type": "warcinfo",
      "WARC-Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz",
      "WARC-Date": "2013-11-22T14:51:12Z",
      "Content-Length": "372",
      "WARC-Record-ID": "<urn:uuid:cfeff436-7c4c-4119-aaa4-ec2ce27ad3e1>",
      "Content-Type": "application/warc-fields"
    },
    "Payload-Metadata": {
      "Trailing-Slop-Length": "0",
      "Actual-Content-Type": "application/warc-fields",
      "Actual-Content-Length": "372",
      "Headers-Corrupt": true,
      "WARC-Info-Metadata": {
        "robots": "classic",
        "software": "Nutch 1.6 (CC)/CC WarcExport 1.0",
        "description": "Wide crawl of the web with URLs provided by Blekko for Spring 2013",
        "hostname": "ip-10-60-113-184.ec2.internal",
        "format": "WARC File Format 1.0",
        "isPartOf": "CC-MAIN-2013-20",
        "operator": "CommonCrawl Admin",
        "publisher": "CommonCrawl"
      }
    }
  },
  "Container": {
    "Compressed": true,
    "Gzip-Metadata": {
      "Footer-Length": "8",
      "Deflate-Length": "453",
      "Header-Length": "10",
      "Inflated-CRC": "866052549",
      "Inflated-Length": "650"
    },
    "Offset": "0",
    "Filename": "CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
  }
}

Fuck no IP addresses either. But other entries do have it, why not this one?

The reason these can be huge is the HTML-Metadata section which contain all outlinks! gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat-L34

CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz ()

Obtain:

aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz .

Table of contents 579 2
- Common Crawl Athena Common Crawl 4
- Common Crawl web graph Common Crawl 68

Common Crawl

 Ancestors (9)

 Incoming links (3)