CIA 2010 covert communication websites / Wayback Machine CDX scanning

The Wayback Machine has an endpoint for querying crawled pages, called the CDX server. It is documented at: github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md.
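As a rough illustration of what a CDX query looks like, the sketch below builds a query URL for one domain and parses a JSON response. It assumes the public endpoint `https://web.archive.org/cdx/search/cdx` and the `url`, `output=json` and `limit` parameters described in the CDX server README. The function names `build_cdx_url` and `parse_cdx_json` are made up for this example and are not part of the repository's scripts.

```python
import json
from urllib.parse import urlencode

# Assumed public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, limit=5):
    """Build a CDX query URL for a single domain (hypothetical helper)."""
    params = {"url": domain, "output": "json", "limit": limit}
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(text):
    """Turn an output=json response body into a list of dicts.

    With output=json the server returns a list of rows, where the first
    row holds the field names and the remaining rows are captures.
    An empty body or an empty list is treated as "no captures".
    """
    if not text.strip():
        return []
    rows = json.loads(text)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]
```

A domain with no captures gives an empty list, so the length of the result is enough to decide whether the domain is worth a closer look.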

This makes it possible to filter tens of thousands of candidate domains in a few hours. Hundreds of thousands would be too many, because you have to query exactly one URL at a time, and the server possibly rate limits by IP. Still, there has been no IP blacklisting so far after several hours of scanning, so it's not that bad.
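The one-URL-at-a-time constraint can be sketched as a sequential loop with a pause between requests. This is only an illustration under stated assumptions, not the logic of cdx.sh: the endpoint URL, the one second default delay, and the names `fetch_cdx` and `scan_domains` are all hypothetical choices made for this example.

```python
import time
import urllib.request
from urllib.parse import urlencode


def fetch_cdx(domain, timeout=30):
    """Fetch at most one capture row for a domain (assumed endpoint)."""
    query = urlencode({"url": domain, "output": "json", "limit": 1})
    url = "https://web.archive.org/cdx/search/cdx?" + query
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def scan_domains(domains, fetch=fetch_cdx, delay=1.0, sleep=time.sleep):
    """Query domains one by one, pausing `delay` seconds between requests.

    Returns (archived, failed): domains with at least one capture, and
    domains whose request raised a network error so they can be retried.
    """
    archived, failed = [], []
    for i, domain in enumerate(domains):
        if i > 0:
            sleep(delay)  # stay polite in case the server rate limits
        try:
            body = fetch(domain)
        except OSError:  # URLError, HTTPError and timeouts derive from OSError
            failed.append(domain)
            continue
        if body.strip() not in ("", "[]"):
            archived.append(domain)
    return archived, failed
```

Under the assumed rate of roughly one request per second, 10,000 domains take about 2.8 hours, while 100,000 would take more than a day, which matches the orders of magnitude described above.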

Once you have a heuristic to narrow down some domains, you can use this helper: cia-2010-covert-communication-websites/cdx.sh to narrow them from tens of thousands down to hundreds or thousands.

We then post-process the results of cdx.sh with cia-2010-covert-communication-websites/cdx-post.sh to narrow them from thousands down to dozens, and manually inspect everything.

From then on, you can just manually inspect for hits in your browser.

Table of contents
- Wayback Machine CDX scanning with Tor parallelization
- JS CDX scanning
