CIA 2010 covert communication websites / Wayback Machine CDX scanning with Tor parallelization

Dire times require dire methods: cia-2010-covert-communication-websites/cdx-tor.sh.

First we must start the tor servers with the tor-army command from: stackoverflow.com/questions/14321214/how-to-run-multiple-tor-processes-at-once-with-different-exit-ips/76749983#76749983

tor-army 100

and then run the script on a newline-separated list of domain names to check:

./cdx-tor.sh infile.txt

This creates a directory infile.txt.cdx/ containing:

infile.txt.cdx/out00, out01, etc.: the suspect CDX lines found by each Tor instance, based on the simple criteria that the CDX API can handle directly. We split the input domains into 100 piles, and give one pile to each Tor instance.
infile.txt.cdx/out: the final combined CDX output of out00, out01, ...
infile.txt.cdx/out.post: the final output containing only domain names that match further CLI criteria that cannot be easily encoded in the CDX query. This is the cleanest domain name list, and the one you should actually look into at the end.
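
The pile-per-instance idea can be sketched as below. This is not the actual cdx-tor.sh, just a minimal illustration under stated assumptions: the function names, the SOCKS base port and the urlkey filter are all hypothetical, and the real script may use different ports and criteria. It relies on GNU split and on curl's --socks5-hostname option.

```shell
# Sketch only, NOT the real cdx-tor.sh. Function names, the base port and
# the CDX filter below are assumptions made for illustration.

# Split a newline-separated domain list into n piles without breaking lines
# (GNU split). Produces outdir/in00, outdir/in01, ...
split_piles() {
  infile=$1 n=$2 outdir=$3
  mkdir -p "$outdir"
  split -d -n "l/$n" "$infile" "$outdir/in"
}

# Query the Wayback Machine CDX API for one domain through one Tor SOCKS port.
# The filter regex is a guess at "simple criteria the CDX can handle directly".
cdx_query() {
  domain=$1 port=$2
  curl -s -G --socks5-hostname "127.0.0.1:$port" \
    --data-urlencode "url=$domain" \
    --data-urlencode "matchType=domain" \
    --data-urlencode 'filter=urlkey:.*(cgi-bin|\.jar|\.swf).*' \
    "http://web.archive.org/cdx/search/cdx"
}

# Give pile i to the Tor instance assumed to listen on base_port + i,
# run all piles in parallel, then merge the per-instance outputs.
run_piles() {
  outdir=$1 n=$2 base_port=$3
  i=0
  while [ "$i" -lt "$n" ]; do
    suffix=$(printf '%02d' "$i")
    port=$((base_port + i))
    (
      while read -r domain; do
        cdx_query "$domain" "$port"
      done <"$outdir/in$suffix" >"$outdir/out$suffix"
    ) &
    i=$((i + 1))
  done
  wait
  cat "$outdir"/out[0-9]* >"$outdir/out"
}

# Hypothetical usage, assuming Tor instances on SOCKS ports 9050 to 9149:
#   split_piles infile.txt 100 infile.txt.cdx
#   run_piles infile.txt.cdx 100 9050
```

Since each Tor instance exits through a different IP, each pile is throttled independently, which is the whole point of the split.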

Since archive.org is so abysmal in its data access (something like a Google BigQuery would solve our issues in seconds), we have to come up with creative ways of getting around their IP throttling.

The CIA doesn't play fair. They're actually the exact opposite of fair. So neither shall we.

Distilled into an answer at: stackoverflow.com/questions/14321214/how-to-run-multiple-tor-processes-at-once-with-different-exit-ips/76749983#76749983

This should allow a full sweep of the 4.5M records in 2013 DNS Census virtual host cleanup in a reasonable amount of time. After JAR/SWF/CGI filtering we obtained 5.8k domains, so a reduction factor of about 800 (4.5M / 5.8k) with likely very few losses. Not bad.

5.8k is still a bit annoying to go over in full however, so we can also count the CDX hits for each domain and drop anything with too many hits, since the CIA websites have very few archives:

cd 2013-dns-census-a-novirt-domains.txt.cdx
./cdx-tor.sh -d out.post domain-list.txt
cd out.post.cdx
cut -d' ' -f1 out | uniq -c | sort -k1 -n | awk 'match($2, /([^,]+),([^)]+)/, a) {printf("%s.%s %d\n", a[2], a[1], $1)}' > out.count

This gives us something like:

12654montana.com 1
aeronet-news.com 1
atohms.com 1
av3net.com 1
beechstreetas400.com 1

sorted by increasing hit counts, so we can go down as far as patience allows for!
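
To make the counting step concrete, here is a rough portable variant of it, plus a cutoff filter. It is a sketch rather than the command used above: cdx_count and keep_rare are hypothetical helper names, it avoids the gawk-only three-argument match() by using split(), and unlike the original pipeline it groups hits per host (ignoring the path) and reverses every label of the urlkey, so a key such as uk,co,example)/ would come out as example.co.uk.

```shell
# Sketch only: helper names are hypothetical, behaviour differs slightly
# from the one-liner above (counts are per host, not per full urlkey).

# Reads CDX lines on stdin, prints "domain count" sorted by increasing count.
cdx_count() {
  cut -d' ' -f1 |
    awk -F')' '{ print $1 }' |
    sort | uniq -c | sort -k1,1n |
    awk '{
      # $1 is the count, $2 is the reversed host, e.g. "com,example"
      n = split($2, p, ",")
      host = p[n]
      for (i = n - 1; i >= 1; i--) host = host "." p[i]
      print host, $1
    }'
}

# Reads "domain count" lines on stdin, keeps domains with at most $1 hits.
keep_rare() {
  awk -v max="$1" '$2 <= max { print $1 }'
}

# Hypothetical usage on the combined CDX output:
#   cdx_count <out >out.count
#   keep_rare 3 <out.count
```

The cutoff passed to keep_rare is arbitrary: pick whatever leaves a list short enough to inspect by hand.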

New results from a full CDX scan of 2013-dns-census-a-novirt.csv:

219.90.61.123 journeystravelled.com
