Let' see if there's anything in records/mx.xz.
mx.csv is 21GB.
They do have
"
in the files to escape commas so:mx.pyWould have been better with csvkit: stackoverflow.com/questions/36287982/bash-parse-csv-with-quotes-commas-and-newlines
import csv
import sys
writer = csv.writer(sys.stdout)
with open('mx.csv', 'r') as f:
reader = csv.reader(f)
for row in reader:
writer.writerow([row[0], row[3]])
then:
# uniq not amazing as there are often two or three slightly different records repeated on multiple timestamps, but down to 11 GB
python3 mx.py | uniq > mx-uniq.csv
sqlite3 mx.sqlite 'create table t(d text, m text)'
# 13 GB
time sqlite3 mx.sqlite ".import --csv --skip 1 'mx-uniq.csv' t"
# 41 GB
time sqlite3 mx.sqlite 'create index td on t(d)'
time sqlite3 mx.sqlite 'create index tm on t(m)'
time sqlite3 mx.sqlite 'create index tdm on t(d, m)'
# Remove dupes.
# Rows: 150m
time sqlite3 mx.sqlite <<EOF
delete from t
where rowid not in (
select min(rowid)
from t
group by d, m
)
EOF
# 15 GB
time sqlite3 mx.sqlite vacuum
Let's see what the hits use:
awk -F, 'NR>1{ print $2 }' ../media/cia-2010-covert-communication-websites/hits.csv | xargs -I{} sqlite3 mx.sqlite "select distinct * from t where d = '{}'"
At around 267 total hits, only 84 have MX records, and from those that do, almost all of them have exactly:with only three exceptions:We need to count out of the totals!which gives, ~18M, so nope, it is too much by itself...
smtp.secureserver.net
mailstore1.secureserver.net
dailynewsandsports.com|dailynewsandsports.com
inews-today.com|mail.inews-today.com
just-kidding-news.com|just-kidding-news.com
sqlite3 mx.sqlite "select count(*) from t where m = 'mailstore1.secureserver.net'"
Let's try to use that to reduce where
av.sqlite
from 2013 DNS Census virtual host cleanup a bit further:time sqlite3 mx.sqlite '.mode csv' "attach 'aiddcu.sqlite' as 'av'" '.load ./ip' "select ipi2s(av.t.i), av.t.d from av.t inner join t as mx on av.t.d = mx.d and mx.m = 'mailstore1.secureserver.net' order by av.t.i asc" > avm.csv
avm
stands for av
with mx
pruning. This leaves us with only ~500k entries left. With one more figerprint we could do a Wayback Machine CDX scanning scan.Let's check that we still have most our hits in there:At 267 hits we got 81, so all are still present.
grep -f <(awk -F, 'NR>1{print $2}' /home/ciro/bak/git/media/cia-2010-covert-communication-websites/hits.csv) avm.csv
secureserver is a hosting provider, we can see their blank page e.g. at: web.archive.org/web/20110128152204/http://emmano.com/. security.stackexchange.com/questions/12610/why-did-secureserver-net-godaddy-access-my-gmail-account/12616#12616 comments:
secureserver.net is the name GoDaddy use as the reverse DNS for IP addresses used for dedicated/virtual server hosting