ID photo of Ciro Santilli taken in 2013 right eyeCiro Santilli OurBigBook logoOurBigBook.com  Sponsor 中国独裁统治 China Dictatorship 新疆改造中心、六四事件、法轮功、郝海东、709大抓捕、2015巴拿马文件 邓家贵、低端人口、西藏骚乱
ns.csv is 57 GB. This file is too massive, working with it is a pain.
We can also cut down the data a lot with stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column/76605540#76605540 and tld filtering:
awk -F, 'BEGIN{OFS=","} { if ($1 != last) { print $1, $3; last = $1; } }' ns.csv | grep -E '\.(com|net|info|org|biz),' > nsu.csv
This brings us down to a much more manageable 3.0 GB, 83 M rows.
Let's just scan it once real quick to start with, since likely nothing will come of this venue:
grep -f <(awk -F, 'NR>1{print $2}' ../media/cia-2010-covert-communication-websites/hits.csv) nsu.csv | tee nsu-hits.csv
cat nsu-hits.csv | csvcut -c 2 | sort | awk -F. '{OFS="."; print $(NF-1), $(NF)}' | sort | uniq -c | sort -k1 -n
As of 267 hits we get:
      1 a2hosting.com
      1 amerinoc.com
      1 ayns.net
      1 dailyrazor.com
      1 domainingdepot.com
      1 easydns.com
      1 frienddns.ru
      1 hostgator.com
      1 kolmic.com
      1 name-services.com
      1 namecity.com
      1 netnames.net
      1 tonsmovies.net
      1 webmailer.de
      2 cashparking.com
     55 worldnic.com
     86 domaincontrol.com
so yeah, most of those are likely going to be humongous just by looking at the names.
The smallest ones by far from the total are: frienddns.ru with only 487 hits, all others quite large or fake hits due to CSV. Did a quick Wayback Machine CDX scanning there but no luck alas.
Let's check the smaller ones:
inews-today.com,2013-08-12T03:14:01,ns1.frienddns.ru
source-commodities.net,2012-12-13T20:58:28,ns1.namecity.com -> fake hit due to grep e-commodities.net
dailynewsandsports.com,2013-08-13T08:36:28,ns3.a2hosting.com
just-kidding-news.com,2012-02-04T07:40:50,jns3.dailyrazor.com
fightwithoutrules.com,2012-11-09T01:17:40,sk.s2.ns1.ns92.kolmic.com
fightwithoutrules.com,2013-07-01T22:46:23,ns1625.ztomy.com
half-court.net,2012-09-10T09:49:15,sk.s2.ns1.ns92.kolmic.com
half-court.net,2013-07-07T00:31:12,ns1621.ztomy.com
Doubt anything will come out of this.
Let's do a bit of counting out of the total:
grep domaincontrol.com ns.csv | awk -F, '{print $1}' | uniq | wc
gives ~20M domain using domaincontrol. Let's see how many domains are in the first place:
awk -F, '{print $1}' ns.csv | uniq | wc
so it accounts for 1/4 of the total.

Ancestors (13)

  1. DNS Census 2013
  2. Data sources
  3. Methodology
  4. CIA 2010 covert communication websites
  5. Central Intelligence Agency
  6. Intelligence agency
  7. Secret service
  8. Espionage
  9. War
  10. Social science
  11. Scientific method
  12. Science
  13. Home