D'oh.
But to be serious: the Wayback Machine contains a very large proportion of all sites. It is the most complete database we have found so far. Some archives are very broken, but those are rare.
The only problem with the Wayback Machine is that there is no known efficient way to query its archives across domains. You have to have a domain in hand for CDX queries: Wayback Machine CDX scanning.
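To make the "domain in hand" constraint concrete, here is a minimal sketch of a per-domain CDX query against the public Wayback Machine CDX server. The helper names (`build_cdx_url`, `parse_cdx_rows`, `query_cdx`), the chosen output fields and the `example.com` domain are illustrative assumptions, not the exact tooling used for this research. Note that the `url` parameter is mandatory, which is precisely why cross-domain searching is not possible this way.

```python
import json
import urllib.parse
import urllib.request

# Public CDX server endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, limit=10, year_from=None, year_to=None):
    """Build a CDX query URL for one known domain (hypothetical helper)."""
    params = {
        "url": domain,          # required: there is no way to omit the domain
        "matchType": "domain",  # also match subdomains of `domain`
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "limit": str(limit),
    }
    if year_from is not None:
        params["from"] = str(year_from)
    if year_to is not None:
        params["to"] = str(year_to)
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_cdx_rows(rows):
    """Turn CDX JSON output (header row followed by data rows) into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def query_cdx(domain, **kwargs):
    """Fetch and parse the captures for one domain. Requires network access."""
    url = build_cdx_url(domain, **kwargs)
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read().decode("utf-8")
    # An empty body means no captures were found.
    return parse_cdx_rows(json.loads(body) if body.strip() else [])


if __name__ == "__main__":
    for capture in query_cdx("example.com", limit=5, year_from=2010, year_to=2013):
        print(capture["timestamp"], capture["statuscode"], capture["original"])
```

Any bulk scan therefore has to start from a list of candidate domains obtained elsewhere and call something like `query_cdx` once per domain.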
The Common Crawl project attempts in part to address this lack of queryability, but we haven't managed to extract any hits from it.
CDX + 2013 DNS Census + heuristics, however, has been fruitful.