commoncrawl.org/web-graphs
In 2017 Common Crawl apparently started publishing its own web graphs, i.e. they parse the HTML of the crawled pages and extract the graph of what links to what.
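To make "extract the graph of what links to what" concrete, here is a minimal sketch that uses only the Python standard library. It is not Common Crawl's actual pipeline: the names `LinkExtractor` and `host_edges`, the `example` hosts, and the choice to collapse pages into a host-level graph are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def host_edges(page_url, html):
    """Return the set of (source_host, target_host) edges for one page."""
    parser = LinkExtractor()
    parser.feed(html)
    source = urlparse(page_url).hostname
    edges = set()
    for href in parser.hrefs:
        # Resolve relative and protocol-relative links against the page URL.
        target = urlparse(urljoin(page_url, href)).hostname
        # Skip links without a host (e.g. mailto:) and links to the same host.
        if target and target != source:
            edges.add((source, target))
    return edges


# Made-up page content, for illustration only.
sample_html = """
<a href="https://b.example/page">b</a>
<a href="/local">same host</a>
<a href="mailto:me@a.example">mail</a>
<a href="//c.example/">c</a>
"""
edges = host_edges("https://a.example/index.html", sample_html)
print(sorted(edges))
```

Running this over every page of a crawl and taking the union of the edge sets would give a host-level link graph, which is the kind of object the rest of this section is about.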
Such a web graph is exactly what we need for an open implementation of PageRank.
Edit: actually, they already calculate PageRank for us!!! Fantastic!!! Main section: Section "Common Crawl web graph official PageRank".
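For reference, here is a minimal sketch of what computing PageRank from a link graph looks like, using plain power iteration on a tiny in-memory edge list. This is not Common Crawl's implementation: the function name `pagerank`, the toy graph, the damping factor of 0.85 and the fixed 50 iterations are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Every other node splits its damped rank equally among its targets.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Toy graph, made up for illustration: "c" receives the most links.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(toy_edges)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

A dictionary-based loop like this only works for toy inputs. A graph with the number of nodes found in a real web crawl needs a compressed representation and a much more careful implementation.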
The graphs are dumped in BVGraph format.
A quick exploration of the graph can be seen at: stackoverflow.com/questions/31321009/best-more-standard-graph-representation-file-format-graphson-gexf-graphml/79467334#79467334
Their source code is at: github.com/commoncrawl/cc-webgraph
