Ciro Santilli

That website has all the strings -n20 strings. We can obtain the whole thing and clean it up a bit with:
wget -O all.html
cp all.html all-recode.html
recode html..ascii all-recode.html
awk '!seen[$0]++' all-recode.html > all-uniq.html
The awk step is there to skip the gazillion repeated "mined by" messages.
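As a minimal sketch of what that awk idiom does, the snippet below runs it on a few made-up lines (the sample strings are illustrative, not taken from the blockchain). The `seen` array counts how many times each whole line has appeared, and a line is printed only when its count is still zero, so the first occurrence is kept and the original order is preserved.

```shell
# Made-up sample input with repeated lines.
deduped=$(printf '%s\n' \
  'mined by example' \
  'hello' \
  'mined by example' \
  'world' \
  'hello' | awk '!seen[$0]++')

# Print the deduplicated lines, first occurrences only.
printf '%s\n' "$deduped"
```

Unlike `sort -u`, this does not reorder the lines, which matters if you want to keep the strings in file order.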
A lot of the stuff on that website appears to be cut off at the 20-character mark. As shown in Force of Will, this is possibly because they didn't use -w in strings -n20, and the text after the newlines was less than 20 characters long.
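The effect of -w can be sketched on a small made-up file, assuming GNU binutils strings. Without -w, a newline ends the current string, so a short line following a long one falls under the -n20 threshold and is dropped. With -w, newlines count as part of the string, so the short line survives as part of a longer run. The file contents and variable names here are illustrative only.

```shell
# Temporary file with one long line followed by one short line.
tmp=$(mktemp)
printf 'first line is well over twenty characters\nshort tail\n' > "$tmp"

# Default: the newline splits the text, and "short tail" (10 chars) is dropped.
without_w=$(strings -n20 "$tmp")

# With -w: newlines are part of the string, so both lines form one long string.
with_w=$(strings -n20 -w "$tmp")

printf 'without -w:\n%s\n\nwith -w:\n%s\n' "$without_w" "$with_w"
rm -f "$tmp"
```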
That website can be replicated by downloading the Bitcoin blockchain locally, then:
cd .bitcoin/blocks
for f in blk*.dat; do strings -n20 -w "$f" | awk '!seen[$0]++' > "${f%.dat}.txt"; done
tail -n+1 *.txt
Remove most of the binary crap:
head -n-1 *.txt | grep -e '[. ]' | grep -iv 'mined by' | less
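To illustrate the two grep filters in isolation, here is a small sketch on made-up lines (none of them are real blockchain data). The first grep keeps only lines that contain a period or a space, which is a rough heuristic for human-readable text. The second drops anything mentioning "mined by", case-insensitively.

```shell
# Made-up sample: one readable line, one binary-looking line, one miner tag.
sample=$(printf '%s\n' \
  'Hello, world. This is readable text' \
  'x9kQ2#zL0@pW8vB5rT7yU' \
  'Mined by examplepool')

# Keep lines with a '.' or ' ', then drop the "mined by" lines.
filtered=$(printf '%s\n' "$sample" | grep -e '[. ]' | grep -iv 'mined by')

printf '%s\n' "$filtered"
```

The heuristic is crude: binary noise that happens to contain a space or period still gets through, which is why it only removes "most" of it.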