A simple HTML web page can be downloaded very easily for sending and viewing offline afterwards:
$ wget http://www.example.com/page.html
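For example, such a single file can be queued for a neighbour right away with nncp-file (remote.node here is the same neighbour name used in the tarball example below):

$ nncp-file page.html remote.node:page.html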
But most web pages contain links to images, CSS and JavaScript files that are required for complete rendering. GNU Wget can parse such documents and understand page dependencies. You can download the whole page together with its dependencies the following way:
$ wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --restrict-file-names=ascii \
    --span-hosts \
    --random-wait \
    --execute robots=off \
    http://www.example.com/page.html
That will create a www.example.com directory with all the files necessary to view the page.html web page. You can create a single-file compressed tarball of that directory and send it to the remote node:
$ tar cf - www.example.com | zstd | nncp-file - remote.node:www.example.com-page.tar.zst
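On the receiving side, once nncp-toss has processed the incoming packet, the tarball can be unpacked and viewed offline; the exact path to the received file depends on your incoming directory configuration:

$ zstd -d < www.example.com-page.tar.zst | tar xf -

Then open www.example.com/page.html in any browser.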
But there are multi-page articles and whole interesting sites that you may want to get in a single package. You can mirror the whole web site by utilizing wget's recursive feature:
$ wget \
    --recursive \
    --timestamping \
    -l inf \
    --no-remove-listing \
    --no-parent [...] \
    http://www.example.com/
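The resulting mirror directory can be packed and queued for the remote node exactly like the single page above (a sketch, reusing the remote.node neighbour name):

$ tar cf - www.example.com | zstd | nncp-file - remote.node:www.example.com-mirror.tar.zst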
There is a standard for creating Web ARChives: WARC. Fortunately, wget supports it as an output format too.
$ wget [--page-requisites] [--recursive] \
    --warc-file www.example.com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-keep-log --no-warc-digests \
    [--no-warc-compression] [--warc-max-size=XXX] \
    [...] http://www.example.com/
That command will create a www.example.com-XXX.warc web archive. It can also produce specialized segmented gzip- and Zstandard-compressed archives that are friendly to indexing and searching. Alternatively, you can use the even simpler crawl utility, which is also written in Go. To conveniently index, browse and extract those archives, I can advise my own tofuproxy software (written in Go as well).
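A WARC is already a single file, so it can be handed to nncp-file directly, or compressed with zstd first as with the tarballs above (a sketch, reusing the remote.node neighbour name and the XXX timestamp placeholder):

$ nncp-file www.example.com-XXX.warc remote.node:www.example.com-XXX.warc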