Download a complete website using wget

In some rare cases you might run into the need to download a complete website onto your local machine for offline usage. There are many GUI tools with loads of options to do this. But did you know that you can do it with a single shell command, using a utility that you most likely already have installed?

Depending on your needs, downloading the HTML content of the website is only half of the job. You will also need all the images, stylesheets and other linked media files. Linux already provides a command line tool that can do this, and on many distributions this little tool might already be installed. The tool is called wget(1).

In case wget is not installed yet, install the package “wget” from whatever package repository your distribution uses.
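
On most distributions this is a single package manager command; for example (adjust to your distribution's package manager):

$ sudo apt install wget    # Debian, Ubuntu and derivatives
$ sudo dnf install wget    # Fedora, RHEL and derivatives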

$ wget --mirror --convert-links --backup-converted --page-requisites --continue --adjust-extension --restrict-file-names=windows   https://www.tinned-software.net 2>&1 | tee wget_download.log | grep "\-\-"
--2020-07-08 16:01:56--  https://www.tinned-software.net/
Last-modified header missing -- time-stamps turned off.
--2020-07-08 16:01:57--  https://www.tinned-software.net/robots.txt
--2020-07-08 16:01:57--  https://www.tinned-software.net/favicon.ico
--2020-07-08 16:01:57--  https://www.tinned-software.net/css/style_screen.css?v=3
...

Once wget is installed, the command above will download a whole website. This is what the arguments used above mean:

  • --mirror Turn on options suitable for mirroring. This option turns on recursion and time-stamping and sets infinite recursion depth.
  • --convert-links This option will rewrite any links so the downloaded content can be browsed locally.
  • --backup-converted This option creates a backup of the downloaded content before converting it.
  • --page-requisites This option causes wget to download all the files that are necessary to properly display a given HTML page.
  • --continue Continue the download of already partially downloaded files.
  • --adjust-extension Ensures a file extension matching the content type delivered by the server is used to store the file.
  • --restrict-file-names=windows Ensures the filenames do not contain any characters not allowed in filenames. Windows is used here as it is a bit more restrictive than “unix”.

The last two arguments are getting more and more important as many modern URLs don’t include a file extension (and some even use non-ASCII characters).
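
To illustrate --adjust-extension with a hypothetical URL (the path /some-article below is made up for this example): a page served as text/html from a URL without an extension is stored with an .html suffix appended, so it opens correctly in a local browser.

$ wget --adjust-extension https://www.tinned-software.net/some-article
# a text/html response is stored as "some-article.html" instead of "some-article"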

The output is piped into tee(1) to write it to a file as well as to the console. The grep(1) at the end is only used to show the files being downloaded.
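
Because the full output ends up in wget_download.log, it can also be inspected after the run; for example (assuming an English locale so wget prints “saved”):

$ grep -e '--' wget_download.log     # list the requested URLs again
$ grep -c 'saved' wget_download.log  # rough count of files written to disk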

The result is a local copy of the website with all links rewritten so they are not pointing to the website but to the local files on disk.
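
By default wget places the mirror in a directory named after the host, so the local copy can be opened directly from there, for example on a desktop system:

$ xdg-open www.tinned-software.net/index.html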

Tip

Check out the manpage of wget(1) as there are many more settings that might be useful. You can define the depth of links to follow, the number of retries if a page could not be downloaded, and more.
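
A minimal sketch of such a variant, with placeholder values for the depth and retry settings:

$ wget --recursive --level=3 --tries=5 --wait=1 --page-requisites \
    --convert-links https://www.tinned-software.net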

Observations

While testing the above command, I noticed a few rare cases where the link rewriting does not catch all the possible ways the resource URLs are constructed. In most cases, though, this produces a full copy of the website which can be opened and navigated offline.


Read more of my posts on my blog at https://blog.tinned-software.net/.
