How to mirror a Moodle site with wget

06 Sep

If you are a student who has to work with Moodle at university or on a course, you may be thinking about downloading all the pages, blocks, documents, attachments… everything to your hard disk. But obviously you want to download it automatically, not by hand. As an admin this is easy, but with the student role it is more difficult.

This post is for you 😉

 

Working with wget

wget has almost everything you need to mirror a complete site, as long as the site has neither form authentication (problem 1) nor dynamically created links (problem 2).

Taking a look at the wget man page, it's easy to find the right options to download a complete site, that is, to mirror it:
wget -m -E -k http://moodlesite.com

-m: Turn on options suitable for mirroring. It is currently equivalent to ‘-r -N -l inf --no-remove-listing’.
-E: If a file of type ‘application/xhtml+xml’ or ‘text/html’ is downloaded and the URL does not end with the regexp ‘\.[Hh][Tt][Mm][Ll]?’, this option will cause the suffix ‘.html’ to be appended to the local filename.
-k: After the download is complete, convert the links in the document to make them suitable for local viewing.

But on a Moodle site, the only thing this command gets is the login screen, because the site uses a form-based login (not HTTP authentication) that you must fill in to enter as your user.

Authenticating over the form

After trying some HTTP techniques, I finally found this FAQ: Using wget – How do I use wget to download pages or files that require login/password? With some digging through the HTML I found that I need to get a cookie first and then use it afterwards.

Getting the cookie

wget --load-cookies my-cookies.txt \
--post-data='username=YOUR_USERNAME&password=YOUR_PASSWORD&testcookies=1' \
--save-cookies=my-cookies.txt --keep-session-cookies http://moodlesite.com/

Obviously you must replace YOUR_USERNAME and YOUR_PASSWORD with the appropriate login information. Note that we need --keep-session-cookies to tell wget to save session cookies in the file; by default wget doesn’t save them.
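One thing to watch: if your password contains characters such as & or =, pasting it straight into --post-data will break the form fields. The sketch below is only an illustration; urlencode is a hypothetical helper of mine (not a wget feature), and the example password is made up. It percent-encodes every byte, which is verbose but always valid.

```shell
# urlencode (hypothetical helper): percent-encode every byte of its argument.
# Encoding even plain letters is verbose, but the result is always valid.
urlencode() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../%&/g'
}

MOODLE_USER='YOUR_USERNAME'   # placeholder, as in the command above
MOODLE_PASS='p&ss=word'       # made-up password with awkward characters

# Build the form body with both values safely encoded.
post_data="username=$(urlencode "$MOODLE_USER")&password=$(urlencode "$MOODLE_PASS")&testcookies=1"
printf '%s\n' "$post_data"
```

You would then pass the result with --post-data="$post_data" instead of typing the credentials inline.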

Using the cookie

wget --load-cookies my-cookies.txt --keep-session-cookies --save-cookies my-cookies.txt --referer=http://moodlesite.com/login/index.php -m -E -k 'http://moodlesite.com/course/view.php?id=XXX'

We have added the options that load the saved cookie and write it back in case it changes.

But… it will not work completely yet…

Problems with the content

I lost a lot of time because of the content. After an hour I realised that the first link on the page is ‘Logout’… and wget crawled it, invalidating the session!! So you must reject some URLs to be able to download the site. After some trial and error I finally decided to use this:

wget --load-cookies my-cookies.txt --keep-session-cookies --save-cookies my-cookies.txt \
--referer=http://moodlesite.com/login/index.php -m -E -k \
--reject 'logout*,*cal_m*,*cal_y*,post.php*,*subscribe*,help.php*,enrol.php*' \
--exclude-directories=/calendar 'http://moodlesite.com/course/view.php?id=XXX'

I’ve excluded:

  • logout* to prevent the session from being invalidated
  • cal_m, cal_y and the directory /calendar, because the month navigation is a dynamically generated link. This means you can view the calendar from year 200 until year 32516, which is a lot of downloading time.
  • post.php* to avoid posting any replies, *subscribe* to avoid subscribing to all the news, help.php* to skip the online help and enrol.php* to avoid enrolling in the activities
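To get a feeling for what these patterns catch, here is a rough sketch that uses the shell's own case globbing as a stand-in for wget's wildcard matching. It assumes the patterns are compared against the file name with its query string; the sample names are invented, not taken from a real Moodle site, and is_rejected is a hypothetical helper.

```shell
# is_rejected NAME (hypothetical helper): succeed when NAME matches one of
# the reject patterns. Shell globbing only approximates wget's matching.
is_rejected() {
    case $1 in
        logout*|*cal_m*|*cal_y*|post.php*|*subscribe*|help.php*|enrol.php*) return 0 ;;
        *) return 1 ;;
    esac
}

# Invented sample names, written the way a crawler would see them.
for name in 'logout.php?sesskey=abc' \
            'view.php?view=month&cal_m=3&cal_y=2014' \
            'post.php?reply=5' \
            'view.php?id=12'
do
    if is_rejected "$name"; then
        echo "skip  $name"
    else
        echo "keep  $name"
    fi
done
```

Only the last name, a plain course page, should come out as "keep".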

With this, and a lot of time, you will be able to mirror the site. The problem is that the reject list only means that wget will not keep the rejected file on disk; it WILL still download and analyze it to find more links to crawl. This is why wget will try to download every month of every year available in Moodle. At 2 seconds per month, that is 9 days of downloading garbage. And there’s no way to change this behaviour of wget through a parameter or configuration.
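The 9 days figure can be checked with a bit of shell arithmetic. This is just a back-of-the-envelope sketch using the numbers from this post (years 200 to 32516, 2 seconds per month); the variable names are mine.

```shell
first_year=200        # earliest year the calendar offers (from this post)
last_year=32516       # latest year the calendar offers (from this post)
seconds_per_month=2   # rough download time per calendar page

months=$(( (last_year - first_year) * 12 ))
seconds=$(( months * seconds_per_month ))
days=$(( (seconds + 43200) / 86400 ))   # add half a day to round to nearest

echo "$months calendar pages, about $days days"
```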

The solution

But with free software we always have a solution!! You can download the wget source code and change one line. In the src/recur.c file, go to line 365 and change it from (or apply this patch):

if (descend)

to:

if (descend && acceptable (file)) //TRM

Save the file, compile it (./configure && make && sudo make install) and that’s all! With this change and the last command, wget will no longer analyze the rejected pages to find links, which solves the calendar problem.
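Since the line number moves between wget versions, it may be safer to make the change by matching the text instead of counting lines. The following is only a sketch under my own assumptions: patch_descend is a hypothetical helper, and it assumes the line ends exactly with "if (descend)" and appears once in the file. It refuses to do anything otherwise, so check the result by hand before compiling.

```shell
# patch_descend FILE (hypothetical helper): rewrite the single line ending in
# "if (descend)" so that rejected files are no longer parsed for links.
patch_descend() {
    file=$1
    # Count matching lines first; bail out unless there is exactly one.
    count=$(grep -c 'if (descend)$' "$file")
    if [ "$count" -ne 1 ]; then
        echo "expected exactly 1 matching line, found $count" >&2
        return 1
    fi
    # Write to a temporary copy, then replace the original.
    sed 's/if (descend)$/if (descend \&\& acceptable (file))/' "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Usage, from the top of the wget source tree:
#   patch_descend src/recur.c
```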

  1. Razvan York 13/04/2014

    Thank you for this. I wanted to update and say this also works with wget 1.15 (the latest) although the line that is changed is on 362 (and not 365). I was able to successfully download my class’s moodle directory in its entirety with this command:
    wget --load-cookies cookies.txt --keep-session-cookies --save-cookies cookies.txt \
    --referer=https://moodlesitehere.com -m -E -k \
    --reject logout*,*cal_m*,*cal_y*,post.php*,*subscribe*,help.php*,enrol.php* \
    --exclude-directories=/calendar https://moodlesitehere.com/course/view.php?id=xxx \
    --no-check-certificate
    I had stored the cookies for the moodle site in cookies.txt by using the FireFox save cookies plugin. Thanks a ton!

  2. Tomàs Reverter 13/04/2014

    You’re welcome! 😀

  3. Ben S 10/05/2014

    Thanks! This was really helpful. I should have Googled for Moodle specifically; I stumbled onto this just looking for why I was having trouble with wget. The command I ended up with (adding the -p option and rejecting “flaginappropriate”) was:
    wget --load-cookies my-cookies.txt --keep-session-cookies --save-cookies my-cookies.txt \
    --referer=http://moodlesite.com/login/index.php -p -m -E -k \
    --reject logout*,*cal_m*,*cal_y*,post.php*,*subscribe*,help.php*,enrol.php*,*flaginappropriate* \
    --exclude-directories=/calendar http://moodlesite.com/course/