If you are a student who has to work with Moodle at university or on a course, you may be thinking about downloading all the pages, blocks, documents, attachments… everything to your hard disk. But obviously you want to download it automatically, not by hand. As an admin this is easy, but with the student role it’s more difficult.
This post is for you 😉
Working with wget
wget has almost everything you need to mirror a complete site, as long as the site has no form authentication (problem 1) and no dynamically created links (problem 2).
A look at the wget man page is enough to find the right options to download a complete site, that is, to mirror it.
wget -m -E -k http://moodlesite.com
-m: Turn on options suitable for mirroring. It is currently equivalent to ‘-r -N -l inf --no-remove-listing’
-E: If a file of type ‘application/xhtml+xml’ or ‘text/html’ is downloaded and the URL does not end with the regexp ‘\.[Hh][Tt][Mm][Ll]?’, this option will cause the suffix ‘.html’ to be appended to the local filename.
-k: After the download is complete, convert the links in the document to make them suitable for local viewing.
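For readability, the same command can be written with wget's long option names. This is only a small sketch: the function name mirror_site is made up for illustration, and --adjust-extension is the long name of -E in wget 1.12 and later (older versions call it --html-extension).

```shell
# Hypothetical helper: the mirror command above, spelled with long options.
mirror_site() {
    # --mirror           is the long form of -m
    # --adjust-extension is the long form of -E (wget 1.12 and later)
    # --convert-links    is the long form of -k
    wget --mirror --adjust-extension --convert-links "$1"
}
```

It would be called as: mirror_site http://moodlesite.com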
But on a Moodle site, the only thing this command will get is the login screen, because the site uses a login form (not HTTP authentication) that you must fill in to enter with your user.
Authenticating over the form
After trying some HTTP techniques I finally found this FAQ: Using wget – How do I use wget to download pages or files that require login/password? With some searching through the HTML I found that I need to get a cookie first, and after that I need to use it.
Getting the cookie
wget --load-cookies my-cookies.txt \
--post-data 'username=YOUR_USERNAME&password=YOUR_PASSWORD' \
--save-cookies=my-cookies.txt --keep-session-cookies http://moodlesite.com/login/index.php
Obviously you must replace YOUR_USERNAME and YOUR_PASSWORD with the appropriate login information. Note that we need --keep-session-cookies to tell wget to save session cookies in the file; by default wget doesn’t save them.
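Before starting a long mirror it may be worth checking that the login really left a session cookie in the file. The sketch below assumes that Moodle's session cookie name starts with MoodleSession (a site can configure a different name or suffix) and that wget wrote the file in the usual tab-separated Netscape format; the function name has_moodle_session is made up for illustration.

```shell
# Hypothetical check: succeed if the cookie file holds a cookie whose name
# starts with "MoodleSession" (assumed name of Moodle's session cookie).
has_moodle_session() {
    [ -r "$1" ] || return 1
    # Netscape cookie files are tab-separated and the cookie name is field 6.
    # Lines starting with '#' are comments and are skipped.
    awk -F '\t' '!/^#/ && $6 ~ /^MoodleSession/ { found = 1 }
                 END { exit !found }' "$1"
}
```

For example: has_moodle_session my-cookies.txt && echo "logged in"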
Using the cookie
wget --load-cookies my-cookies.txt --keep-session-cookies --save-cookies my-cookies.txt --referer=http://moodlesite.com/login/index.php -m -E -k 'http://moodlesite.com/course/view.php?id=XXX'
We have added the options to use the saved cookies and to update them if anything changes.
But… it will not work completely yet…
Problems with the content
I lost a lot of time because of the content. After an hour I realised that the first link on the page is ‘Logout’… and wget crawled it, invalidating the session!! So you must reject some URLs to be able to download the site. After some trial and error I finally decided to use this:
wget --load-cookies my-cookies.txt --keep-session-cookies --save-cookies my-cookies.txt \
--referer=http://moodlesite.com/login/index.php -m -E -k
- logout* to prevent the session from being invalidated
- cal_m, cal_y and the directory /calendar, because the month navigation uses dynamically generated links. This means you can view the calendar from year 200 until 32516. That is a lot of downloading time.
- post.php* to avoid replying to anything, subscribe* to avoid subscribing to all the news, help.php to skip the online help and enrol.php* to avoid enrolling in the activities
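Putting the list above together, the complete command could look roughly like the sketch below. The exact wildcards are my assumption, not a tested recipe: -R (--reject) takes a comma-separated list of file name patterns and -X (--exclude-directories) skips a directory, but the patterns may need adjusting for your Moodle version. The function name mirror_course is made up for illustration.

```shell
# Hypothetical helper: mirror one Moodle course while skipping the URLs that
# would log us out or send wget into the calendar.
mirror_course() {
    course_url=$1
    cookies=${2:-my-cookies.txt}

    # Reject patterns taken from the list above; the wildcards are a guess.
    reject='logout*,*cal_m*,*cal_y*,post.php*,subscribe*,help.php*,enrol.php*'

    wget --load-cookies "$cookies" --keep-session-cookies --save-cookies "$cookies" \
         --referer=http://moodlesite.com/login/index.php \
         -m -E -k \
         -R "$reject" -X /calendar \
         "$course_url"
}
```

It would be called as: mirror_course 'http://moodlesite.com/course/view.php?id=XXX'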
With this, and a lot of time, you will be able to mirror the site. The problem is that the reject list only means wget will not save the rejected files to disk; it WILL still analyze them to find more links to crawl. This is why wget will try to download every month of every available year in Moodle. At 2 seconds per month, that is 9 days of downloading garbage. And there’s no way to change this behaviour of wget through a parameter or configuration.
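The 9 days figure can be checked with a bit of arithmetic: the calendar runs from year 200 to 32516, each year has 12 month pages, and each page takes about 2 seconds. A quick sketch (the function name calendar_crawl_days is made up for illustration):

```shell
# Estimate how many days wget would spend crawling the calendar.
calendar_crawl_days() {
    # $1 = first year, $2 = last year, $3 = seconds per month page
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { months = (b - a) * 12; printf "%.1f\n", months * s / 86400 }'
}

calendar_crawl_days 200 32516 2   # prints 9.0
```

That is 387792 month pages and 775584 seconds, which is just under 9 days.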
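The only way around it is to patch the wget source code so that it descends into a page only when the file also passes the accept/reject rules:
if (descend && acceptable (file)) //TRM
Save the file, compile it (./configure && make && sudo make install) and that’s all! With this change and the last command, wget will not analyze the rejected pages to find links, which solves the calendar problem.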