The previous example used a single file. This example shows that
the process is just as simple when the book is divided into
several documents. We will use The Linux System Administrators'
Guide by Lars Wirzenius and Joanna Oja as our test object.
After unpacking the files into a separate directory (we will use /tmp/sag/ for this example), we find several HTML documents and a number of GIF images. We look at the images to find out what bit depth to use and how large they are, so we know whether they will be scaled down. In this case all images are black and white, so we can use the default bit depth of 1. A few of the images are quite large, so the parser will scale them down to 150x250. If we really want these images in full size, we can change the HTML document so that it links to the image instead of including it inline. For overview-kernel.gif, in place of the inline image tag
we would use,
<A HREF="overview-kernel.gif" BPP=1 MAXWIDTH=700 MAXHEIGHT=700>overview-kernel.gif</A>
Then we can tap on the link to the image when we want to view it.
This is only possible when we have access to the document and can
edit it. To support this in a more transparent way, the parser
should be able to do it automatically in the future.
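The manual steps above (checking each GIF's size and bit depth, then turning oversized inline images into links) could be scripted along the following lines. This is only a rough sketch and not part of Plucker: the helper names gif_info, is_oversized and link_images are made up for illustration, the 150x250 limit is simply the figure quoted above, and the regular expression assumes IMG tags with a double-quoted SRC attribute. The bit depth is read from the size of the GIF's global colour table, which is an upper bound on the colours actually used.

```python
import re
import struct

# Scaling limit quoted in the text; the parser's real limit may differ.
MAX_WIDTH, MAX_HEIGHT = 150, 250


def gif_info(data):
    """Return (width, height, bits_per_pixel) from the first bytes of a GIF."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Logical screen descriptor: width and height as little-endian 16-bit
    # integers, followed by a packed flags byte.
    width, height, packed = struct.unpack("<HHB", data[6:11])
    # Bit 7 flags a global colour table; bits 0-2 hold (bits per pixel - 1).
    bpp = (packed & 0x07) + 1 if packed & 0x80 else None
    return width, height, bpp


def is_oversized(width, height):
    """True when the image exceeds the assumed scaling limit."""
    return width > MAX_WIDTH or height > MAX_HEIGHT


# Assumption: SRC values are double-quoted; unquoted values are not matched.
IMG_TAG = re.compile(r'<IMG\b[^>]*?\bSRC="([^"]+)"[^>]*>', re.IGNORECASE)


def link_images(html, oversized, bpp=1, max_size=700):
    """Replace inline <IMG> tags for the named images with links."""
    def replace(match):
        src = match.group(1)
        if src not in oversized:
            return match.group(0)  # leave small images inline
        return ('<A HREF="%s" BPP=%d MAXWIDTH=%d MAXHEIGHT=%d>%s</A>'
                % (src, bpp, max_size, max_size, src))
    return IMG_TAG.sub(replace, html)
```

To use it, read the first 11 bytes of each GIF in /tmp/sag/, collect the names for which is_oversized returns true, and pass that set to link_images together with the text of each HTML document. Work on a copy of the files, since the rewrite changes the documents.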
Since we are not interested in any external documents, we will use the --stayonhost option together with a high maximum depth. That way we don't have to work out exactly how deep to follow links, or which external links to filter out with exclusion lists. Now we are ready to build the database,
% Spider.py -v --stayonhost -M5 -H file:/tmp/sag/index.html -N "Linux Admin Guide" -f DB/SAG
Working for pluckerdir /home/pilot/.plucker
Processing file:/tmp/sag/index.html. 0 collected, 0 still to do
Retrieved ok
:
Processing file:/tmp/sag/backup-timeline.gif. 73 collected, 0 still to do
Retrieved ok
Writing out collected data...
Writing db 'Linux Admin Guide' to file /home/pilot/.plucker/SAG.pdb
Converted file:/tmp/sag/book1.html
:
Converted file:/tmp/sag/x89.html
Wrote 1 <= plucker:/~special~/index
Wrote 2 <= file:/tmp/sag/index.html
Wrote 3 <= plucker:/~special~/pluckerlinks
Wrote 11 <= file:/tmp/sag/backup-timeline.gif
:
Wrote 83 <= mailto:email@example.com
Wrote 87 <= plucker:/~special~/links1
Done!
Install the database you find in /home/pilot/.plucker/DB and you have instant access to The Linux System Administrators' Guide.
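To see why staying on the host and allowing a generous maximum depth removes the need for exclusion lists, it can help to picture the spider as a breadth-first walk over links. The sketch below is a conceptual illustration only and makes no claim about how Spider.py is implemented: crawl, same_host and the get_links callback are hypothetical names, and counting the start page as depth 1 is an assumption.

```python
from collections import deque
from urllib.parse import urlsplit


def same_host(url, start_url):
    """True when both URLs share a scheme and host (file: URLs have no host)."""
    a, b = urlsplit(url), urlsplit(start_url)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


def crawl(start_url, get_links, max_depth, stay_on_host=True):
    """Breadth-first walk from start_url, returning URLs in visit order.

    get_links is a hypothetical callback mapping a URL to the URLs it
    links to. The start page counts as depth 1 (an assumption).
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 1)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # collect this page but do not follow its links
        for link in get_links(url):
            if link in seen:
                continue  # already queued or visited
            if stay_on_host and not same_host(link, start_url):
                continue  # external link: skipped without any exclusion list
            seen.add(link)
            queue.append((link, depth + 1))
    return order
```

With the host check in place, the depth limit only has to be large enough to reach every local page, so a high value is safe: the walk ends when the local pages run out rather than wandering off to other sites.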