
4.5.1.2 An Admin Guide

The previous example used a single file. This example shows that the process is just as simple when the book is divided into several documents. We will use The Linux System Administrators' Guide by Lars Wirzenius and Joanna Oja as our test object.

After unpacking the files into a separate directory (we will use /tmp/sag/ for this example), we find several HTML documents and a number of GIF images. We check the images to find out what bit depth to use and how large they are, so we know whether they will be scaled down. In this case all images are black and white, so we can use the default bit depth of 1. A few of the images are quite large, so the parser will scale them down to 150x250. If we really want these images in full size, we can change the HTML document so that it links to the image instead of including it inline, i.e. instead of,

<IMG SRC="overview-kernel.gif">

we would use,

<A HREF="overview-kernel.gif" BPP=1 MAXWIDTH=700 MAXHEIGHT=700>overview-kernel.gif</A>

Then we can tap on the link whenever we want to view the image. This only works when we are able to edit the document; to support it in a more transparent way, a future version of the parser should be able to do this automatically.
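Until the parser does this for you, the manual edit can be scripted. The following is only a rough sketch and not part of Plucker: the function names gif_size and link_large_images are made up for this illustration, the 150x250 limit is read as width x height, and the regular expression only handles simple tags of the form shown above. It reads the image dimensions from the GIF header and turns oversized inline images into links with the same attributes as in the example.

```python
import re
import struct

# Inline limit quoted in the text; reading it as width x height is an assumption.
MAX_INLINE_WIDTH = 150
MAX_INLINE_HEIGHT = 250


def gif_size(data):
    """Return (width, height) from the start of a GIF file (hypothetical helper)."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # The logical screen width and height follow the 6-byte signature
    # as two little-endian unsigned 16-bit integers.
    return struct.unpack("<HH", data[6:10])


# Only matches simple tags such as <IMG SRC="name.gif">, as in the example.
IMG_TAG = re.compile(r'<IMG\s+SRC="([^"]+)"\s*>', re.IGNORECASE)


def link_large_images(html, sizes, bpp=1, max_side=700):
    """Replace inline images that exceed the inline limit with links.

    `sizes` maps an image file name to its (width, height).
    Images with unknown size are left inline.
    """
    def replace(match):
        name = match.group(1)
        width, height = sizes.get(name, (0, 0))
        if width <= MAX_INLINE_WIDTH and height <= MAX_INLINE_HEIGHT:
            return match.group(0)  # small enough, keep it inline
        return ('<A HREF="%s" BPP=%d MAXWIDTH=%d MAXHEIGHT=%d>%s</A>'
                % (name, bpp, max_side, max_side, name))

    return IMG_TAG.sub(replace, html)
```

To use it, one would read the first ten bytes of each GIF in /tmp/sag/, build the sizes dictionary with gif_size, and run link_large_images over each HTML file before building the database. Work on a copy of the documents, since the rewrite changes them in place.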

Since we are not interested in any external documents, we will use the --stayonhost option and a high maximum depth. That way we don't have to work out exactly how deep to follow links or which external links to filter out with exclusion lists. Now we are ready to build the database,

% Spider.py -v --stayonhost -M5 -H file:/tmp/sag/index.html -N "Linux Admin Guide" -f DB/SAG

Working for pluckerdir /home/pilot/.plucker
Processing file:/tmp/sag/index.html.
           0 collected, 0 still to do
  Retrieved ok

                    :

Processing file:/tmp/sag/backup-timeline.gif.
           73 collected, 0 still to do
  Retrieved ok

Writing out collected data...
Writing db 'Linux Admin Guide' to file /home/pilot/.plucker/SAG.pdb
Converted file:/tmp/sag/book1.html

                    :

Converted file:/tmp/sag/x89.html
Wrote 1 <= plucker:/~special~/index
Wrote 2 <= file:/tmp/sag/index.html
Wrote 3 <= plucker:/~special~/pluckerlinks
Wrote 11 <= file:/tmp/sag/backup-timeline.gif

                    :

Wrote 83 <= mailto:gregh@sunsite.unc.edu
Wrote 87 <= plucker:/~special~/links1
Done!
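To picture what --stayonhost and -M5 did during this run, here is a small sketch of the kind of decision a spider makes for each link it finds. It is an illustration only and not the actual Spider.py logic: the functions same_host and should_follow are hypothetical, and counting depth as the number of links followed from the home document is an assumption.

```python
from urllib.parse import urlsplit


def same_host(start_url, url):
    """True if both URLs share scheme and host (hypothetical helper)."""
    a, b = urlsplit(start_url), urlsplit(url)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


def should_follow(start_url, url, depth, max_depth=5, stay_on_host=True):
    """Decide whether a link found at the given depth should be fetched.

    `depth` is assumed to count links followed from the home document.
    """
    if depth > max_depth:
        return False  # too far from the home document
    if stay_on_host and not same_host(start_url, url):
        return False  # external document, skipped without an exclusion list
    return True


start = "file:/tmp/sag/index.html"
for link in ("file:/tmp/sag/book1.html", "mailto:gregh@sunsite.unc.edu"):
    print(link, should_follow(start, link, depth=1))
```

With a rule like this, every document under file:/tmp/sag/ is fetched while links such as the mailto address are not followed, which matches the idea of not having to write exclusion lists by hand.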

Install the database you find in /home/pilot/.plucker/DB and you have instant access to The Linux System Administrators' Guide.


The Plucker Team