spider.r
a handy search tool and intro to REBOL
Bots section

21 September '99
by sonofsamiam
Courtesy of Fravia's searchlores.org
very slightly edited
by fravia+
fra_00xx
980921
sonofsamiam
1000
BO
PC
Ahah... a new language for bot building... seems pretty impressive as well...
Will we produce REBOL-bots to battle PERL-bots?
There is a crack, a crack in everything That's how the light gets in
Rating
(X)Beginner ( )Intermediate ( )Advanced ( )Expert

REBOL is a good language for the Internet; I don't know if it's as good as Perl yet, but it's coming along. It's very flexible and easy to work with.
spider.r
a handy search tool and intro to REBOL
Written by sonofsamiam


Introduction
REBOL is an extremely simple but powerful language. It's designed entirely for the Internet, so you don't have to muck around with protocols (unless you want to). What you're going to make is your own (simple) search spider that outputs the results as a formatted html page. Very handy.

Tools required

The REBOL interpreter, available free from the rebol website.

Essay

I noticed that all the bots in the bots section were written in Perl. I like Perl a lot, but when I just want to get something done I usually open up my REBOL interpreter. It's more portable, and generally needs less code to get stuff done. So here's a rebol-bot for a change of pace. One large drawback is that, as a new language, there are few resources for learning it. The tutorials on the rebol website are helpful, but don't go into enough detail for me. Another place to look is rebol.org, a new script repository. That should be enough to get you started, so I won't be going over all the basics here. The complete source code to the bot is at the end.

Oh, I know my programming style sucks; I probably should have just stuck the bosch function in the while loop, but it's clearer to me this way. Apollyloggies to the confused ones.

Some people have shied away from rebol because of its lack of regular expressions, but everything regexes can do can be accomplished with its parse rules, which are just another format for expression matching. There is a great tutorial on the rebol site that shows how to use them. So you can follow what goes on in this essay, here is a summary of the rules:

   |       - specify alternate rule
   [block] - sub-rule grouping
   none    - match nothing (catch on no match)

   12      - repeat pattern 12 times
   1 12    - repeat pattern 1 to 12 times
   0 12    - repeat pattern 0 to 12 times
   some    - repeat one or more times
   any     - repeat zero or more times

   skip    - skip any character (or chars if repeat given)
   to      - advance input to the given string (or char)
   thru    - advance input thru the given string (or char)

   copy    - copy the next match sequence to a variable
   (paren) - evaluate a REBOL expression

   word    - look-up value of a word
   word:   - mark the current input series position
   :word   - set the current input series position
   'word   - (reserved for dialecting support)
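If you already think in regexes, here is what the href-matching rule used later in this essay (match "<a", skip thru "href=", then take either a quoted or an unquoted value) looks like in that notation. This Python sketch is my own analogue for comparison, not part of the original script:

```python
import re

# Quoted href value, or a bare one ending at '>' or whitespace --
# this alternation mirrors the '|' alternate rule in the parse block.
href_rule = re.compile(r'<a[^>]*href=(?:"([^"]*)"|([^">\s]*))', re.IGNORECASE)

def extract_links(html_source):
    """Return every href value found in the page source."""
    return [m.group(1) or m.group(2) for m in href_rule.finditer(html_source)]

extract_links('<a href="http://www.foo.com">foo</a> <a href=bar.htm>bar</a>')
# → ['http://www.foo.com', 'bar.htm']
```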


Now we get to the bot! All the variables are set up ahead of time rather than prompted for. I think it's quicker this way...

;these are the url(s) you want the spider to start the search on...
urls: [
    http://www.foo.com
    http://www.ceb.org/bar.htm
]
;how many levels deep you want the spider to go...the # of links increases
;almost exponentially, so watch out!
deep: 4
;a block of words to search for.
keywords: ["fravia" "REBOL"]
summary_size: 100


urls is a block of urls that you want to start your search on, and deep is how many levels you want to search. For instance, if bar.htm and foo.com between them had links to 7 pages, those 7 pages would be the next level, and all of their links would be searched next.
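In other words, the spider does a breadth-first crawl with a depth cap. For readers more at home in a mainstream language, here is a rough Python sketch of that loop; the function names and the stubbed-out fetch/extract parameters are my own, not part of the original script:

```python
def crawl(start_urls, deep, fetch, extract_links):
    """Breadth-first crawl: each pass replaces `urls` with the links
    collected from the current level, up to `deep` levels down."""
    urls, pages = list(start_urls), []
    for level in range(deep + 1):
        links = []
        for url in urls:
            page = fetch(url)
            pages.append((url, page))
            links.extend(extract_links(page))
        urls = links          # next level: everything linked from this one
    return pages
```

Note the `deep + 1`: the script's `while [level <= deep]` likewise makes deep: 4 mean five passes over the url list.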

level: 0
format: reduce [<html> <head>
    <title> "Search results from " now/date
    {</title></head><body bgcolor=#C0C0C0 text=#001010 vlink=#405040><center>
    <h1>spider_search</h1>by sonofsamiam<table border=1>
    <th bgcolor=#ff0000>url<th bgcolor=#ff0000>keyword<th bgcolor=#ff0000>count<th bgcolor=#ff0000>summary}
]
links: [] ;block to hold the links.
db: []    ;block to store the database for sorting.
out: []   ;block to store the outputted html

This sets up the header for the output file. Notice that sometimes the tags aren't in quotes; that's because rebol has a built-in html tag datatype.

;this is the sample html parser off the rebol home page. works good for me!
html-code: [
    copy tag ["<" thru ">"] (append tags tag) |
    copy txt to "<" (append text txt)
]

html-code is a parse rule: it searches for all the <'s and puts all the tags in one block and all the text in another. It doesn't handle weird html, with tags nested inside tags and such. It will still work fine, but the summary might be lacking.
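For comparison, the same tag/text split can be sketched in Python. This is my own analogue of the parse rule, and it shares the same limitation on malformed html:

```python
def split_html(page):
    """Split page source into a list of tags and the concatenated text
    between them, roughly the way the html-code parse rule does."""
    tags, text, i = [], [], 0
    while i < len(page):
        if page[i] == "<":
            end = page.find(">", i)
            if end < 0:
                break                 # unterminated tag: give up
            tags.append(page[i:end + 1])
            i = end + 1
        else:
            nxt = page.find("<", i)
            if nxt < 0:
                nxt = len(page)
            text.append(page[i:nxt])
            i = nxt
    return tags, "".join(text)
```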

;...Hieronymous Bosch...
;this function slurps the data from the page
bosch: func [page url][
    tags: make block! 100
    text: make string! 8000
    parse page [to "<" some html-code]
    ;get the links
    foreach tag tags [
        if parse tag ["<a" thru "href="
            [{"} copy link to {"} | copy link to ">"]
            to end
        ][append links link]
    ]
    foreach keyword keywords [
        c: 0
        a: text
        while [a: find/tail a keyword][c: c + 1]
        either (c = 0) [
            links: []
        ][
            insert/only db reduce [c url keyword copy/part text summary_size]
        ]
    ]
]

Here is the function (bosch) that is called with the contents of each page. It grabs all the links, searches for the keywords, and then sticks the info into a database, stored in db.
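The keyword-counting idiom (walk the text with find/tail, bumping a counter) translates almost word for word. A Python sketch, with names of my own choosing:

```python
def count_hits(text, keyword):
    """Count occurrences of keyword, scanning with find() the way the
    REBOL `while [a: find/tail a keyword]` loop does."""
    count, pos = 0, 0
    while True:
        pos = text.find(keyword, pos)
        if pos < 0:
            return count
        count += 1
        pos += len(keyword)   # find/tail: resume just past the match

def summarize(text, summary_size=100):
    # copy/part text summary_size: just the first summary_size characters
    return text[:summary_size]
```

One difference to keep in mind: REBOL's find is case-insensitive on strings by default, while Python's str.find is case-sensitive, so an exact port would lowercase both sides first.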

;!_!_!_this_is_where_it_starts_!_!_!
while [level <= deep][
    foreach url urls [bosch read url url]
    urls: links
    links: []
    level: level + 1
]

;sort & format
db: sort db
foreach x db [
    foreach [c u k t] x [
        insert out reduce [<tr> <td> u <td> k <td> c <td> t newline]
    ]
]
insert out format
append out [</table> </center> "thanx for using sonofsamiam's spider!"
    </body> </html>
]

;write the html file
write %spider.htm out
q

I think all this is pretty self-explanatory. The while loop controls what pages it searches, and the rest formats the data into an html page. Then out is written to a file and the interpreter exits. It's speedy as hell, and it's helped my searching a huge amount.
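The sort falls out of the record layout: each db record was inserted count-first, presumably so that comparing records compares counts first. The same trick works in Python, where tuple comparison is also element-wise. This sketch is mine; note that the original gets descending order by prepending each row with insert, while here I simply sort in reverse:

```python
import html

def format_rows(db):
    """Render one table row per record, highest hit count first.
    Each record is (count, url, keyword, summary); count comes first so
    element-wise tuple comparison sorts by hit count."""
    rows = []
    for count, url, keyword, summary in sorted(db, reverse=True):
        rows.append("<tr><td>%s<td>%s<td>%d<td>%s"
                    % (url, keyword, count, html.escape(summary)))
    return "\n".join(rows)
```

(The html.escape call is my addition; the original script drops raw page text straight into the table.)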

I put what I'm looking for into Altavista, and then search those results with my spider. You find info much quicker and easier this way.

Here is a sample output page:

spider_search

by sonofsamiam
url           keyword  count  summary
rt_bot1.htm   fravia   9      rt_bot1.htm The HCUbot: a simple Web Retrieval Bot in Perl The HCUbot: a simple Web Retriev
hunt_01a.htm  fravia   8      Hunting Lesson I _____________________________________________________________________
thanx for using sonofsamiam's spider!

Now, this robot is very simple. If it comes across a bad url, it can die. Also, the search is limited to single words, with no boolean operators or anything. What you should do is make it 'smart'. This isn't hard if you know a little rebol; consider using parse rules for a search string. Anyway, I hope you enjoy this little spider. I've had a lot of fun with it, and I'd be interested in any comments you have on it.
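One easy way to make it 'smart', along the lines suggested above: require a page to contain all of a list of words instead of counting a single word. A minimal Python sketch of boolean AND matching (the function name is mine):

```python
def matches_all(text, words):
    """Boolean AND: the page counts as a hit only if every word occurs
    (lowercased here to mimic REBOL's case-insensitive find)."""
    low = text.lower()
    return all(w.lower() in low for w in words)

matches_all("REBOL bots by fravia", ["fravia", "rebol"])
```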

.~the full code~.

Here is the full source code. It's very small; as you can see, most of the space is taken up with html formatting. REBOL is pretty efficient.

REBOL[
    Title:  "spider.r"
    Author: "sonofsamiam"
    Home:   http://sonofsamiam.tsx.org/
    Date:   19-Sep-1999
    Purpose: {
        A helpful little web-indexing search bot.
        Outputs sorted & html-formatted.
    }
    Comment: {
        I curbed my usual programming style of cramming the entire
        script on 5 lines :p I figure most of the readers won't be
        especially familiar w/ rebol, so i went for clarity.
    }
]

secure none

urls: [
    http://www.rebol.com
    %rt_bot1.htm
    %hunt_01a.htm
]
deep: 4
keywords: ["fravia" "REBOL"]
summary_size: 100

level: 0
format: reduce [<html> <head>
    <title> "Search results from " now/date
    {</title></head><body bgcolor=#C0C0C0 text=#001010 vlink=#405040><center>
    <h1>spider_search</h1>by sonofsamiam<table width="100%" border=1>
    <th bgcolor=#ff0000>url<th bgcolor=#ff0000>keyword<th bgcolor=#ff0000>count<th bgcolor=#ff0000>summary}
]
links: []
db: []
out: []
html-code: [
    copy tag ["<" thru ">"] (append tags tag) |
    copy txt to "<" (append text txt)
]

bosch: func [page url][
    tags: make block! 100
    text: make string! 8000
    parse page [to "<" some html-code]
    foreach tag tags [
        if parse tag ["<a" thru "href="
            [{"} copy link to {"} | copy link to ">"]
            to end
        ][append links link]
    ]
    foreach keyword keywords [
        c: 0
        a: text
        while [a: find/tail a keyword][c: c + 1]
        either (c = 0) [
            links: []
        ][
            insert/only db reduce [c url keyword copy/part text summary_size]
        ]
    ]
]

while [level <= deep][
    foreach url urls [bosch read url url]
    urls: links
    links: []
    level: level + 1
]

db: sort db
foreach x db [
    foreach [c u k t] x [
        insert out reduce [<tr> <td> u <td> k <td> c <td> t newline]
    ]
]
insert out format
append out [</table> </center> "thanx for using sonofsamiam's spider!"
    </body> </html>
]

write %spider.htm out
q
Final Notes
REBOL is genuinely platform-independent, and all versions work exactly the same, so scripts are quite portable.
This is just a small example to give you ideas on how to use rebol for robots.
Bots are nothing short of your most useful tool on the web, and everyone should take the time to at least understand how they work, even if you're too lazy to make your own. The web is moving more and more in the direction of automation.

keep it real...

"How does it feel? To be using a borrowed recycled drug they told you was real?"
-geek-core kings Don't Know