I noticed that all the bots in the bots section were written in Perl. I like Perl a lot, but when I just want to get something done I usually open up my REBOL interpreter. It's more portable, and generally needs less code to get stuff done. So here's a REBOL bot for a change of pace. One large drawback is that, because the language is new, there are few resources for learning it. The tutorials on the REBOL website are helpful, but don't go into enough detail for me. Another place to look is rebol.org, a new script repository. That should be enough to get you started, so I won't be going over all the basics here. The complete source code to the bot is at the end.
Oh, I know my programming style sucks. I probably should have just stuck the bosch function in the while loop, but it's clearer to me this way. Apologies to the confused ones.
Some people have shied away from REBOL because of its lack of regular expressions, but it can accomplish everything that regexes can w/ its parse rules. These are just another format for expression matching. There is a great tutorial on the REBOL site that shows how to use them. So that you can follow what goes on in this essay, here is a summary of the rules:
    |          - specify alternate rule
    [block]    - sub-rule grouping
    none       - match nothing (catch on no match)
    12         - repeat pattern 12 times
    1 12       - repeat pattern 1 to 12 times
    0 12       - repeat pattern 0 to 12 times
    some       - repeat one or more times
    any        - repeat zero or more times
    skip       - skip any character (or chars if repeat given)
    to         - advance input to the given string (or char)
    thru       - advance input thru the given string (or char)
    copy       - copy the next match sequence to a variable
    (paren)    - evaluate a REBOL expression
    word       - look-up value of a word
    word:      - mark the current input series position
    :word      - set the current input series position
    'word      - (reserved for dialecting support)
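To make a few of these rules concrete, here is a minimal sketch, assuming REBOL 2 (REBOL/Core). The sample string and the words found, target, digit, a and b are made up for illustration; none of them come from the spider. parse/all is used so that spaces count as ordinary characters.

```rebol
REBOL [Title: "parse rules sketch"]

; Hypothetical sample input, not taken from the spider.
sample: {<a href="page1.htm">one</a> <a href="page2.htm">two</a>}
found: copy []

; any + thru + copy + to + (paren): collect every quoted href value
parse/all sample [
    any [thru {href="} copy target to {"} (append found target)]
    to end
]
print mold found

; some + a charset: split "12-345" into its two runs of digits
digit: charset "0123456789"
print parse/all "12-345" [copy a some digit "-" copy b some digit]
print [a b]
```

If the rules match as intended, found ends up holding the two file names, the second parse returns true, and a and b hold the two digit runs.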
    ;these are the url(s) you want the spider to start the search on...
    urls: [
        http://www.foo.com
        http://www.ceb.org/bar.htm
    ]

    ;how many levels deep you want the spider to go...the # of links increases
    ;almost exponentially, so watch out!
    deep: 4

    ;a block of words to search for.
    keywords: ["fravia" "REBOL"]
    summary_size: 100
urls is a block of urls that you want to start your search on. deep is how many levels you want to search. For instance, if bar.htm and foo.com had links to 7 pages altogether, then those 7 pages would be the next level, and all the pages they link to would be the level after that.
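The level-by-level idea can be tried without touching the network. The following is only a sketch, assuming REBOL 2: graph is a hypothetical in-memory stand-in for real pages (a page name followed by the block of pages it links to), and current, next-level and max-level are illustrative names, not words from the spider.

```rebol
REBOL [Title: "levels sketch"]

; Hypothetical link graph: page name, then the pages it links to.
graph: [
    "start" ["a" "b"]
    "a"     ["c"]
    "b"     ["c" "d"]
    "c"     []
    "d"     []
]

current: copy ["start"]   ; level 0 is the starting block
level: 0
max-level: 2
visited: copy []          ; one entry per level, kept for inspection

while [level <= max-level] [
    next-level: copy []   ; fresh block for every level
    foreach page current [
        foreach link select graph page [append next-level link]
    ]
    print [level mold current]
    append/only visited current
    current: next-level
    level: level + 1
]
```

With this graph, level 0 is start, level 1 is a and b, and level 2 is c, c, d. The duplicate c shows why the number of pages grows so fast: nothing here remembers which pages were already visited.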
    level: 0
    format: reduce [<html> <head> <title> "Search results from " now/date
    {</title></head><body bgcolor=#C0C0C0 text=#001010 vlink=#405040><center>
    <h1>spider_search</h1>by sonofsamiam<table border=1>
    <th bgcolor=#ff0000>url<th bgcolor=#ff0000>keyword<th bgcolor=#ff0000>
    count<th bgcolor=#ff0000>summary}
    ]
    links: []    ;block to hold the links.
    db: []       ;block to store the database for sorting.
    out: []      ;block to store the outputted html
    ;this is the sample html parser off the rebol home page. works good for me!
    html-code: [
        copy tag ["<" thru ">"] (append tags tag) |
        copy txt to "<" (append text txt)
    ]
html-code is a parse rule. It looks for every < and puts all the tags in a block and all the text in a string. It doesn't handle weird html, w/ tags inside tags & stuff. It will still work fine, but the summary might be lacking.
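Here is a small sketch of what the rule does to a scrap of html, assuming REBOL 2. The html-code rule is the one from the essay; the sample string and the word ok are made up, and parse/all is my choice (the spider uses plain parse) so that spaces are kept as they are.

```rebol
REBOL [Title: "html-code sketch"]

tags: make block! 10
text: make string! 100

; same rule as in the essay
html-code: [
    copy tag ["<" thru ">"] (append tags tag) |
    copy txt to "<" (append text txt)
]

; Hypothetical sample page with some text after the last tag.
ok: parse/all {<b>hi</b> there<p>bye} [to "<" some html-code]

print mold tags
print mold text
print ok
```

Note that "bye" never reaches text: the second alternative only copies text that is followed by another <, so anything after the last tag is dropped and parse reports false because it did not reach the end of the input.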
    ;...Hieronymus Bosch...
    ;this function slurps the data from the page
    bosch: func [page url][
        tags: make block! 100
        text: make string! 8000
        parse page [to "<" some html-code]
        ;get the links
        foreach tag tags [
            if parse tag ["<a" thru "href="
                [{"} copy link to {"} | copy link to ">"]
                to end
            ][append links link]
        ]
        foreach keyword keywords [
            c: 0
            a: text
            while [a: find/tail a keyword][c: c + 1]
            either (c = 0) [
                ;no hits: drop the links (copy gives a fresh block on every call)
                links: copy []
            ][
                insert/only db reduce [c url keyword copy/part text summary_size]
            ]
        ]
    ]
Here is the function (bosch) that is called w/ the contents of each page. It grabs all the links, searches for the keywords, and then sticks the info into a database, stored in db.
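The keyword count is the part of bosch most worth trying on its own. This is a sketch only, assuming REBOL 2; count-hits is a hypothetical helper that wraps the same find/tail loop in a function w/ local words, and it is not part of the spider.

```rebol
REBOL [Title: "keyword count sketch"]

; count-hits is a hypothetical helper, not part of the spider.
count-hits: func [text [string!] keyword [string!] /local count pos] [
    count: 0
    pos: text
    ; find/tail returns the position just past a match, or none when
    ; there is no further match, which ends the loop
    while [pos: find/tail pos keyword] [count: count + 1]
    count
]

print count-hits "Rebol is fun. REBOL is small." "rebol"
print count-hits "nothing to see here" "fravia"
```

find ignores case unless the /case refinement is given, so the first call counts both spellings. The second call finds nothing and returns 0, which is the case where bosch skips the database entry.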
    ;!_!_!_this_is_where_it_starts_!_!_!
    while [level <= deep][
        foreach url urls [bosch read url url]
        urls: links
        links: copy []    ;copy, or every level would reuse the same literal block
        level: level + 1
    ]

    ;sort & format
    db: sort db
    foreach x db [
        foreach [c u k t] x [
            insert out reduce [<tr> <td> u <td> k <td> c <td> t newline]
        ]
    ]
    insert out format
    append out [</table> </center> "thanx for using sonofsamiam's spider!"
        </body> </html>
    ]

    ;write the html file
    write %spider.htm out
    q
I think all this is pretty self-explanatory. The while loop controls which pages it searches & the rest formats the data into an html page. Then out is written to a file and the interpreter exits.
It's speedy as hell, and it has helped my searching a huge amount. I put what I'm looking for into Altavista, and then search those results w/ my spider. You find info much more quickly and easily this way.
Here is a sample output page:
url | keyword | count | summary |
---|---|---|---|
rt_bot1.htm | fravia | 9 | rt_bot1.htm The HCUbot: a simple Web Retrieval Bot in Perl The HCUbot: a simple Web Retriev |
hunt_01a.htm | fravia | 8 | Hunting Lesson I _____________________________________________________________________ >>> |
Now, this robot is very simple. If it comes across a bad url, it can die. Also, the search is limited to single words, w/ no boolean operators or anything like that. What you should do is make it 'smart.' This isn't hard if you know a little REBOL. Consider using parse rules for the search string. Anyway, I hope you enjoy this little spider. I've had a lot of fun with it, and I'd be interested in any comments you have on it.
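One way to keep a bad url from killing the bot is to wrap the read in try. This is a minimal sketch, assuming REBOL 2; safe-read is a hypothetical helper, not something the spider already has, and the file name in the usage line is made up.

```rebol
REBOL [Title: "safe read sketch"]

; safe-read is a hypothetical helper: it returns the page contents,
; or none when the read fails for any reason.
safe-read: func [target [url! file!] /local result] [
    either error? result: try [read target] [
        none
    ][
        result
    ]
]

; A file that should not exist, so the read fails and none comes back.
print mold safe-read %no-such-page-here.htm

; In the spider's main loop it could be used like this (untested):
; foreach url urls [if page: safe-read url [bosch page url]]
```

Returning none keeps the calling code simple, since a plain if is enough to skip the dead page and carry on w/ the rest of the level.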
.~the full code~.
Here is the full source code. It's very small, as you can see; most of the space is taken up w/ html-formatting. REBOL is pretty efficient.
    REBOL[
        Title: "spider.r"
        Author: "sonofsamiam"
        Home: http://sonofsamiam.tsx.org/
        Date: 19-Sep-1999
        Purpose: {
            A helpful little web-indexing search bot.
            Outputs sorted & html-formatted.
        }
        Comment: {
            I curbed my usual programming style of cramming the entire script
            on 5 lines :p I figure most of the readers won't be especially
            familiar w/ rebol, so i went for clarity.
        }
    ]

    secure none

    urls: [
        http://www.rebol.com
        %rt_bot1.htm
        %hunt_01a.htm
    ]
    deep: 4
    keywords: ["fravia" "REBOL"]
    summary_size: 100

    level: 0
    format: reduce [<html> <head> <title> "Search results from " now/date
    {</title></head><body bgcolor=#C0C0C0 text=#001010 vlink=#405040><center>
    <h1>spider_search</h1>by sonofsamiam<table width="100%" border=1>
    <th bgcolor=#ff0000>url<th bgcolor=#ff0000>keyword<th bgcolor=#ff0000>
    count<th bgcolor=#ff0000>summary}
    ]
    links: []
    db: []
    out: []

    html-code: [
        copy tag ["<" thru ">"] (append tags tag) |
        copy txt to "<" (append text txt)
    ]

    bosch: func [page url][
        tags: make block! 100
        text: make string! 8000
        parse page [to "<" some html-code]
        foreach tag tags [
            if parse tag ["<a" thru "href="
                [{"} copy link to {"} | copy link to ">"]
                to end
            ][append links link]
        ]
        foreach keyword keywords [
            c: 0
            a: text
            while [a: find/tail a keyword][c: c + 1]
            either (c = 0) [
                ;copy gives a fresh block on every call
                links: copy []
            ][
                insert/only db reduce [c url keyword copy/part text summary_size]
            ]
        ]
    ]

    while [level <= deep][
        foreach url urls [bosch read url url]
        urls: links
        links: copy []    ;copy, or every level would reuse the same literal block
        level: level + 1
    ]

    db: sort db
    foreach x db [
        foreach [c u k t] x [
            insert out reduce [<tr> <td> u <td> k <td> c <td> t newline]
        ]
    ]
    insert out format
    append out [</table> </center> "thanx for using sonofsamiam's spider!"
        </body> </html>
    ]
    write %spider.htm out
    q
keep it real...