Since I personally don't like such people at all (and since I can't
punish them in other ways - at least for now : "A searcher could
develop hisself into a dangerous fellow, if he needs to" - Fravia+), I
have decided to write some essays about these interesting
lores. I hope that as many seekers as
possible will work on the tricks I have found, improve them, and eventually put
the commercial bastards out of business, because the
*knowledge* they are trying to sell or use will be free for anyone...
Nota bene: you have to work on these findings on your own. Unfortunately, I don't
have the time (nor some other things) to work all day on them. Consider this
small essay a 'starting path'.
Introduction
What are the primary tools all searchers use? The search
engines. Quite logical.
How much of the Web do these search engines index? Nobody knows
precisely (most estimates range between 25 and 40%). But the fact is that no search engine covers the whole Web.
Consequently, provided you know what it means, selecting your
search engine according to its size could be a good idea.
In fact, the size of a search engine is important for various reasons. First, the
more sites there are, the more results you are supposed to get.
Second, the more sites there are, the better your chances of finding
interesting results with a complex query. Finally, the more sites
there are, the more advanced the interface to them (that is, the
search engine's query possibilities) will be (you can't index 300+ million sites
with the same search options as 50+ million).
Nevertheless, do not forget some old truths. The size of a search engine has
little or nothing to do with the number of unique results it gives.
THIS point is one of the most important for us (it will be dealt with in
another essay). Besides, the size of a search engine is directly
linked to the number of 404 errors it gives: more sites to manage
necessarily means more problems with the quicksand Web.
Let's go
All this may be quite interesting, but how do we find the sizes of
our search engines? I mean, without relying on commercial
estimates, which are most of the time completely useless for us.
The answer is: with our brains. Actually, why couldn't a searchable index of
sites display all the sites it has in memory (provided it
doesn't have timeout limitations or the like)? That would be
illogical...
With this observation in mind, I've played a little with some big
search engines and have found interesting results that I share here. Enjoy,
and work on them.
Northernlight attack
As you have probably seen elsewhere on Fravia's site, it is
indeed possible to get the entire size of Northernlight. Just run
this query:
http://www.northernlight.com/nlquery.fcg?cb=0&qr=search+or+not+search&orl=2%3A1
and you will see how many URLs
Northernlight has today.
I will give some explanations about that trick.
OR and NOT are boolean operators; they apply a
kind of logic to the terms they relate.
For example, the query "search OR seek" will give you all the
pages of an index which have the word search OR the word seek in
them. Of course, they could also have both words.
Likewise, the query "NOT search" will give you all the
pages of an index which do NOT have the word search in them.
You should be close to two conclusions.
The first: ORing a search broadens it, NOTing a search narrows it.
The second: given that the query "search" gives all the pages with
the word search in them and that the query "NOT search" gives all
the pages that do NOT have the word search in them, ORing these
two queries will give the entire size of a database.
This second conclusion is used with Northernlight.
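The second conclusion can be checked on a toy index. What follows is only a minimal sketch in Python, assuming a made-up three-page index (the page names, the texts and the function names are purely illustrative and come from no real engine): a set ORed with its own complement always gives back the whole index.

```python
# Toy index: page name -> text. Entirely hypothetical data.
INDEX = {
    "page1": "how to search the web",
    "page2": "seek and you shall find",
    "page3": "nothing relevant here",
}

def pages_with(word):
    """Pages whose text contains the word."""
    return {name for name, text in INDEX.items() if word in text.split()}

def pages_without(word):
    """Pages whose text does NOT contain the word (the one-term NOT)."""
    return set(INDEX) - pages_with(word)

def whole_index_via_or_not(word):
    """'word OR NOT word': the union of a set and its complement."""
    return pages_with(word) | pages_without(word)
```

Whatever word you choose, `whole_index_via_or_not(word)` returns every page of the toy index, which is exactly why the word itself does not matter in the real query.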
As soon as a search engine supports the OR and NOT operators, try this
kind of query. Nevertheless, the search engine in question must fully support
the NOT operator. That is, not just the "NOT between two terms",
but the NOT with only one term.
Infoseek attack
Infoseek is also kind enough to allow us to get its full size
quickly. Just run this query:
http://infoseek.go.com/Titles?qt=url%3Ahttp&svx=home_searchbox&sv=IS&lk=noframes&nh=25
As you will notice, here I have used the URL search.
Actually, Infoseek allows searchers to search for words in the
URLs of the indexed documents.
For example, if you want to find all the indexed sites with "search"
in their URLs, just run a URL search on "search" and enjoy
the results.
One thing should now click in your mind.
Given that most search engines index Web sites, ALL the URLs in their
database will begin with "http".
So, a query like "url:http" should give us all the sites whose address
begins with "http"... In other words, this query will give us the
full index of a search engine.
This trick is used with Infoseek.
Of course, it could also be used with Northernlight; see below:
http://www.northernlight.com/nlquery.fcg?dx=1004&qr=&qt=&pu=&qu=http
&si=&la=All&qc=All&d1=&d2=&rv=1&search.x=43&searchy=10#
As soon as a search engine allows a URL search, try it to see what
happens to the number of sites returned. In fact, some big search engines
have limited their URL search to the part AFTER the "//" of "http://".
There, the trick used here won't be of any use. But there are other
possibilities, many others....
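The difference between the two kinds of URL search is easy to see on a toy example. This is a minimal sketch in Python, assuming a hypothetical list of three indexed addresses (the addresses and the `strip_scheme` switch are invented for illustration, not taken from any engine):

```python
# Hypothetical list of indexed addresses.
URLS = [
    "http://www.example.com/search.html",
    "http://www.example.org/index.html",
    "http://httpd.example.net/docs/",
]

def url_search(term, strip_scheme=False):
    """Return the URLs containing term.

    With strip_scheme=True, only the part AFTER the "//" is searched,
    which is how some engines restrict their URL search.
    """
    hits = []
    for url in URLS:
        haystack = url.split("//", 1)[1] if strip_scheme else url
        if term in haystack:
            hits.append(url)
    return hits
```

On this toy list, `url_search("http")` returns all three addresses, while `url_search("http", strip_scheme=True)` returns only the one that happens to have "http" in its host name.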
FAST attack
FAST claims to have 300+ million sites. I'm happy to say they are
right :-)
Like Infoseek or Northernlight, FAST is a good search engine for size study.
Try this query, and tell me whether you like it or not:
http://www.ussc.alltheweb.com/cgi-bin/advsearch?terms=3&type=any&query=&lang=any&A1=&
B1=http&C1=url.all%3A&A2=&B2=&C2=&A3=&B3=&C3=&dincl=&dexcl=&hits=10&exec=FAST+Search
FAST is a special search engine in the sense that it doesn't
support keyword searching. Nor boolean, for that matter. You are compelled to
use a form in order to search effectively. But we don't care: new
experiences are searchers' everyday bread...
One of FAST's advanced options is to set word filters for our
queries. What is interesting is that you can define the elements of
a page which have to contain some desired words: title, text,
links, and... URLs.
Consequently, if we just specify that the pages we want must
contain the word "http" in their URLs, the results should be all the
sites of the database. And lo! This happens.
As with the URL search, if you want the size of a search engine that
allows you to filter its results by fields, you should play a little with
the URL filter "http". Note, however, that very few search engines allow this
kind of filtering.
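For those who would rather assemble the form query in a script than fill in the form by hand, here is a minimal Python sketch that rebuilds the FAST URL given above from its parameters. The parameter names and values are copied from that URL; the reading of the fields (B1 as the word, C1 as the page element to filter on) is only an assumption drawn from the URL itself, and the function name is made up.

```python
from urllib.parse import urlencode

# Base address copied from the FAST query above.
FAST_ADVSEARCH = "http://www.ussc.alltheweb.com/cgi-bin/advsearch"

def fast_url_filter_query(word="http"):
    """Rebuild the advanced-search URL with one word filter on the URL field.

    Assumption: B1 carries the word and C1 the page element ("url.all:").
    """
    params = [
        ("terms", "3"), ("type", "any"), ("query", ""), ("lang", "any"),
        ("A1", ""), ("B1", word), ("C1", "url.all:"),
        ("A2", ""), ("B2", ""), ("C2", ""),
        ("A3", ""), ("B3", ""), ("C3", ""),
        ("dincl", ""), ("dexcl", ""), ("hits", "10"),
        ("exec", "FAST Search"),
    ]
    # urlencode takes care of the escaping (":" -> %3A, space -> +).
    return FAST_ADVSEARCH + "?" + urlencode(params)
```

Keeping the parameters in a list of pairs preserves their order, so the result can be compared character by character with the URL of the form.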
Lycos attack
Actually, I do not know if the results of this attack really mean what I believe.
Just click here:
http://lycospro.lycos.com/srchpro/?aloc=sb_init_field&first=1&lpv=1&
type=advwebsites&query=&t=all&qt=&qu=http&qh=&x=26&y=5
and you will notice that the number of results is EXACTLY the
same as the FAST one.
Consequently, I begin to believe that these two share a common database (I may be
wrong of course, awaiting your own conclusions).
Hotbot attack
Hotbot is also a special case, because it belongs to Lycos. So, I
think it shares the Lycos/FAST database as well. Nevertheless,
finding evidence of that is quite *difficult*, because the options in
Hotbot are special.
Here, neither URL search, nor refining, nor the NOT trick can be used.
We have to think.
Let's peruse the help. We see that some field searches are
allowed, among other oddities: depth, within, before, after,...
After trying a little, I've come to the conclusion that the query:
http://hotbot.lycos.com/text/default.asp?MT=depth%3A4+feature%3Aacrobat
+feature%3Aapplet+feature%3Aactivex+feature%3Aaudio+feature%3Aembed+feature%3Aflash
+feature%3Aform+feature%3Aframe+feature%3Aimage+feature%3Ascript+feature%3Ashockwave
+feature%3Atable+feature%3Avideo+feature%3Avrml&search=SEARCH&SM=B&DV=0&LG=any&DC=10&DE=2
is the one which
gives the most results with Hotbot.
The trick is to OR all the possible search options, in order to broaden
as much as possible the number of results.
The only problem is that the number returned is not the size of the
entire database, but only of a big part of it. So, it's just a more or
less precise estimation.
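Why this gives only a part of the database can be seen on a toy example. A minimal sketch in Python, assuming a made-up set of pages tagged with the features they contain (none of this data comes from Hotbot, and the function names are invented): a page with none of the listed features is never matched, so ORing all the features can only give a lower bound.

```python
# Hypothetical pages and the "features" each one contains.
PAGES = {
    "a.html": {"image", "table"},
    "b.html": {"script"},
    "c.html": set(),          # plain text page: no feature at all
    "d.html": {"image"},
}

def or_all(features):
    """Pages matching at least one of the given features (an OR query)."""
    wanted = set(features)
    return {page for page, feats in PAGES.items() if feats & wanted}

def coverage(features):
    """(pages matched by the OR query, true size of the toy index)."""
    return len(or_all(features)), len(PAGES)
```

Here `coverage(["image", "table", "script"])` matches three pages out of four: the plain page stays invisible, however many features are ORed together.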
By the way, you will notice that there are more sites in the Hotbot
database than in the FAST/Lycos one. This shouldn't be so, because, as
said, Hotbot BELONGS to Lycos.
The explanation might be the following: Hotbot was bought by
Lycos, but before that, it was an independent search engine, with its own index.
So, I think that now, Hotbot uses the FAST/Lycos database in
addition to its previous one.
Yahoo attack
I won't even bother trying to list all the URLs at Yahoo,
because it's already done. Actually, for each category, Yahoo
displays the number of sites it has recorded. So, with some simple
maths (additions, that is) you can easily find the global size of the Yahoo
database.
Of course, you don't have to display all the pages each time you
want to count the number of sites. Just build your own "Yahoo size
fetcher" bot and let it work for you. The section about
bot-building could be an interesting starting point for this. Keep
in mind that the number of reported sites is always enclosed in parentheses; this
will make things easy for your script.
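The counting part of such a bot could look like the following minimal Python sketch. It assumes the category page has already been fetched and reduced to text; the sample text, the category names and the numbers are invented for illustration, and real pages may well need a stricter pattern (for instance to skip other parenthesized numbers).

```python
import re

# A count in parentheses, e.g. "(1204)" or "(1,204)".
COUNT_IN_PARENS = re.compile(r"\((\d[\d,]*)\)")

def category_counts(page_text):
    """Return every parenthesized number found in the text, as integers."""
    return [int(m.replace(",", "")) for m in COUNT_IN_PARENS.findall(page_text)]

def total_sites(page_text):
    """Add up all the reported counts of a page."""
    return sum(category_counts(page_text))

# Hypothetical excerpt of a directory page.
SAMPLE = "Arts (1204)\nBusiness (88)\nScience (310)"
```

With the sample above, `total_sites(SAMPLE)` gives 1602. A full bot would still have to fetch the pages and avoid counting a subcategory twice when it is already included in its parent's number.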
Altavista attack
This one is hard, due to Alta's damned "timeout" limit.
Given that Altavista supports the NOT search, let's try this query:
http://www.altavista.com/cgi-bin/query?hl=on&q=%28search+OR+NOT+search%29&
search=Search&r=&kl=XX&stype=&pg=aq&text=yes&d0=&d1=
It's
impossible!! Altavista's database cannot be so small.
Given that Altavista supports the URL search, let's try this query:
http://www.altavista.com/cgi-bin/query?hl=on&q=url%3Ahttp&
search=Search&r=&kl=XX&stype=&pg=aq&text=yes&d0=&d1=
Impossible!! Something must be
wrong.
Actually, Altavista has a timeout. That is, it doesn't scan its whole
database. Consequently, we can never get all the sites it has indexed
this way.
But, we can play a little with the Date filtering. Remember:
filtering can fetch incredible results if used correctly.
After some tries, I've discovered these two complementary queries:
http://www.altavista.com/cgi-bin/query?hl=on&q=search+OR+NOT+search&
search=Search&r=&kl=XX&stype=&pg=aq&text=yes&d0=01%2F01%2F70&d1=01%2F01%2F00
and
http://www.altavista.com/cgi-bin/query?hl=on&q=search+OR+NOT+search&
search=Search&r=&kl=XX&stype=&pg=aq&text=yes&d0=01%2F01%2F00&d1=
The sum of the two results still gives something strange.
Provided the engineers at Altavista are not totally incompetent with
their Date system, all the estimates I have seen until now of
this search engine's size seem to be wrong:
Altavista seems to have 430+ million sites indexed !!
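The idea behind the two complementary queries can be sketched on a toy model. This is only a minimal Python sketch under loud assumptions: the documents and their dates are invented, and the timeout is modelled as a simple cap on the count a single query may report, which is a guess and not Altavista's real behaviour. Note that the slices here are half-open (start included, end excluded); if a real engine includes both ends of a date range, pages dated exactly on the boundary would be counted twice.

```python
from datetime import date

# Hypothetical documents, reduced to their dates.
DOC_DATES = [date(1996, 5, 1), date(1998, 3, 14), date(1999, 12, 31),
             date(2000, 1, 1), date(2000, 2, 20)]
CAP = 3  # pretend timeout: one query never reports more than CAP hits

def count(start=None, end=None):
    """Hits with start <= date < end, truncated at CAP (None = no limit)."""
    n = sum(1 for d in DOC_DATES
            if (start is None or d >= start) and (end is None or d < end))
    return min(n, CAP)

def sliced_total(boundaries):
    """Sum the counts of the consecutive date slices cut at the boundaries."""
    edges = [None] + list(boundaries) + [None]
    return sum(count(a, b) for a, b in zip(edges, edges[1:]))
```

In this model the unfiltered `count()` reports only 3 of the 5 documents, while `sliced_total([date(2000, 1, 1)])` recovers all 5, because each slice stays under the cap.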
Other search engine attack
I hope you will find more tips and tricks on your own. Each search engine
can be "reversed" differently. It's up to you to find the "magic
queries"!!
Conclusion
You now have enough material to work on this "search engine size" stuff. The principle is
simple: use queries that return as many results as possible.
You could also think about writing a "search engine size survey" bot. In fact,
running all these queries by hand each time you want to compare the search engine
sizes would be insane.
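The heart of such a survey bot is pulling the reported number out of each results page. Here is a minimal Python sketch of that step, assuming the pages have already been fetched as text and that the count appears as a number followed by a word like "results" or "pages". Both the wording and the function names are assumptions: every engine phrases its count differently, so expect to write one pattern per engine.

```python
import re

# A number (with optional , or . separators) followed by a counting word.
COUNT_RE = re.compile(
    r"(\d[\d,.]*\d|\d)\s+(?:results|pages|documents|items)", re.I)

def extract_count(page_text):
    """Return the first reported count as an int, or None if none is found."""
    m = COUNT_RE.search(page_text)
    if m is None:
        return None
    return int(re.sub(r"[,.]", "", m.group(1)))

def survey(pages):
    """pages: engine name -> already-fetched result page text."""
    return {name: extract_count(text) for name, text in pages.items()}
```

For example, `extract_count("Found 1,234,567 pages")` yields 1234567, and a page without any recognizable count yields None, so the bot can flag an engine whose layout has changed.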
Should you want to send me your additions / criticisms / anything
else, feel free to write to [email protected]. (note for
Fravia+: please do not "protect" this email
address of mine: I'm awaiting spam bots with some tools... :-)
Rumsteack, from France