The web is uncharted and deep. The volume of easily located
information instantly accessible to a user is so massive as to be
incomprehensible. At the time I started writing this snippet of mine
(February 2000) there were supposed to
exist well over 1,500 million indexable pages, expanding
exponentially.
In December 1997 the web had roughly 320 million pages.
In February 1999, a series of studies concluded that the web size
was about 800 million web pages, with about 15 trillion bytes
of textual information (one byte equals roughly one
text character), and 180 million images: about 3
trillion bytes of data. Now (September 2000) we have already "passed" the
2,000 million page and the 600 million image marks.
NOTA BENE: I'm counting only the "publicly available"
information. All information behind firewalls, on local intranets, and
all password-protected information, which is
available only (ahem, in theory :-) by filling out search forms, is NOT
even included. Since one of the aims of this site of mine is to allow you to access
this kind of information as well, should you feel you need to,
the dimensions of what will be available to you are indeed more than staggering. Believe me:
the volume of information instantly accessible to you IS so massive as
to be (almost) incomprehensible.
Of course the above-mentioned data are just "the beginning":
the web continues to grow at an incredible pace. It
doubles in size in less than one year, with whole countries getting more and
more wired, some of them showing growth percentages well over 100%.
There are many scientific studies available on this development: to give you a rough idea, let's say
that at the beginning of September 2000 the web consists of - give or take a
few million here and there...
2,200,000,000 pages;
37,000,000,000,000 bytes of text;
600,000,000 images;
10,000,000,000,000 bytes of image data.
Every day more than 3 million new pages are added to the web, with more than
60,000,000,000 new bytes of text and well over
1,000,000 new images (with
16,000,000,000 new bytes of image data).
These rates are GROWING.
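To get a feeling for these figures, here is a small arithmetic sketch in Python. It uses only the numbers quoted above; the function names are mine, and the "constant daily rate" is a deliberately naive assumption, not a claim about how the web actually grows.

```python
# Figures quoted in the text for the beginning of September 2000.
PAGES = 2_200_000_000
TEXT_BYTES = 37_000_000_000_000
IMAGES = 600_000_000
IMAGE_BYTES = 10_000_000_000_000

# Daily additions quoted in the text (lower bounds: "more than ...").
NEW_PAGES_PER_DAY = 3_000_000
NEW_TEXT_BYTES_PER_DAY = 60_000_000_000


def average(total, count):
    """Plain arithmetic mean, returned as a float."""
    return total / count


def linear_days_to_double(current, added_per_day):
    """Days needed to double `current` IF the daily addition stayed constant
    (a naive assumption, used here only for comparison)."""
    return current / added_per_day


bytes_per_page = average(TEXT_BYTES, PAGES)            # roughly 16,800 bytes of text per page
bytes_per_image = average(IMAGE_BYTES, IMAGES)         # roughly 16,700 bytes per image
bytes_per_new_page = average(NEW_TEXT_BYTES_PER_DAY, NEW_PAGES_PER_DAY)
days_to_double = linear_days_to_double(PAGES, NEW_PAGES_PER_DAY)

print(f"text per page:      {bytes_per_page:,.0f} bytes")
print(f"data per image:     {bytes_per_image:,.0f} bytes")
print(f"text per new page:  {bytes_per_new_page:,.0f} bytes")
print(f"days to double at a constant daily rate: {days_to_double:,.0f}")
```

At a constant 3 million pages a day, doubling would take about two years, so a web that doubles in less than one year must be adding more pages every day than it did the day before.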
There are many sources of information on the deep web, and each of them deserves to
be searched using various techniques. Therefore, the first real
task is figuring out where to look.
You should also consider the fact that most
information CANNOT be found using the
'classical' search engines. The "largest" search engines cover (at best)
only a tiny part of the web. Moreover they DO NOT index the most interesting parts
of the web: they favour commercial over educational sites, US sites over European sites
and 'popular' sites (read: sites loved by the zombies) over relatively unknown
sites. Remember also that
each 'main' search engine has different strengths and weaknesses, and that it would be
nonsense always to use the same search engine (say AltaVista) to search for
any target.
The information mass is overwhelming. To avoid drowning in the vast sea,
anyone navigating the web needs to know how to find (and use) all available tools and techniques.
You'll have immediate rewards:
Knowledge is power. This is particularly true - as you'll soon notice - when
searching the web. You will even be able to turn the knowledge you have acquired
into money, if you so wish, losing however
the capacity to acquire further knowledge in this process.
The value of critical information is increasing,
and such information is
now - for the first time in the history of our race - available to everybody (even if the volume of commercial rubbish you'll have to wade through is increasing exponentially).
You will also soon realize that finding and using some online resources can be
quite fun and entertaining. Keeping 'on track' during your searches
will be at times quite difficult indeed. So be it:
there is much more to life than just knowledge and work :-)
The web is a quicksand
The web is uncharted and deep. Moreover the web is a quicksand: web pages are changed,
removed or shifted
continuously.
Changes may be minor, major, or total. According to various projects
striving to create archive snapshots of major
portions of the web, the average lifespan of a webpage is between one and
two months, which means that in the last day about:
40,000,000 pages and
10,000,000 images changed
Now you have the dimension of the problem. No algorithm, no amount of computer processing power, no
"battery" of ultra powerful supercomputers is capable of coping with this
tide of ever-shifting, exponentially rising data.
Effective searching requires new methods.
This site will try to explain some
of them to you.
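The daily churn figures quoted above follow from a simple division. The sketch below assumes a uniform turnover (every page is equally likely to change on any given day), which is my simplification and not something the archive projects state; the function names are likewise only illustrative.

```python
def daily_churn(total_items, lifespan_days):
    """Items changing per day if each item lasts `lifespan_days` on average
    and changes are spread evenly over time (simplifying assumption)."""
    return total_items / lifespan_days


def implied_lifespan(total_items, changed_per_day):
    """Average lifespan in days implied by a given daily churn."""
    return total_items / changed_per_day


PAGES = 2_200_000_000   # September 2000 estimate quoted earlier
IMAGES = 600_000_000

# "Between one and two months": bracket the churn with 30 and 60 days.
fastest = daily_churn(PAGES, 30)
slowest = daily_churn(PAGES, 60)
print(f"pages changing per day: {slowest:,.0f} to {fastest:,.0f}")

# Lifespans implied by the round figures given in the text.
print(f"40,000,000 pages/day  -> {implied_lifespan(PAGES, 40_000_000):.0f} days")
print(f"10,000,000 images/day -> {implied_lifespan(IMAGES, 10_000_000):.0f} days")
```

The round figures of 40 million pages and 10 million images a day correspond to an average lifespan of 55 and 60 days respectively, i.e. the slow end of the "one to two months" range.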
The major search engines have a great deal of trouble
indexing any significant portion of the web. The engine with the largest
coverage (Fast: supposedly a billion pages) only
indexes 25% of the entire web, and some of the other 'main' search
engines only cover a meagre 5% of it. It is a tremendous problem
to index such a vast number of pages. For instance, AltaVista
indexed 100 million pages in October 1997, only 140 million in
August 1999, only 250 million in January 2000 and only 350 million now:
the search engines obviously cannot keep up with
the growth of the web.
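A rough way to see the gap is to divide the AltaVista index sizes by the web-size estimates given earlier in this text. The pairing of dates below is only approximate and is my own assumption (October 1997 index against the December 1997 estimate, January 2000 index against the February 2000 estimate); the August 1999 figure is left out because no web-size estimate close to that date is quoted here.

```python
# (label, AltaVista index size, estimated web size), both in millions of pages.
# Index and web estimates are a month or two apart: an approximation.
SNAPSHOTS = [
    ("late 1997", 100, 320),
    ("early 2000", 250, 1500),
    ("September 2000", 350, 2200),
]


def coverage_percent(indexed, total):
    """Share of the estimated web held in the index, as a percentage."""
    return 100.0 * indexed / total


for label, indexed, total in SNAPSHOTS:
    print(f"{label:15s} {coverage_percent(indexed, total):5.1f}% of the web indexed")

# Growth factors between the first and the last snapshot.
index_growth = SNAPSHOTS[-1][1] / SNAPSHOTS[0][1]
web_growth = SNAPSHOTS[-1][2] / SNAPSHOTS[0][2]
print(f"index grew {index_growth:.1f}x, web grew {web_growth:.1f}x")
```

On these numbers the index more than tripled, yet its share of the web fell from roughly a third to roughly a sixth.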
The search engines re-index slowly, and have
their hands full just keeping their own databases "tidy", i.e. eliminating
all those annoying 404s for the pages that went missing between two
index-runs. Very seldom do they nowadays "follow the links" into uncharted
parts of the web. Moreover, the rate of updating can be extremely slow. A
good exercise is to
find an unlisted site and add it yourself to the main search engines. Go ahead and do it:
Google and
Northernlight will index a relatively significant part of
it in a couple of weeks, AltaVista will index (almost) only the pages you have
manually entered (without following any links), whereas Fast/Alltheweb won't
update its indexes
for many MONTHS. So you have an additional problem with the big search engines: stale sites
abound, new sites won't be listed.
Let's see:
Problem number ONE: More than 2,000
million web pages by
the time you read this.
Please remember that all these numbers
are underestimations: nobody really knows how many pages there are out there.
Problem number TWO: search
engines that do not cope. The "richest"
search engine at the moment
covers far less than a third of
the existing total. Search engines will boast that they 'selected'
a smaller number of pages, and in reality have visited many more (a dual number approach
now in vogue to cover search engines' shortcomings :-) The fact is that their coverage is
- even in the best and most optimistic hypothesis - meager.
Taking into account the fact that the
Internet is a "quicksand",
where sites disappear, move or are modified continuously, it is easy to
understand
that even the most powerful search engines are doomed: even
keeping their already immense databases updated is becoming a more and more
daunting task (hence the many
404 errors
when you make a search).
Conclusion: Other
techniques must be used to search
the web.
Other techniques must be used to search the web
The web is uncharted and deep. Other techniques must be used to search
it.
Of course you will first have to learn how to EFFECTIVELY use the existing
search engines. You'll discover that there are quite a lot of
(interesting) differences among the main ones. You'll also have to
understand
which specific algorithms have been
implemented by their programmers. Some algorithms, as you will see, involve
tracking how many sites link to a page and then increasing
the 'weight' of that page when
delivering result lists (Google & Infoseek); quite another kind of algorithm
weights pages depending on
the number of people clicking on a specific link inside the
resulting lists (HotBot). You'll also discover that
the real purpose of all these "free" search engines is - quite simply - to gather
data about you.
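The two weighting ideas just mentioned can be illustrated with a toy sketch. This is emphatically NOT the actual algorithm of any of the named engines, whose details are not given here: it only counts inbound links in one case and clicks in the other. The link graph, the click log and all the `.example` names are invented for illustration.

```python
from collections import Counter

# Hypothetical toy link graph: page -> pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "b.example"],
}

# Hypothetical click log: which result users clicked for some query.
CLICKS = ["b.example", "b.example", "c.example", "b.example", "a.example"]


def inbound_link_counts(links):
    """Count how many distinct pages link to each page."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # a linking page counts only once
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


def rank(candidates, weights):
    """Order candidates by descending weight; ties are broken alphabetically."""
    return sorted(candidates, key=lambda page: (-weights.get(page, 0), page))


candidates = ["a.example", "b.example", "c.example", "d.example"]
by_links = rank(candidates, inbound_link_counts(LINKS))
by_clicks = rank(candidates, Counter(CLICKS))

print("weighted by inbound links:", by_links)
print("weighted by clicks:       ", by_clicks)
```

Even on four pages the two orderings differ, which is one reason why the same query can return quite different result lists on different engines.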
Then
you will have to master some other useful search techniques, like combing and klebing.
Fravia's "three steps" in seeking information: searching, combing, klebing.
You search for the info yourself
You search for people that have already searched for that info (and are willingly giving away what they have gathered)
You lure people that have already searched for that info (and are hoarding their knowledge)
There is a lot more, of course: you will learn that
there are many "webs inside the web", and that some of them are incredibly valuable. You will
realize
that the 'form' of the web is a very curious Moebius knot, and that
some important loops are almost unlinked (yet at times very valuable), whereas there
are other loops of the huge knot that are heavily linked
(yet mostly worthless). Then you will have to learn how
to build and program your own bots (and how to reverse the bots and the algorithms used by
others' spiders :-). Then you will have to understand some half-forgotten old searching
lore. Finally you'll have to develop what I would call a "zen" feeling when
searching. There is a lot to learn.
The web is uncharted and deep. You are embarking here on a very
long voyage. At the end
you'll be what I call, for lack of a better definition, a good seeker,
and you will
probably be able to find anything you are
looking for on the web.
Psychological and visual aspects
The web is uncharted and deep. The number of parameters you should
take into account is quite vast. As funny as it may seem to non-experts,
there is a quite relevant 'psychological' aspect that should always be considered.
In fact
'what you are' heavily influences your chances of success when you search. In simpler
words: zombies won't find nutting,
no matter how perfect the search engines they (could) use will be. In fact
the 'mode' of a person's affective behaviour determines
his own perseverance and
attention to detail, and therefore heavily influences
his search strategies and his satisfaction and stress levels, which as you'll see below are
far from being irrelevant when searching.
Note also that contrary to common
wisdom, the information on the web is NOT really disorganized (if it really were, we wouldn't be able to
find it, duh): it is organized along patterns
that you will slowly discern (search engines being only the most commonly known of them). Some
patterns are format-specific (pdf files, for instance), others are related to the
paths you are following in order to find the information (usenet for instance).
Also note that "a visual
presentation" of your search results can be utterly
fascinating (even if scientifically
debatable).
See: some people are 'field-dependent' and need a lot of context
or they get lost. Others are very visual, and prefer spatial
organization of information. If - in order to understand what I mean - you want to see a "visualised" presentation
of the results of a query, you may like to visit
Cartia (http://www.cartia.com/),
and
newsmaps (http://www.newsmaps.com/), which offer
examples of clustered visual displays of retrieved
information like the [following
snapshot of
"world news"] (taken on 8 March 2000).
Note that the visualization of search results
can be understood in two different ways.
One is simply to use graphical elements carefully in displaying
results of searches, the other is attempting to display the
search results graphically, in two or three dimensions,
grouped by topic or category. Both approaches are meant
to take advantage of the human capacity to process visual
information quickly and efficiently. Good visualization
should integrate the natural and technical world, use
natural intuition, spatial cues and perception.
It is probably also useful, going back to the 'psychological' aspects of
searching, to distinguish among different groups. A searcher may search
better (or worse) depending on his experience in the particular
topical area, on his age and cognitive style, on his technical aptitude and
on his specific personality type. Don't underestimate the importance
of the emotional (and motivational) involvement of the searcher: when seeking there is
(luckily :-) a fairly
DIRECT relationship between a greedy & provincial attitude
and search failures, AND
between an altruistic,
non-petty attitude
and success in the query. Even anxiety plays a role when searching,
believe it or not:
stress levels diminish when searchers find results, duh.
On the other hand the level of stress increases the longer
people search, wading through the noise, and the more web sites they have
visited without approaching
the signal. I hope you'll be able to diminish your stress levels after having read the
info provided on my site :-)
Few people know how to search, and even fewer know how to find
what they have searched for
~S~ +Alistair
How to proceed
This site of mine is fairly didactic. Hence you'll find an "introductory" section, that could
be of some use before entering the "basic" section, that I advise you
to peruse "in depth". Then you'll find my "advanced" section, where various
techniques (like combing and klebing) will be examined. The advanced section should also
help those among you that wish to
"specialise" in one of the many seekers' disciplines. Then there is a "classroom"
section, with a
series of "in fieri" labs and classes. As a "side section" you'll be able to visit
the "other stuff"
section, where I have crammed all sorts of knowledge I find important for a seeker,
even if not immediately seeking-related. Finally you'll have - of course - to proceed on
your own,
leaving my site through the
"farewell" section.
I sincerely hope you'll learn enough to be able to contribute
with your own essays, observations, searchstrings, tricks & hints.
Standing on each other's shoulders we'll reach new heights.
A word about images: you'll notice that while most pages
load very quickly because there are almost NO pictures
on them (the contents, not the frills, make
this site the knowledge treasure it is intended to be) I have chosen
some 'heavy loading' images for the main sections. You are well advised to
ditch your iexplorer or netscape and switch to Opera, a fantastically
quick browser (only a million bytes) that will give you the option to
AVOID loading images -on the fly- whenever you decide. It is difficult
to overstate how important this is
on today's commercially spammed web. A whole section of my site is dedicated to
the choice of the best browser, the sine qua non tool for any seeker.
There are many pages on my site... you are not compelled to follow
any logical path. You may peruse everything at
will, you are welcome. Do not be scared or paralysed if you don't understand
everything immediately: knowledge is like one of the chilled white wines
bottled in the old lagoons I come from. You should sip it slowly
and knowingly, else it won't do you any good.
And now, after having read all the above, if you decide to start, go back to the
[entrance] and click on the main logo.